Yeah, but could there ever be enough for replication?
I'm all about open data - I really need to make that point because I'm sure I come off as critical.
We had a great opportunity to discuss some of this at the session at the NC Science Blogging Conference (1/19). Xan mentioned that the data are rarely freely available, if available at all, and that this is a problem because the data can be used for quality control and to support new analyses and visualizations. Of course the problems of format can be overcome, and people are working hard on storage problems.
The discussion bit centered on getting scientists to post their data. Bill thought that Congress should mandate it -- to which I reacted really negatively. Holy cow, look at the backlash against making the published stuff available! The scooping issue was brought up (and also in the comments on the Scientific American article
and in  - the d'oh! moment), too, but all of these overlook a bigger problem.
My points in this post are:
- that it is very difficult, if not impossible, to replicate many tricky experiments without a lot more information than is found in a journal article (not the data, but meta info)
- hands-on transfer of tacit knowledge may be required ( says this nicely, but I don't buy a whole lot of other things in  so do not read this as an endorsement of the article)
- it is very difficult to retrieve datasets and then reuse them because it's nearly impossible to capture enough data so that they are useful (this has been overcome in some large astro datasets, but they seem different because the instrument is shared and very well documented - please correct me if this is wrong)
- it's difficult to trust someone else's dataset if you don't know their work
- if data sets are embargoed until the PI wrings all of the papers out of them that s/he can, are they still relevant to other researchers?
- at what level would the data be kept? straight from the sensors/instruments? unpacked/processed? graphed, fused, analyzed?
Certainly meta-analyses happen all of the time, and some of these use unpublished data (which gets interesting when it shows the bias from publishing only large, strong positive results). There are also a few papers reporting really paltry response rates when authors are asked to provide a pre/post/e/off-print of an article (I want to say like 30%).
If this can be done at all there are certain things that might make it more likely:
- like with gene databases, work bottom up: you must submit your stuff to one of the approved repositories if you want to be published in our journal or be a member of our professional society
- repositories have to be disciplinary -- or really smaller than that -- at the research area level so that the metadata can be tailored to the community's needs. Tons of money has gone into information retrieval from gene databases, and some of these other databases might be pretty complicated, too.
- data would have to link to author or research group and articles that used the data
- data would have to be useful (like Jean-Claude's spectra instead of just PDFs of spectra) and stay readable: not in a proprietary format or, if it is, it needs to be migrated to new versions.
- search would have to make sense
- funding and preservation would have to be steady and long term. It's fine if the data sets are free to retrieve, but then where does the overhead get kept?
- send a ping back to the original author when his/her data is being reused
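To make that wish list a bit more concrete, here is a minimal sketch of what one repository record might carry. Everything in it is my own assumption for illustration: the class name `DatasetRecord`, the field names, the three processing levels (borrowed from my question above about what level the data would be kept at), and the "ping" being just a text message. It is not how any real repository works.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProcessingLevel(Enum):
    """Hypothetical levels at which a dataset could be deposited."""
    RAW = "straight from the sensors/instruments"
    PROCESSED = "unpacked/processed"
    ANALYZED = "graphed, fused, analyzed"


@dataclass
class DatasetRecord:
    """Illustrative record: links data to people, articles, and format."""
    dataset_id: str
    title: str
    authors: List[str]
    research_group: str
    level: ProcessingLevel
    file_format: str  # ideally an open, non-proprietary format
    article_ids: List[str] = field(default_factory=list)
    reuse_log: List[str] = field(default_factory=list)

    def link_article(self, article_id: str) -> None:
        """Attach an article that used this dataset (no duplicates)."""
        if article_id not in self.article_ids:
            self.article_ids.append(article_id)

    def record_reuse(self, reuser: str) -> str:
        """Log a reuse and return the ping text for the original authors."""
        self.reuse_log.append(reuser)
        recipients = ", ".join(self.authors)
        return f"To {recipients}: {reuser} is reusing dataset {self.dataset_id}."


# Made-up example values, purely for illustration.
record = DatasetRecord(
    dataset_id="example-0001",
    title="Example spectra",
    authors=["A. Author"],
    research_group="Example Lab",
    level=ProcessingLevel.PROCESSED,
    file_format="csv",
)
record.link_article("example-article-1")
print(record.record_reuse("B. Reader"))
```

The community-specific metadata is exactly the part this sketch leaves out, and that is the hard part.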
I am so totally not an expert on any of these issues but I seem to be sitting on an inconvenient boundary between the two areas so I feel bound to try to translate a bit, however poorly.
(actually, I just remembered some librarian discussions about the difficulty of data supplements in astro journals... hm, maybe that did get figured out with online? maybe not? still hard to find?)
Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: Supporting sharing in science and engineering. GROUP '03: Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work, Sanibel Island, Florida, USA. 339-348. DOI: http://doi.acm.org/10.1145/958160.958215
Shapin, S. (1995). Here and everywhere: Sociology of scientific knowledge. Annual Review of Sociology, 21(1), 289-321.
1) discussion wrt the Hwang case, peer review, and cloning... replication is hard and expensive, but people still want to look at the data. See: http://blogs.nature.com/reports/theniche/2007/06/how_can_journals_improve_peer.html
(so the upshot is yes, make data available, but not necessarily for the purpose of replication -- which I think everybody pretty much goes along with)
2) new research on peer review: reviewers think it's generally a good idea to review the data but are not terribly excited about doing it themselves. See the summary paper, toward the end: http://www.publishingresearch.net/PeerReview.htm