Christina's LIS Rant: Yeah, but could there ever be enough for replication?

Tuesday, January 22, 2008

I'm all about open data - I really need to make that point because I'm sure I come off as critical.

We had a great opportunity to discuss some of this at the session at the NC Science Blogging Conference (1/19). Xan mentioned that the data are rarely freely available, if available at all, and that this is a problem because the data can be used for quality control and to support new analyses and visualizations. Of course the problems of format can be overcome, and people are working hard on storage problems.

The discussion centered on getting scientists to post their data. Bill thought that Congress should mandate it, to which I reacted really negatively. Holy cow, look at the backlash for making the published stuff available! The scooping issue was brought up too (and also in the comments on the Scientific American article and in [1] - the d'oh! moment), but all of these overlook a bigger problem.

My points in this post are:
- that it is very difficult, if not impossible, to replicate many tricky experiments without a lot more information than is found in a journal article (not the data, but meta info)
- hands-on transfer of tacit knowledge may be required ([2] says this nicely, but I don't buy a whole lot of other things in [2], so do not read this as an endorsement of the article)
- it is very difficult to retrieve datasets and then reuse them because it's nearly impossible to capture enough data so that they are useful (this has been overcome in some large astro datasets, but they seem different because the instrument is shared and very well documented - please correct me if this is wrong)
- it's difficult to trust someone else's dataset if you don't know their work
- if data sets are embargoed until the PI wrings all of the papers out of them that s/he can, are they still relevant to other researchers?
- at what level would the data be kept? straight from the sensors/instruments? unpacked/processed? graphed, fused, analyzed?

Certainly meta-analyses happen all of the time, and some of these use unpublished data (which becomes interesting when it shows the bias from publishing only large, strong positive results). There are also a few papers reporting really paltry response rates when authors are asked to provide a pre/post/e/off-print of an article (I want to say like 30%).

If this can be done at all, there are certain things that might make it more likely:
- bottom up, like with gene databases: you must submit your stuff to one of the approved repositories if you want to be published in our journal or be a member of our professional society
- repositories have to be disciplinary -- or really smaller than that -- at the research area level so that the metadata can be tailored to the community's needs. Tons of money has gone into information retrieval from gene databases, and some of these other databases might be pretty complicated, too.
- data would have to link to author or research group and articles that used the data
- data would have to be useful (like Jean-Claude's spectra instead of just PDFs of spectra) and stay readable - not in a proprietary format or, if so, migrated to new versions as they come out.
- search would have to make sense
- funding and preservation would have to be steady and long term. It's fine if the data sets are free to retrieve, but then where does the money for overhead come from?
- send a ping back to the original author when his/her data is being reused
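A rough sketch of what a repository record might look like if it tried to cover a few items on this wish list (links to authors and to articles that used the data, a check for a non-proprietary format, and a ping back to the original authors on reuse). Everything here is an assumption for illustration: the class name, the field names, the list of "open" formats, and the `notify` callback are hypothetical and do not describe any existing repository.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed examples of non-proprietary formats; a real community would pick its own.
OPEN_FORMATS = {"csv", "json", "jcamp-dx", "fits"}


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for one deposited dataset."""
    identifier: str
    research_area: str          # repositories tailored at the research-area level
    authors: List[str]          # link back to the author or research group
    data_format: str
    processing_level: str       # e.g. "raw", "processed", "analyzed"
    cited_by: List[str] = field(default_factory=list)  # articles that used the data

    def is_open_format(self) -> bool:
        # True when the declared format is on the assumed open-format list.
        return self.data_format.lower() in OPEN_FORMATS

    def register_reuse(self, article_id: str,
                       notify: Callable[[str, str], None]) -> None:
        # Record the reusing article, then ping every original author.
        self.cited_by.append(article_id)
        for author in self.authors:
            notify(author, f"Dataset {self.identifier} was reused in {article_id}")
```

In use, `notify` could be anything that delivers a message (email, a feed entry); passing a function that appends to a list is enough to try it out. Note that none of this captures the tacit, hands-on knowledge discussed above; it only covers the bookkeeping.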

I am so totally not an expert on any of these issues but I seem to be sitting on an inconvenient boundary between the two areas so I feel bound to try to translate a bit, however poorly.

(actually, I just remembered some librarian discussions about the difficulty of data supplements in astro journals... hm, maybe that did get figured out with online? maybe not? still hard to find?)

[1] Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: Supporting sharing in science and engineering. GROUP '03: Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work, Sanibel Island, Florida, USA. 339-348. DOI: http://doi.acm.org/10.1145/958160.958215
[2] Shapin, S. (1995). Here and everywhere: Sociology of scientific knowledge. Annual Review of Sociology, 21(1), 289-321.

Update:
1) Discussion of the Hwang case, peer review, and cloning: replication is hard and expensive, but people still want to look at the data. See: http://blogs.nature.com/reports/theniche/2007/06/how_can_journals_improve_peer.html
(so the upshot is yes, make data available, but not necessarily for the purpose of replication -- which I think everybody pretty much goes along with)
2) New research on peer review: reviewers think it's generally a good idea to review the data but are not terribly excited about doing it themselves. See the summary paper, toward the end: http://www.publishingresearch.net/PeerReview.htm

Labels: open data, open science

¶ 10:02 PM

Comments:

"Bill thought that Congress should mandate it"

Fair go. I asked whether Congress should mandate it. Consensus seemed to be that we haven't given the "bottom-up" approach enough of a try yet -- unlike OA, where that approach fell flat on its face and has been replaced by a mandate.

I agree that in general, the fewer mandates the better. Congress needs (or should need) a compelling reason (a State interest) to make a new law. Also, there are some examples of research communities making Open Data work (sequence data, some work in array data, cheminformatics, and so on). But, having said all that, I am not confident that open data sharing will ever become standard operating procedure without some coercion. After all, there was an example of a community making OA work for ten years (arXiv), and it didn't help the NIH policy get beyond 5% until they made that a mandate.

I think you are also a bit blinkered when it comes to data -- it's not all about physics. I don't see the relevance of your point about difficulty of replication. With appropriate standards in place, it's easy to retrieve and reuse lots of biological datasets. It's not difficult to trust a dataset -- it's more difficult to trust someone who won't give you that dataset! If datasets are embargoed until the PI thinks she is done, that's still useful -- the PI doesn't know everything, there will be answers in there to questions that she hasn't even thought to ask.

# posted by Bill Hooker : 1/25/2008 11:53 PM

I am no doubt blinkered when it comes to data - I really don't know much at all about biological datasets. In some other areas there's a lot of tweaking (or the fancy term bricolage is used) to get the bump on the o-scope to appear, and so the data (stream? graph? analyzed? fused from multiple sensors?) won't necessarily capture all of that, but will only capture sort of the end state.
I guess the mandate-or-not thing may of course depend on discipline (an arXiv-like data sharing place might be bottom up in physics, but bio might need top-down mandates -- very few physicists will be impacted by NIH rules).

# posted by Christina : 1/26/2008 11:11 AM