Thursday, February 27, 2014

Using someone else's data


There was quite a lot of activity yesterday in response to PLOS ONE's announcement regarding its data policy. Most of the discussion I saw concerned rights of use and credit, completeness of data (e.g. the need for stimulus scripts for task-based fMRI) and ethics (e.g. the need to get subjects' consent to permit further distribution of their fMRI data beyond the original purpose). I am leaving all of these very important issues to others. Instead, I want to pose a couple of questions to the fMRI community specifically, because they concern data quality, and data quality is what I spend almost all of my time dealing with, directly or indirectly. Here goes.


1. Under what circumstances would you agree to use someone else's data to test a hypothesis of your own?

Possible concerns: scanner field strength and manufacturer, scan parameters, operator experience, reputation of acquiring lab.

2. What form of quality control would you insist on before relying on someone else's data?

Possible QA measures: independent verification of a simple task such as a button press response encoded in the same data, realignment "motion parameters" below/within some prior limit, temporal SNR above some prior value.
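To make two of those candidate measures concrete, here is a minimal sketch of a temporal SNR and motion-parameter check, written in plain Python. Everything in it is an assumption for illustration: the function names, the toy input layout (one list of intensities per voxel, one list of translations per axis) and especially the thresholds `min_tsnr` and `max_range_mm`, which are placeholders rather than recommended values.

```python
from statistics import mean, median, stdev


def temporal_snr(timeseries):
    """tSNR of one voxel: temporal mean divided by temporal standard deviation."""
    sd = stdev(timeseries)  # sample standard deviation over time
    if sd == 0:
        return float("inf")  # a perfectly constant voxel has no temporal noise
    return mean(timeseries) / sd


def motion_range(parameter_trace):
    """Peak-to-peak excursion of one realignment parameter over the run."""
    return max(parameter_trace) - min(parameter_trace)


def passes_qa(voxel_timeseries, translations_mm, min_tsnr=50.0, max_range_mm=1.0):
    """Apply both checks; the default thresholds are hypothetical placeholders."""
    # Summarize tSNR across voxels with the median (one possible choice).
    median_tsnr = median(temporal_snr(ts) for ts in voxel_timeseries)
    # Take the worst peak-to-peak excursion across the translation axes.
    worst_range = max(motion_range(trace) for trace in translations_mm.values())
    return median_tsnr >= min_tsnr and worst_range <= max_range_mm


# Toy example: two voxels, one translation axis, four time points.
voxels = [[100, 102, 98, 100], [200, 201, 199, 200]]
translations = {"x": [0.0, 0.2, -0.1, 0.0]}
print(passes_qa(voxels, translations))
```

A check like this only says that the data cleared some prior limits. Whether those limits mean anything for data quality is exactly the question being posed.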


If anyone has other questions related to data quality that I haven't covered with these two, please let me know and I'll update the post. Until then I'll leave you with a couple of loaded comments. I wouldn't trust anyone's data if I didn't know the scanner operator personally and I knew first-hand that they had excellent standard operating procedures, a.k.a. excellent experimental technique. Furthermore, I wouldn't trust realignment algorithm reports (so-called motion parameters) as a reliable proxy for data quality in the same way that chemicals have purity values, for instance. The use of a single summary value - "My motion is less than 0.5 mm over the entire run!" - is especially nonsensical in my opinion, considering that the typical voxel dimension exceeds 2 mm on a side. Okay, discuss.


UPDATE 13:35 PST

Someone just alerted me to the issue of data format. Raw? Filtered? And what about custom file types? One might expect to get image domain data, perhaps limited to the magnitude images that 99.9% of folks use. So, a third question is this: What data format(s) would you consider (un)acceptable for sharing, and why?

4 comments:

  1. If I used somebody else's data, I would want that person to agree to be a co-author on the resulting paper, with all the responsibilities that entails for validating the way the data was used.

    Replies
    1. Good response. So does this imply you would ignore public repositories of data, e.g. those for the Human Connectome Project, the 1000 Functional Connectomes project and other open source fMRI repositories, where co-authorship is unlikely to be offered?

  2. This question is really about the richness of metadata and looking beyond current approaches. If data were to be available, it would come with all the things that we have come to expect from publications, and more. (Here is a link to an effort on data citation: http://www.force11.org/datacitation) But I would argue that it should also come with its entire provenance, so that both of the above questions essentially become a filtering problem.

    These questions are not really different from asking what news one consumes, what movies one sees, and so on. Currently we make these as personal choices, often based on some measure along dimensions in our heads.

    In the context of brain imaging data, this is no different. We want data to match various dimensions of interest. As long as the underlying information is there, we will be able to put an appropriate application lens/filter on it. If the filter fails to pass any data through, then we don't have the information our application needs.

    The key is to realize that different people are going to have different needs. And so we should capture as much metadata as possible (DICOM tags, participant demographics, questionnaires) and figure out what pieces are currently not available electronically. And I do believe that for some applications a lot of noisy data is way more useful than small amounts of pristine data.

    In response to co-authorship: this should really not be required in the presence of data citation principles. If data are published they should be citable, and hence you have the necessary responsibility behind them.

    Replies
    1. Hi Satra, not sure I follow you. What sort of filters would you apply to fMRI data to get insight into intrinsic quality? There are a huge number of ways to get crappy fMRI data yet very few ways to measure the quality, it seems to me. We do tend to wing it and use our intuition. (More often than not the data is deemed "good" if the expected result materializes, "bad" if it doesn't!) Note that I am not targeting shared fMRI data but fMRI data in general. There doesn't seem to be a good benchmark for quality. Yet there are many ways to get systematic and/or random flaws in fMRI data. Is one supposed to use thousands of data sets and hope that there is signal in there somewhere? Shouldn't we try to weight better data more than worse data? But if so, how?
