How not to get lost in your data

July 7, 2015

Prof. Smith had a brilliant idea – without acquiring any new data, he would be able to test his new hypothesis. All he had to do was get his PhD student to reanalyze the data acquired by his postdoc two years ago. Brilliant! And so cheap!

Everything was rosy until he tried to put his plan into practice. First of all, he thought he had all the data – sitting on an external hard drive the postdoc had given him the day before leaving. What Prof. Smith found on the device was a messy collection of binary files, spreadsheets with column names that only experts from Bletchley Park could decipher, and folders called “copy_v2_do_not_touch”. The postdoc was of course unreachable – digging wells in Malawi. It would take weeks to make sense of the data before the student could start working on the project…

Is this story familiar to you? Either from past experience or current fears? We have seen a number of projects struggle to organize their data clearly and efficiently. Poor organization not only can lead to analysis mistakes, but also makes reusing the data within the lab harder. When it comes to sharing the data with collaborators or the general public, post-hoc data organization can take a significant amount of time.

To address this problem, we have worked together with the INCF Neuroimaging Data Sharing Task Force as well as many external experts to put together a comprehensive set of guidelines for organizing data. We called it the Brain Imaging Data Structure, or BIDS for short. We tried to make it simple and intuitive to use, mimicking practices already implemented in many labs – such as encoding metadata in folder and file names. We also opted for a file-based solution instead of a dedicated database (such as XNAT, Loris, SciTran or NIDB), because most analyses in neuroimaging labs are run on a filesystem after the data have been downloaded from a database.
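To make the idea of encoding metadata in folder and file names more concrete, here is a minimal Python sketch. It assumes file names built from underscore-separated "key-value" pairs followed by a final label describing the data type, for example "sub-01_task-rest_bold.nii.gz". The helper name parse_entities and the example path are hypothetical and chosen only for illustration; the specification itself remains the authority on the exact naming rules.

```python
from pathlib import PurePosixPath


def parse_entities(path):
    """Extract metadata encoded in a file name (illustrative helper, not an official tool).

    Assumes the name consists of underscore-separated "key-value" pairs
    followed by one final label, e.g. "sub-01_task-rest_bold.nii.gz".
    """
    name = PurePosixPath(path).name
    stem = name.split(".", 1)[0]  # drop everything from the first dot (extensions)
    parts = stem.split("_")

    entities = {}
    for part in parts[:-1]:
        key, sep, value = part.partition("-")
        if not sep or not key or not value:
            raise ValueError(f"expected 'key-value', got {part!r} in {name!r}")
        entities[key] = value

    # The last chunk is treated as a label for the kind of data in the file.
    entities["suffix"] = parts[-1]
    return entities


if __name__ == "__main__":
    # Hypothetical path used only for illustration.
    print(parse_entities("sub-01/func/sub-01_task-rest_bold.nii.gz"))
```

Because the metadata lives in the name itself, a script like this can tell which subject and task a file belongs to without opening the file or consulting a separate database.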

We are also working with the developers of data analysis pipelines (such as Nipype, C-PAC and Automated Analysis) with the goal of making it easier to run high-quality automated pipelines on data organized according to BIDS. In addition, we are working with existing databases (including Loris and OpenfMRI) to make importing BIDS datasets easier. This will speed up sharing your data when you decide to do so, as is increasingly required by funding agencies and journals.

You can learn more about BIDS on the project website, where you can find information about the specification itself, validation tools, and, most importantly, ways to give us feedback. We want this standard to fit as many experimental designs as possible, so your comments are very valuable to us!