Response to Next-Generation Data Science Challenges in Health and Biomedicine RFI
Text of the Request for Information call: https://grants.nih.gov/grants/guide/notice-files/NOT-LM-17-006.html
Recent years have proven that two components are essential for artificial intelligence breakthroughs: large, rich datasets and advanced algorithms. Advances in machine vision (especially object recognition) are a great example of this. The substantial increase in the accuracy of those systems was only possible due to the availability of a large annotated dataset (ImageNet) and improvements to powerful analysis methods (deep neural networks). Biomedical data science is heading in the right direction, but progress is stifled by the following major roadblocks:
- Goal: Share more data.
Roadblock: NIH data sharing policies are not enforced, leading to wasted taxpayer money. Even though data management plans are part of the grant review process, they are rarely taken seriously by review panels, which focus on the scientific aspects of the proposal. Researchers with strong track records of sharing data are not rewarded appropriately in the grant review process.
- Goal: Improve data curation.
Roadblock: Building and maintaining biomedical data repositories is difficult given current NIH funding opportunities, which focus on development of new resources rather than long-term support for existing resources. A major concern of data submitters is long-term preservation, which is hard to guarantee with short-term grants. Similarly, maintaining a repository (even without developing new features) requires computational resources (storage, web servers, etc.) that need to be covered over a period that is often longer than the typical duration of an R01 grant.
- Goal: Encourage innovative data reuse.
Roadblock: Even though publicly available datasets are often reused by biomedical researchers (Gorgolewski, Wheeler, Halchenko, Poline, & Poldrack, 2015; Milham et al., 2017), they have low penetration in the broader machine learning community, which tends to use non-medical datasets as benchmarks. Biomedical datasets concerning important questions are often poorly advertised and only available in raw form, in file formats that are not commonly used in data science and machine learning.
In an attempt to improve this situation, we recommend the following interventions:
Recommendation A1: Include data sharing history as a compulsory part of the biosketch. This will highlight scientists’ commitment to data sharing and cement data sharing efforts as a first-class citizen among other academic outputs.
Recommendation A2: Make data management plans publicly available. This will lead to more transparency and public accountability. Because their promises regarding data sharing will be public, researchers will be more likely to abide by them.
Recommendation A3: Add an explicit “Data and Materials Sharing” criterion score to the grant scoring protocol. This additional dimension should take into account the applicant’s data sharing history (see Recommendation A1) adjusted for seniority. This mechanism will incentivize researchers to put more effort into providing realistic data sharing plans in their grants.
Recommendation B1: Intramural support for long-term backup of publicly available data. Many existing field-specific data repositories are struggling to guarantee long-term preservation of their records. NLM could help by providing a free service allowing affiliated repositories to deposit backup copies of their records. This would increase the chances of preserving those datasets in the long term. Such a service would be distinct from NIH-supported archives such as NDA, since it would be provided for public data and without any data curation (assuming that deposited datasets are already curated).
Recommendation B2: Long-term grants providing cloud credits for community-run repositories and services. Provide a funding mechanism that would subsidize cloud computing costs for public data repositories. This mechanism could be targeted at established repositories with the goal of maintaining their operations in the long term. The grants could come in the form of cloud computing credits or discounts for these services.
Recommendation C1: Creation of benchmark biomedical datasets curated for ease of use in the context of deep neural network applications. One of the most commonly used benchmark datasets for adversarial neural networks is a collection of photographs of celebrities. It is easy to access and work with, and thus is the go-to dataset for validating new techniques. The same cannot be said about many publicly available biomedical datasets. There is great potential in directing the machine learning community towards important biomedical problems, but work needs to be put into curating those datasets so that they are easier to use for computational scientists who are not necessarily biomedical experts. NIH should issue a set of special calls for grants aimed at the creation of widely accessible benchmark datasets or competitions in the space of important biomedical problems.
We believe that implementing this set of practical recommendations will set the NIH on track toward more efficient, cheaper, and more interdisciplinary science. We are happy to discuss these ideas at greater length.
Gorgolewski, K. J., Wheeler, K., Halchenko, Y. O., Poline, J.-B., & Poldrack, R. A. (2015). The impact of shared data in neuroimaging: the case of OpenfMRI.org. F1000Research. https://doi.org/10.7490/f1000research.1110040.1
Milham, M., Craddock, C., Fleischmann, M., Son, J., Clucas, J., Xu, H., … Klein, A. (2017, September 4). Assessment of the impact of shared data on the scientific literature. bioRxiv. https://doi.org/10.1101/183814