Recommendations for NIH Data Management Policies — can NIH get out of its own way?

Jason Williams
Towards Data Science
Dec 12, 2018


In October 2018, NIH issued a request for information “on Proposed Provisions for a Draft Data Management and Sharing Policy for NIH Funded or Supported Research.” The purpose of this request was to gather the opinions of members of the scientific community on how NIH-funded research should be made available. As per the request:

NIH has a longstanding commitment to making the results and accomplishments of the research that it funds and conducts available to the public. In NIH’s view, data should be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data.

It’s critical that the products of funded research are available both to other scientists and to the public (who are the funders of this research). A data management plan (DMP) is a component of an application for research funding. DMPs describe how scientific data produced in an investigation will be made available for reuse and verification, and how privacy concerns will be addressed. Scientists (in this case biomedical researchers) are experts in their domain, but their skills often don’t include training in data management. Lack of skill (and/or lack of awareness) in data management is one reason that data availability frequently fails to meet the ideals outlined by NIH (see also Federer et al. 2018). While publishers typically have a data availability requirement for publication, funders certainly have a role in realizing these standards.

Below is my submission to the NIH request. While I think most of the initial policy draft is sensible, my biggest issue is that I, like many others who spend time contributing their thoughts, don’t have a clear sense of how these comments actually impact NIH policy. For example, earlier this year NIH also requested input on its Strategic Plan for Data Science. I found the original plans appalling (see my thoughts on that here), and my comments generated several threads like this one on Twitter. I don’t think that I’m some kind of uniquely insightful genius here, but the “strategic” data science plan contained many ideas that were at best vague and in several cases plainly mistaken. Many people noted this, but the final plan was largely left untouched by community feedback.

NIH is one of the most remarkable institutions ever created, and as the world’s largest biomedical research organization, it is uniquely positioned to shape the global course of human health. I’ve been privileged to have a small role on a small piece of an NIH project (the NIH Data Commons), which has a mission to harmonize and make accessible valuable NIH data. If this project succeeds, it will certainly accelerate the pace of biomedical discovery. Every day, we see technologies (such as AI and deep learning) enabling insights — and potentially cures — that would have been inconceivable 5 or 10 years ago. Frustrating this potential is the daunting task of positioning NIH data to be accessible and compatible with new and powerful methods for data mining; policy hasn’t kept pace with science. Seeing the dedication of NIH investigators and program staff, I know progress is possible, but my optimism has typically been met with skepticism. I’ve had many conversations (in public and in private) where researchers express doubts that NIH can carry out data science infrastructure projects. The track records of projects like caBIG and Big Data to Knowledge (BD2K) leave a lot of space for criticism. Worse, many folks seem to think their voices don’t matter in changing the course of NIH policy and politics.

NIH needs to focus on its process for decision making — making it more transparent and responsive to the community. Success rests on realizing that while previous approaches to collecting and acting on feedback might be useful for making scientific decisions, the problems of working with data are actually more social than scientific or technological. After all, one of the most important data innovations of the last 5 years is the concept of FAIR — not a set of technologies, but a set of principles. The idea of a Data Commons is more than hosting a repository of data; it is an organization of trust and a commitment to the best science and patient outcomes, realized by sharing knowledge.

Hopefully, NIH will be able to “get out of its own way” when it comes to working within the 21st-century paradigm of big data science — technology can solve the technical problems, but people who fail to work together as a community will make the effort fail.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Re: Request for Information (RFI) on Proposed Provisions for a Draft Data Management and Sharing Policy for NIH Funded or Supported Research*

December 10, 2018

Dear Colleagues,

The NIH mission is to “seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.” It is highly appropriate that NIH develop data management and sharing policy cognizant of the challenges (and opportunities) brought about by the unprecedented, rapid growth of data. Before addressing my specific responses to the Proposed Provisions document, I wanted to raise a key issue I have not seen addressed — one which limits NIH’s access to the best possible advice.

NIH mechanisms for collecting feedback in the areas of data practice have generated skepticism about the organization’s willingness or ability to act upon community recommendations. Since the release of the NIH Strategic Plan for Data Science in June 2018, my own personal experience is that investigators — specifically from the bioinformatics, computational biology, and data science communities — have expressed disappointment that the comments made did not result in meaningful improvement in the final plan. Clearly, NIH action should not be based on my personal anecdotes. However, while NIH has been diligent about soliciting comments (and, crucially, posting them publicly), there is no sense from the investigators I’ve talked to, or from a reading of NIH announcements, just how final decisions will be made. In the absence of a more transparent process, the most cynical reading of this exercise is that NIH can ignore community recommendations and proceed to follow a pre-determined agenda. If investigators doubt NIH’s commitment, many of the comments NIH needs to hear most will never be offered. I speak of NIH as a whole because even though the actions of individual institutes may be their own, they all contribute to the community’s judgment of faith and good will in these processes.

My primary recommendation is that NIH go beyond disclosing the comments made to this and other RFIs by also making available to the community a public and complete explanation of how recommendations are acted on and for what reasons. For several reasons elaborated in my previous response**, NIH is at a ‘structural’ disadvantage in developing data-related policies. Ideally, NIH would publish ahead of an RFI a reasonable implementation plan describing how recommendations would be vetted and implemented, and then follow through on its commitment to be responsive to community recommendations.

Unfortunately, large-scale initiatives to support the data and computational underpinnings of NIH-sponsored science have not had a track record of success (e.g., caBIG, BD2K). I still have high hopes for the Data Commons (if it can progress towards being community-driven for the benefit of NIH, rather than NIH-entangled for the benefit of no one). Without doing more to be responsive and open, NIH risks becoming irrelevant as a policy shaper and will fade into a reactionary position as data generated domestically and globally dwarf the projects directly funded by NIH. I hope NIH decides to think outside the box (or the NIH campus) to build community consensus through a more transparent and accountable process. I think I see this happening, and I am hopeful for further assurance.

SPECIFIC COMMENTS

The definition of Scientific Data

Software is a concept missing from the definition of scientific data. In one sense, software is a form of metadata. For example, in a wet-lab experiment, a description of a cell line or antibody lot might be a crucial descriptor of a dataset. Often, scientific data is uniquely wedded to the software used to produce that data — from base calling software which may be involved in the production of “raw” sequence data, to the software used to produce any of several downstream analysis products. The definition of scientific data could benefit from explicit acknowledgment of this unique relationship.

Related to software, provenance is also a concept missing from the proposed plan. Scientific data is characterized by its life cycle (see one elaboration at https://www.dataone.org/data-life-cycle). In particular, the need for constant versioning and updating merits reflecting it in the definition of “scientific data” as a dynamic concept; a sketch of what such a provenance-aware record might look like follows.
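Here is that sketch: a minimal dataset record, written in Python, that treats the producing software and prior versions as first-class metadata. This is my own illustration, not an NIH or community schema; every identifier, field name, and tool name below is hypothetical.

```python
import json
from datetime import date

# A hypothetical dataset record. Field names are illustrative only,
# not an NIH or community standard.
dataset_record = {
    "dataset_id": "example-dataset-001",   # placeholder identifier
    "version": "1.2.0",                    # data is dynamic: version every release
    "released": date.today().isoformat(),
    # Provenance: an explicit link back to the version this release supersedes.
    "derived_from": "example-dataset-001 v1.1.0",
    "software": [
        # The software that produced the data is part of its metadata,
        # just as a cell line or antibody lot would be in a wet-lab experiment.
        {"name": "example-basecaller", "version": "3.4.1",
         "source": "https://example.org/basecaller"},
        {"name": "example-aligner", "version": "0.7.17",
         "source": "https://example.org/aligner"},
    ],
}

with open("dataset_record.json", "w") as fh:
    json.dump(dataset_record, fh, indent=2)
```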

The requirements for Data Management and Sharing Plans

Regarding “Related Tools, Software and/or Code”:

- For any software/code used (but not developed) in an analysis, it should be a requirement that a full description of the software be provided, including version numbers, links to source code/binaries, etc.

- As indicated, the NIH should strongly encourage the use of open-source software that is freely available. As funders and institutions migrate to fully open-access publication, it may be worthwhile to consider how NIH can progress toward requirements for the use of open software and data formats.

- I agree with the recommendation that where proprietary software must be used, an explanation is provided.

- It should be a further recommendation that software used in analysis be fully documented, for example by making available version-controlled scripts, makefiles, or workflow-language descriptions. Where possible, investigators should make use of modern reproducibility approaches such as containers (e.g., Docker, Singularity) or virtual machine images (see the sketch after this list). Documentation should follow recommendations that increase the reproducibility of analysis, such as minimum information standards (e.g., the minimum reporting guidelines for biological and biomedical investigations; http://www.nature.com/nbt/journal/v26/n8/full/nbt.1411.html, or more recent recommendations being produced by GA4GH, the Research Data Alliance, etc.).

- Where any scripts or other software are developed as part of an investigation, they must be accompanied by an appropriate open-source license and be available in a public repository upon submission of any pre-print and/or by the time of submission for peer review. Any code/software should be available by the end of funding regardless of publication status. The same recommendations on version control and containerization apply here as well.
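As one illustration of the documentation recommended above, the sketch below records the current git commit and the versions of every installed Python package in a single JSON file. It assumes the analysis runs inside a git repository with Python 3.8+; the output file name and structure are my own inventions, not a prescribed NIH format.

```python
import json
import subprocess
from importlib import metadata

def git_commit() -> str:
    """Return the current commit hash, so the exact version-controlled
    scripts used in the analysis are known. Assumes we are inside a
    git repository."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def installed_packages() -> dict:
    """Record every installed Python package and its version."""
    return {dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()}

environment = {
    "code_commit": git_commit(),       # which scripts ran
    "packages": installed_packages(),  # which library versions they used
    # A container image digest (e.g., from Docker or Singularity) could be
    # recorded here as well, pinning the full runtime environment.
}

with open("environment.json", "w") as fh:
    json.dump(environment, fh, indent=2)
```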

Regarding “Data Preservation and Access”:

- There is no comment on how investigators should address data that does not need to be kept. There should be a clear description of how investigators determine which intermediate or derivative data products do not require preservation. There may also need to be specific recommendations for documenting how sensitive data will be discarded — though these may be sufficiently addressed by other legal and policy requirements governing patient data.

- Although unique identifiers are addressed, there should be more specific guidance on where identifiers should be required, and what defines appropriate long-term storage solutions.

- There should be a description of how sharing will be achieved for large (> 1GB?) datasets. For very large datasets, sharing becomes increasingly difficult. While there may not be an obligation for the researcher to make every dataset equally available, it should be possible to characterize data sharing (perhaps via an NIH-implemented scoring system) so that datasets can be classified by how accessible they are (a hypothetical sketch of such a rubric follows this list). FAIR metrics projects being developed within the NIH Data Commons and elsewhere are already working on these objectives.
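To suggest what such a scoring system might look like, here is a purely hypothetical rubric sketched in Python. The criteria, weights, and tier names are my own inventions for illustration; they are not FAIR metrics, nor anything NIH has proposed.

```python
# A hypothetical rubric for classifying how well a dataset is shared.
# Criteria and weights are inventions for illustration only.
CRITERIA = {
    "persistent_identifier": 2,    # dataset has a DOI or similar
    "public_repository": 2,        # deposited in a long-term repository
    "open_format": 1,              # stored in a non-proprietary format
    "machine_readable_metadata": 1,
    "full_raw_data_available": 1,  # may be impractical for very large datasets
}

def sharing_tier(dataset: dict) -> str:
    """Score a dataset description against the rubric and assign a tier."""
    score = sum(weight for criterion, weight in CRITERIA.items()
                if dataset.get(criterion))
    if score >= 6:
        return "fully shared"
    if score >= 3:
        return "partially shared"
    return "minimally shared"

# Example: a very large dataset whose raw data cannot practically be moved
# can still score well on the other criteria.
print(sharing_tier({
    "persistent_identifier": True,
    "public_repository": True,
    "open_format": True,
    "machine_readable_metadata": True,
    "full_raw_data_available": False,
}))  # -> "fully shared"
```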

Regarding “Data Preservation and Access Timeline”:

- It should be made explicit that data funded by NIH (with the exception of protected records) must be made available.

The optimal timing, including possible phased adoption, for NIH to consider in implementing various parts of a new data management and sharing policy and how possible phasing could relate to needed improvements in data infrastructure, resources, and standards.

- I agree with the document’s suggestion that for extramural grants, the DMP could be evaluated as acceptable/unacceptable and as part of an Additional Review Consideration.

- The NIH needs to invest in training and/or learning materials on how to effectively critique a data management plan. A substantial number of reviewers in a study section or other panel (or program officers) likely have no training in the practices of computational science or data management; as such, they may not be the most effective adjudicators of a data management plan. NIH could increase the quality of its review by offering training. Groups such as Data Carpentry, DataONE, etc. have training materials/curricula that could be applicable here.

- The NIH should invest in training, including the development of learning resources, in order to assist investigators with the development and execution of data management plans.

Additional Recommendations

- The data management/policy landscape is very dynamic. NIH should implement an annual review of its guidance.

- In developing guidance on the data management policy, NIH should directly solicit advice from recognized organizations, including the Research Data Alliance, GA4GH, ELIXIR, Force11, and others. A formal advisory mechanism may be appropriate here.

Sincerely,

Jason Williams

Cold Spring Harbor Laboratory

* This response represents only my own personal opinions.

** See my comments on the NIH Data Plan and that of other community members here: https://github.com/JasonJWilliamsNY/2018_nih_datascience_rfi
