At its twice-yearly meetings in Washington, the SIAM Committee on Science Policy welcomes visitors from agencies with programs related to the mathematical sciences. At the November 2014 meeting, Philip E. Bourne, the associate director for data science at the National Institutes of Health, described some NIH programs that could benefit from the participation of mathematical scientists. The following article, by Bourne and his NIH colleague Michelle C. Dunn, is based on his CSP presentation.
The ongoing explosion in the quantity of biomedical data has put the spotlight on the importance of data analysis to the future of biomedicine and healthcare. These data cover all biological scales, from molecules to cells to patients, yet they share the properties of sparseness and noise. Given their volume and complexity, the data provide new challenges as well as opportunities for applied mathematicians to engage in the biomedical enterprise. The National Institutes of Health is keen to enable this engagement. What follows are examples of problem areas that can be addressed through the NIH Big Data to Knowledge (BD2K) initiative,* in which the three “V”s (volume, velocity, and variety) are all in play. Opportunities abound for every data type that biomedicine faces, including those described below.
Clinical data. Under the American Recovery and Reinvestment Act of 2009, more than $19 billion has been invested in efforts to motivate hospitals and healthcare providers to convert from paper to electronic health records (EHRs). Inconsistencies in the ways in which EHRs capture structured data (e.g., vital signs and lab results) and unstructured data (e.g., physicians’ free-form text notes) become evident when a large cohort of patients is examined. Yet we are beginning to see promise. Analyzing these data requires rigorous statistical approaches that work across large, sparse, noisy datasets, for which feature extraction requires, for example, improved natural language processing (NLP) tools.
Phenotype data. Challenges remain even for data that are routinely collected with a controlled vocabulary and in a standard format. Phenotype data, whether in a clinical or a laboratory setting, pose visualization and analysis challenges. With many more predictors than observations, even linear models are challenging because of drastically underdetermined systems of equations. Complex models are all the more challenging.
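To make the underdetermined case concrete, the following is a minimal sketch, not a method described by the NIH. It assumes ridge regression as one common way to pick a single solution when predictors outnumber observations, and it uses the dual form so that the only linear solve is as large as the number of observations. The function names, the data, and the penalty values are all hypothetical.

```python
def ridge_dual(X, y, lam):
    """Ridge regression via the dual form beta = X^T (X X^T + lam*I)^{-1} y.

    Illustrative only: hard-coded for exactly two observations, so the
    linear solve is 2x2 no matter how many predictors each row holds.
    """
    assert len(X) == 2 and len(y) == 2
    p = len(X[0])
    # Regularized Gram matrix K = X X^T + lam * I (2x2, symmetric)
    k00 = sum(a * a for a in X[0]) + lam
    k11 = sum(a * a for a in X[1]) + lam
    k01 = sum(a * b for a, b in zip(X[0], X[1]))
    det = k00 * k11 - k01 * k01
    # Solve K alpha = y with the closed-form 2x2 inverse
    a0 = (k11 * y[0] - k01 * y[1]) / det
    a1 = (k00 * y[1] - k01 * y[0]) / det
    # Map the two dual coefficients back to one coefficient per predictor
    return [X[0][j] * a0 + X[1][j] * a1 for j in range(p)]


def predict(X, beta):
    """Linear predictions, one per row of X."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def squared_norm(v):
    return sum(c * c for c in v)


# Hypothetical data: 2 observations, 5 predictors (more unknowns than equations)
X = [[1.0, 0.0, 2.0, -1.0, 0.5],
     [0.0, 1.0, -1.0, 2.0, 1.5]]
y = [1.0, 2.0]

beta_weak = ridge_dual(X, y, lam=1e-6)    # nearly interpolates the data
beta_strong = ridge_dual(X, y, lam=10.0)  # shrinks coefficients toward zero
fitted_weak = predict(X, beta_weak)
```

With a very small penalty the fit essentially reproduces the two observations, while a larger penalty trades fit for smaller coefficients. The choice of penalty is itself a modeling decision that this sketch does not address.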
Image data. Mathematical problems arise with image data both in processing, via feature mapping, and in analysis, through multiple comparisons. Feature mapping is the extraction from an image of useful summaries that describe the structure of interest, which can be represented as a manifold in a high-dimensional space. Multiple comparisons are made when, for example, hypothesis tests are performed on each pixel of an image to search for anomalies. Radiological images are a prime example, given the obvious benefit of early and reliable detection of anomalies.
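The multiple-comparisons issue can be sketched in a few lines. The example below is an illustration under stated assumptions, not a procedure taken from the article: it compares the Bonferroni correction with the Benjamini-Hochberg procedure on a short list of hypothetical per-pixel p-values. Benjamini-Hochberg is usually justified for independent or positively dependent tests, and neighboring pixels are typically correlated, so a real analysis would need to check that assumption.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure at false discovery rate alpha.

    Returns a list of booleans aligned with p_values (True = rejected).
    """
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per pixel-level test
pixel_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                  0.060, 0.074, 0.205, 0.212, 0.216]

bonf_flags = bonferroni(pixel_p_values)
bh_flags = benjamini_hochberg(pixel_p_values)
print("Bonferroni rejections:", sum(bonf_flags))
print("Benjamini-Hochberg rejections:", sum(bh_flags))
```

On these values Bonferroni flags one pixel and Benjamini-Hochberg flags two, which illustrates the trade between strict control of any false positive and control of the expected fraction of false positives.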
Genomic data. High-throughput sequencing technologies will shortly make genomic data a standard feature of an EHR. Genomic data include not only a patient’s genome, but also the genomic composition of the patient’s microbiome. It is estimated that trillions of the cells in a human body are not human cells but belong to bacteria, viruses, and other microorganisms. Our extensive microbiome is vital for normal health, and changes in its composition are a valuable diagnostic indicator. New approaches will be required for reliably reconciling genomic data within health records and for performing comparative analyses of such multiscale records.
Network data. Graphs are used to model networks of connections, including those between molecules, genes, neurons, people, care providers, and hospitals. Discovery from large-scale, dynamic networks relies heavily on the field of graph theory. Goals include understanding the evolution of networks and inferring causality through mechanistic modeling.
Streaming data. Longitudinal studies are commonplace, but the granularity with respect to time and the volume of data are changing rapidly. For example, the potential exists for patients to feed data continuously into their EHRs through mobile devices. As data are measured on increasingly fine time scales, data production will outpace the ability to write the data to disk or to do the desired computations in real time. In these cases, lightweight, real-time, online processing and analysis of streaming data become important, with decisions made after each observation.
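As one illustration of online processing with a decision after each observation, the sketch below uses Welford's algorithm to maintain a running mean and variance in constant memory and flags readings that fall far from the running mean. The technique, the threshold, the warm-up length, and the heart-rate values are assumptions chosen for illustration and do not come from the article.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two observations are seen
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def flag_outliers(stream, threshold=3.0, warmup=5):
    """Decide on each observation as it arrives, then fold it into the stats.

    An observation is flagged when it lies more than `threshold` running
    standard deviations from the running mean, once `warmup` points are seen.
    """
    stats = RunningStats()
    flags = []
    for x in stream:
        std = math.sqrt(stats.variance)
        is_outlier = (stats.n >= warmup and std > 0
                      and abs(x - stats.mean) > threshold * std)
        flags.append(is_outlier)
        stats.update(x)  # flagged points are still included in the stats
    return flags, stats


# Hypothetical heart-rate readings from a wearable device
heart_rate = [72, 74, 71, 73, 75, 72, 74, 140, 73, 72]
flags, stats = flag_outliers(heart_rate)
print(flags)
```

Each reading is examined once and never stored, which is the property that matters when data arrive faster than they can be written to disk. Whether a flagged reading should still update the running statistics is a design choice; here it does, which inflates the variance after the spike.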
These diverse data types are just the beginning. Individually, they produce tough computational and quantitative challenges. Their combination results in even tougher challenges in data standardization, manipulation, modeling, and analysis. These challenges must be overcome, as advances in the biomedical sciences will become possible through data integration. Data integration includes combining diverse types of data, as well as combining data with prior information from expert knowledge. Integrating expert opinion, particularly important in a clinical setting, could be equally valuable in a biological one.
Once data are combined, the biomedical problem often boils down to inference (drawing conclusions about the population) or prediction (predicting the outcome for a particular individual). Both require a mathematical and biomedical understanding of confidence, achieved through an estimation of the associated noise. Predictions made without consideration of confidence will lead to unnecessary risks. Currently, many scalable prediction algorithms do not estimate the errors associated with their predictions, and many prediction methods for which error is understood are not scalable. Therefore, a challenge is to combine the two in a biomedical context: to produce scalable algorithms that implement principled approaches for prediction and inference through the joint consideration of statistical risk and algorithmic runtime.
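One inexpensive way to attach an error estimate to a prediction is split conformal prediction, sketched below. This is an assumption introduced for illustration and is not a method named in the article: a model is fit on one part of the data, its absolute errors on a held-out calibration set are ranked, and a high quantile of those errors becomes the half-width of the prediction interval. The coverage guarantee holds only if calibration and future data are exchangeable. The straight-line model and all numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def conformal_radius(residuals, alpha):
    """Half-width of a split conformal interval with target coverage 1 - alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual;
    if that rank exceeds n, the interval is unbounded.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


# Hypothetical training data, used only to fit the model
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(train_x, train_y)

# Hypothetical calibration data, held out from fitting
calib_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.0, 3.0, 4.0]
calib_y = [1.3, 2.8, 5.4, 6.9, 9.3, 10.6, 12.4, 5.8, 8.3]
residuals = [abs(y - (slope * x + intercept)) for x, y in zip(calib_x, calib_y)]

radius = conformal_radius(residuals, alpha=0.2)

# Prediction for a new individual, reported with its interval
new_x = 3.2
prediction = slope * new_x + intercept
lower, upper = prediction - radius, prediction + radius
print(f"prediction {prediction:.2f}, interval [{lower:.2f}, {upper:.2f}]")
```

The calibration step costs one pass over held-out data plus a sort, so it adds little to the runtime of whatever predictor it wraps. The interval has the same width everywhere, which is a known limitation of this simplest variant.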
Achieving the promise of benefit from digital data requires a team effort. Computational and quantitative scientists are needed to address pressing problems alongside biomedical scientists. Building a collaborative relationship takes time and energy as both partners work to understand a new language and scientific context. The NIH is committed to bringing down barriers to the formation of collaborations between biomedical scientists and computational/quantitative scientists through investments in team science; teams with diverse expertise, including mathematical expertise, have the skills to unlock biomedical discoveries for the benefit of humankind.
* The NIH Big Data to Knowledge initiative was created in 2012 to encourage the development of new approaches, standards, methods, tools, software, and competencies that will enhance the use of biomedical data. It includes support for research, implementation, and training in data science and other relevant fields.