Jason Williams and Lindsay Barone (Cold Spring Harbor Laboratory)
September 11, 2018
It is a truth universally acknowledged that a biologist in possession of data must be in want of a computer to analyze it on. Or, perhaps not. In 2016, as part of our efforts to better understand the needs of users and potential users of CyVerse (NSF-funded cyberinfrastructure for life sciences), we conducted a survey of NSF-funded investigators to determine what was important to them when it comes to analyzing large datasets. Surprisingly, foundational resources like high-performance computers and data storage were dead last in investigators’ ranking of unmet needs.
Over the past ten years of the CyVerse project (originally funded in 2008 as the iPlant Collaborative), we have been aware of life science’s transition towards a data-unlimited paradigm. For those less familiar with the field, this article describes the data challenges facing the life science community. Consistently, large and complex datasets are important to the majority of investigators. This problem will increase exponentially as high-throughput imaging becomes as cheap, efficient, and portable as genome sequencing, and as new methods such as deep learning need to be integrated into the biologist’s toolkit.
Given the challenges, what the surveyed NSF-funded investigators identified as their most unmet need was training, specifically:
- Training on integration of multiple data types
- Training on data management and metadata
- Training on scaling analyses to cloud/high performance computing
These needs suggest that NSF, universities, and other institutions have done a fantastic job of providing physical computational resources but haven’t provided some of the necessary catalysts for their effective use.
Ninety percent of researchers noted software maintenance as an ongoing activity, and 95% anticipated that it would be a need in the next three years. Half of researchers reported not being able to meet their current needs for updating and maintaining software. Failing to properly manage software hinders scientific progress in significant and even unnoticed ways. Take, for example, the tremendously popular TopHat software package, which continues to accumulate citations despite its author’s pleading with the community to stop using it and switch to faster, more accurate methods. Clearly, we need to do a better job of understanding the role and the life cycle of software in research! Wet lab researchers have trained “instincts” and “smell tests” for troubleshooting at the bench (e.g., checking antibody lot numbers or looking for expired or contaminated reagents). We need to develop these same instincts for evaluating and using software, and hopefully URSSI can help.
In addition to the software itself, training, documentation, and usability are just a few of the human-centered resources that need to be better understood and deployed to address the unmet needs suggested by the survey. It’s not surprising that many efforts to improve computational capacity in the sciences have focused on physical capabilities: they are foundational (no computation without a computer) and their metrics of success are easier to measure (how many CPUs have been funded, how much data has been stored). As researchers venture into addressing the bottlenecks in connecting people to the software and compute needed to address a research question, developing and assessing metrics of success becomes more difficult and complex.
One hope for URSSI is that it develops and tests hypotheses about how software design, documentation, and training contribute to more effective use of software. While these problems affect researchers in general, we also think life scientists have some unique challenges in software use (and corresponding research methodology). In contrast to some areas of data-intensive physics or astronomy, where teams of researchers coalesce (and develop best practices) around a few large shared instruments, individual biologists are potentially sitting on their own “Large Hadron Collider” or “Very Large Array” telescope. How do software use and adoption work in this context? How and why do inefficient or incorrect methods persist? How can better and more responsible software design curb these issues? So many questions!
While we don’t have the answers now, we look forward to contributing to URSSI and other efforts to learn more about what biologists need and how we can best support them.
Read the paper and see the survey report in PLOS Computational Biology:
Barone L, Williams J, Micklos D (2017) Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators. PLOS Computational Biology 13(10): e1005755. https://doi.org/10.1371/journal.pcbi.1005755