The Molecular Sciences Software Institute: The Time is Ripe for Change

T. Daniel Crawford and Daniel G. A. Smith

February 6, 2019

We are witnessing the early stages of a revolution in the computational molecular sciences. Numerous community codes in quantum chemistry, biomolecular simulation, and computational materials science are beginning to adopt modern, collaborative software engineering practices and tools, to the benefit of the broader field.

Over their long history, the computational molecular sciences have emerged as an essential partner with experiment in elucidating the structures and mechanisms that control chemical processes, and, in fact, often precede experiment in the knowledge-based design of new systems. The central tools of the field are its software products: hundreds of community-built scientific programs that have evolved over decades. In aggregate, these programs comprise tens of millions of lines of code written by thousands of developers and are used by hundreds of thousands of molecular scientists worldwide. This software is the computational incarnation of the many impressive molecular dynamics and quantum chemical models that have achieved such a level of robustness and accuracy that they are often considered “computational experiments,” in many cases with greater reliability than even laboratory measurements.

However, these tools also present significant challenges to the coming generations of computational molecular scientists as they attempt to extend the knowledge of their predecessors. These programs, which include both open source and commercial packages, involve hundreds of thousands to even millions of lines of handwritten code in an amalgam of programming languages such as C, C++, and the many variants of Fortran, as well as user-friendly languages from Perl to Python. The underlying algorithms of these codes are equally diverse, making use of the full range of computing motifs, including structured and unstructured grids, both dense and sparse linear algebra, graph traversal, and particle simulations requiring fast Fourier transforms for long-range electrostatics that drives molecular dynamics sampling or stochastic Monte Carlo (MapReduce) algorithms. The complexity of such programs is a natural reflection of the intricacy of the problems they are designed to solve, the rich feature set they offer, and the hardware for which they were developed.

The historical paradigm for software development in the molecular sciences has evolved organically over decades, with an emphasis on the contributions of multiple research groups to a single code package. This approach creates many obstacles to progress and sustainability, especially for the interoperability of codes; the efficient use of new, disruptive computing architectures; and scaling to a larger number of developers.

In 2016, the U.S. National Science Foundation established the Molecular Sciences Software Institute (MolSSI) to tackle the many challenges facing the computational molecular sciences community. The MolSSI’s goal is to catalyze software innovation while simultaneously helping to make the community’s software infrastructure interoperable, sustainable, and agile. The MolSSI provides a vehicle by which the molecular science community is working to establish standards and best practices, and it is training students and postdoctoral associates in modern algorithm design and hardware technology. Access to well-designed, domain-specific training in these practices is a crucial aspect of community-wide adoption.

Partly as a result of the MolSSI’s efforts in both education and our Software Fellowship program, the field of computational molecular sciences (CMS) is moving into the modern era, and best practices that have long been standard in the computer science community are finally gaining a strong foothold in quantum chemistry, molecular dynamics, and related codes. Hundreds of CMS students have participated in the MolSSI’s novice workshops, boot camps, and Software Summer Schools, and, as a result, collaborative development models, distributed repositories, increased modularity, unit testing, continuous integration, code coverage measurement, and more are quickly becoming the de facto standards of the field. Indeed, the MolSSI’s motto in this arena is “education, demonstration, and consultation.”

The MolSSI’s Software Fellows are graduate students and postdocs across the U.S. who are supported directly by the Institute and learn modern software design and engineering techniques through the mentoring of our Software Scientist team. For example, MolSSI Software Fellow Caitlin Bannan of the University of California, Irvine, is developing chemper, a set of tools that is part of the Open Force Field Initiative for use with SMIRNOFF, an emerging XML force field format. Chaya Stern from the Tri-Institutional PhD Program in Chemical Biology is developing cmiles, which is designed to generate appropriate chemical identifiers for storing quantum chemical data. Marc Riera-Rambau’s fellowship project at UC San Diego is clusters_ultimate, which will serve as a tool in his effort to develop many-body potential energy functions. All of these Fellows have adopted a wide range of tools, including modern testing frameworks, continuous integration, code coverage, and static analysis, among others.

The MolSSI also provides an infrastructure that makes the adoption of such tools easier. For example, two of the MolSSI’s Software Scientists, Levi Naden and Daniel Smith, have developed a “cookiecutter” template to help implement many of these best practices for those developing Python-based CMS codes. The template automatically provides Git initialization (with GitHub hooks), a basic testing structure through pytest, pre-configured continuous integration for a variety of platforms, automatic package versioning, a basic documentation structure via Sphinx, and more. The cookiecutter builds on the work and ideas of many large open-source organizations such as NumFOCUS, the Apache Software Foundation, and Software Carpentry and tailors the experience to CMS software.
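To give a flavor of the "basic testing structure" such a template sets up, the following is a minimal sketch of a small function and the kind of pytest-style tests that might accompany it. It is not taken from the MolSSI cookiecutter itself: the function name `distance` and the test names are hypothetical, chosen only for illustration, and in a real project the function and its tests would normally live in separate files.

```python
"""Illustrative function with pytest-style tests (all names are hypothetical)."""
import math


def distance(a, b):
    """Return the Euclidean distance between two points given as 3-element sequences."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("points must have exactly three coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# pytest collects functions whose names start with "test_" and reports
# any failing assert statement as a test failure.
def test_distance_known_value():
    # A 3-4-5 right triangle lying in the xy-plane.
    assert math.isclose(distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)), 5.0)


def test_distance_is_symmetric():
    a, b = (1.0, 2.0, 3.0), (-1.0, 0.5, 2.0)
    assert math.isclose(distance(a, b), distance(b, a))


def test_distance_rejects_bad_input():
    # A point with only two coordinates should be refused.
    try:
        distance((0.0, 0.0), (1.0, 1.0, 1.0))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a two-coordinate point")
```

Running `pytest` in the directory containing a file like this would discover and execute the three tests. A continuous integration service can run the same command on every proposed change, which is how the practices listed above reinforce one another.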

Finally, the MolSSI itself serves as an exemplar in the adoption of best practices through its many software infrastructure projects. The MolSSI’s QCArchive team of Daniel Smith, Doaa Altarawy, Levi Naden, and Sam Ellis provides a prime example of a major project that takes advantage of software engineering best practices and design principles while leveraging existing software tools and technology. The purpose of QCArchive is to provide an open, community-wide quantum chemistry database to enable the development of new molecular dynamics force fields, to assess the reliability of new methods, and to provide access to a broad swath of key data for subsequent machine learning efforts.

These and many other examples serve as strong evidence of the progress the CMS community is making to modernize and standardize its software development practices. Further adoption by major community codes such as LAMMPS, OpenMM, RDKit, Psi4, and others will speed this process and help our scientific domain to advance even more rapidly. While there is much work yet to be done, the field is clearly ripe for change.