Stephan Druskat, Daniel S. Katz, David Klein, Mark Santcroos, Tobias Schlauch, Liz Sexton-Kennedy, and Anthony Truskinger
December 3, 2018
(reposted from SSI blog)
This post is part of the WSSSPE6.1 series of speed blog posts.
Like the behemoth cruise ship leaving the harbor of Amsterdam that overshadowed our discussion table at WSSSPE 6.1, credit for software is a slowly moving target, and it’s a non-trivial task to ensure that the right people get due credit.
In this blog post, we aim to review the current state of practice in terms of credit for research software. We also attempt to summarize recent developments and outline a more ideal state of affairs.
First of all, though, we need to define the object of the discussion, i.e., credit-worthy research software. While some suggest that only open source software should be considered a “scholarly contribution” (Hafer & Kirkpatrick 2009), our discussion did not reach such a clear-cut position. In the real academic world, commercial or commercially supported software contributes to research, either directly or as a dependency of other software. And in some scenarios, software that is supported may realistically be more important than software that is openly available, even if the support is commercial rather than academic. None of this detracts from the best practice of open sourcing research software whenever possible, especially when it is publicly funded.
Similarly, “credit” can mean “academic credit”, which is still most commonly linked to citation and metrics such as the h-index. More generally, though, in both the commercial and the academic scenario, credit ultimately serves very similar purposes: career advancement, monetary compensation, and the enlargement of spheres of influence. These are the underlying assumptions in the discussion of credit for research software that follows.
The authors of a piece of software include the people who write the code, as well as those who architect and design it, and even those who come up with the initial idea for it. In many fields, all of the above receive credit when a paper associated with the software is published. Academics are often inclusive here because there is a community context that recognizes the roles of students, postdocs, and principal investigators. Software maintenance and support are important for the continued usefulness of a software package; credit for this activity comes naturally through the online identifiers of the people involved in the support.
A less obvious group who should receive credit are those in non-development roles: documentation, testing, project management, UI/UX, and so on. The authors of documentation have an obvious way of claiming authorship, but many of the others do not. When software moves from a research idea to code to an actual service, many of these roles become essential. Deploying a widely used service can become a major project requiring a team whose members should all get credit.
Once we have defined which aspects of scientific software people can be credited for, we need a mechanism to attribute this credit to the people involved rather than to the software itself. Some software has a dedicated (methodological) paper published in a regular scientific venue and can be cited and credited through the usual paper citation mechanisms. For software without such a paper, a lightweight description can be created that at least gives the software a persistent identifier, such as a DOI, which can list the authors. An even lighter-weight model is to point to an “AUTHORS.txt” file in a public repository. The authors of the software clearly bear some responsibility for declaring their contributions in a way that others can pick up. If no authors are listed, the project itself can be credited, but this is problematic: the set of authors changes over time, and one cannot easily trace credit back to the right people.
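To make this concrete, here is a minimal sketch (in Python) of what such a lightweight, machine-readable author record could look like. The file name, fields, and values are purely illustrative and do not follow any existing standard:

```python
# A minimal sketch: write a small, machine-readable record of who contributed
# to a software project, so that credit can be picked up by other tools.
# File name, fields, and values are illustrative, not a standard.
import json

authors = [
    {"name": "Ada Lovelace", "orcid": "0000-0000-0000-0000", "role": "developer"},
    {"name": "Grace Hopper", "orcid": "0000-0000-0000-0001", "role": "documentation"},
]

metadata = {
    "title": "example-research-tool",      # hypothetical project name
    "version": "1.2.0",
    "doi": "10.5281/zenodo.0000000",       # placeholder DOI
    "authors": authors,
}

# Store the record alongside the code, e.g. in the repository root.
with open("AUTHORS.json", "w") as fh:
    json.dump(metadata, fh, indent=2)
```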
Awarding credit can be difficult. Citations are the recognized solution for academic research. However, citations of outputs other than traditional publications are still a new concept, and cultural change is required before citing software in research papers becomes effective.
One way to receive credit for your research software is to write and publish a “classical” paper or a “software metapaper” about it. Several journals, such as JORS and JOSS, accept and publish papers that focus on describing research software. Based on the Software Citation Principles (Smith et al. 2016), this should be considered a temporary solution, with the long-term goal being that research software itself is cited specifically and directly; nevertheless, it is currently a reasonable option.
Another option is to use digital object archives like Zenodo. These archives allow you to publish different types of digital objects, such as preprints, data sets, and software packages, in a citable way. For this purpose, you can deposit a specific release of your research software (e.g., as a source code package) together with citation metadata. In return you receive a DOI (digital object identifier) that others can use to cite that specific software release, and on the basis of which Zenodo, for example, provides usage statistics.
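As an illustration, the sketch below uses Zenodo’s REST deposit API to create a software deposition with citation metadata. The token and metadata are placeholders, and the endpoint and field names reflect our reading of the Zenodo API documentation and should be checked there:

```python
# A rough sketch of depositing a software release on Zenodo via its REST API.
# ZENODO_TOKEN and all metadata values are placeholders; verify endpoint and
# field names against the current Zenodo API documentation.
import requests

ZENODO_TOKEN = "replace-with-a-personal-access-token"

metadata = {
    "metadata": {
        "title": "example-research-tool v1.2.0",
        "upload_type": "software",
        "description": "Release 1.2.0 of an example research tool.",
        "creators": [{"name": "Lovelace, Ada", "orcid": "0000-0000-0000-0000"}],
    }
}

# Create a new deposition; the DOI is minted when the deposition is published.
response = requests.post(
    "https://zenodo.org/api/deposit/depositions",
    params={"access_token": ZENODO_TOKEN},
    json=metadata,
)
response.raise_for_status()
print("Deposition created:", response.json()["links"]["html"])
```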
Moreover, there are technology-specific solutions. For example, the R ecosystem supports mechanisms to automatically analyze the dependencies of a given R package and, on that basis, determine the contributions of those packages. However, it remains an open question whether such technology-specific approaches can push the topic forward on a larger scale; similar approaches in the Python community, for example, have not gained sufficient interest.
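For illustration, a comparable approach in Python could start from the dependency metadata shipped with installed packages. This is only a sketch using the standard library’s importlib.metadata (Python 3.8+), with an arbitrary package name:

```python
# Sketch: walk an installed package's declared dependencies and collect
# author metadata, so that those projects (and people) could be credited too.
import re
from importlib.metadata import metadata, requires

package = "requests"  # illustrative: any installed package will do

for requirement in requires(package) or []:
    # Requirement strings look like "urllib3<3,>=1.21.1"; keep the name only.
    name = re.split(r"[\s;<>=!~\[(]", requirement, maxsplit=1)[0]
    try:
        info = metadata(name)
        print(f"{name}: author = {info.get('Author', 'unknown')}")
    except Exception:
        print(f"{name}: not installed, no metadata available")
```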
We also have two new proposals that can help award credit. The first, software metrology, is important for understanding how software is used. If software does not have a source of funding, as is the case for many academic open source packages, then evidence of utility is needed to secure more funding. Usage metrics are also a backup: they help patch cases where, for whatever reason, normal citations are not awarding appropriate credit. There is an obvious privacy concern attached to software metrology, so we recommend that any usage tracking must fail gracefully, be absolutely transparent, be possible to turn off, and collect only anonymized data. Ideally, the collected data should be public as well.
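A minimal sketch of what such privacy-respecting usage reporting could look like follows; the endpoint URL, environment variable, and payload are invented for the example:

```python
# Sketch of usage reporting along the lines described above: opt-out via an
# environment variable, only anonymous data, and failing silently so that the
# tool keeps working even if the metrics endpoint is unreachable.
import os
import platform
import requests

METRICS_URL = "https://example.org/usage"  # hypothetical collection endpoint


def report_usage(tool_name: str, version: str) -> None:
    """Send an anonymous usage ping unless the user has opted out."""
    if os.environ.get("EXAMPLE_TOOL_NO_TELEMETRY"):  # explicit opt-out
        return
    payload = {
        "tool": tool_name,
        "version": version,
        "python": platform.python_version(),  # coarse, non-identifying info only
    }
    try:
        requests.post(METRICS_URL, json=payload, timeout=2)
    except requests.RequestException:
        # Fail gracefully: metrics must never break the actual software.
        pass


report_usage("example-research-tool", "1.2.0")
```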
Second, we propose that automated dependency tracking should be a staple of academic software packages. A machine-readable dependency tree can help discoverability by producing “credit” trees that clearly show where transitive credit is due. Such dependency tracking is probably best established as a technology-agnostic “loose standard” or template that can be used to form dependency trees. These trees would allow much more fine-grained and inclusive credit to be assigned to authors automatically. The CRAN package management system can already list package dependencies and may be a useful model, and established technologies such as integration with ORCID and Zenodo could help make this concept a reality.
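To illustrate the idea, here is a small sketch of a “credit tree”: a machine-readable dependency tree with authors attached to each node, flattened so that every contributor to every (transitive) dependency can be listed. The data structure and names are illustrative, not an existing standard:

```python
# Sketch: flatten a dependency tree with per-node author lists into a credit
# listing that covers all transitive dependencies.
dependency_tree = {
    "name": "analysis-pipeline",
    "authors": ["A. Researcher"],
    "dependencies": [
        {
            "name": "numeric-core",
            "authors": ["B. Developer", "C. Maintainer"],
            "dependencies": [
                {"name": "low-level-lib", "authors": ["D. Engineer"], "dependencies": []},
            ],
        },
    ],
}


def credit_list(node, depth=0):
    """Yield (package, authors, depth) for a node and all its dependencies."""
    yield node["name"], node["authors"], depth
    for dep in node.get("dependencies", []):
        yield from credit_list(dep, depth + 1)


for name, authors, depth in credit_list(dependency_tree):
    print("  " * depth + f"{name}: {', '.join(authors)}")
```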
The idea of papers in academia has existed for over 500 years, but primarily for the purpose of authentication and authority rather than for the provision of credit, acknowledgment, or attribution. These papers were originally written by a single author, and thus credit for the paper could be clearly tied to that author. Since then, the number of authors per paper has increased, leading to difficulties in apportioning credit among authors, but giving credit to the set of authors has not really changed: as long as a paper lists a set of authors, they are all credited. More recently, the idea of data citation has become somewhat popular, following the model of paper citation: each data set is published along with a set of “authors”, and these authors are credited when the data is cited in a publication. However, software is somewhat different from papers and data, and it is not clear that the same method for assigning credit will work.
Many of the issues arise from these differences. For example, software has multiple versions (both formal numbered versions, also called releases, and many more informal versions, also called commits), and each might have a different set of authors. As previously discussed, determining the authors of even a single version can be a significant challenge. In addition, the fact that software can be both forked and wrapped leads to questions about credit. A fork is a copy, so by itself the credited authors should not change, but after some amount of work on the code, new authors should likely be added and old authors possibly removed. Similarly, if a small wrapper around a large code is credited, how should the developers of the large code be credited, if at all? In both the forking and wrapping cases, the underlying questions are: at what point does new software become independent, and who gets credit? Software is also an implementation of an algorithm, which leads to questions about credit for the algorithm versus credit for the software. Transitive credit might be a useful solution for some of these issues (see the toy example below). Another challenge relates to frameworks and the components within them: for credit to be appropriately assigned, the authors of the framework and of each component must be identified, and any particular use of the framework should report which components were used (and thus which authors should be credited).
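As a toy example of transitive credit, suppose each product passes a fraction of its credit on to each of its dependencies, and these fractions multiply down the dependency chain. The weights below are invented, and how to choose them is itself an open question:

```python
# Toy illustration of transitive credit: a product keeps part of its credit
# and passes the rest on to its dependencies; fractions multiply down the tree.
# Each entry: product -> list of (dependency, fraction of credit passed on).
credit_map = {
    "wrapper-tool": [("big-code", 0.75)],   # a thin wrapper around a large code
    "big-code": [("numeric-core", 0.20)],
    "numeric-core": [],
}


def transitive_credit(product, share=1.0, totals=None):
    """Propagate credit fractions through the credit map."""
    totals = {} if totals is None else totals
    passed_on = sum(frac for _, frac in credit_map.get(product, []))
    # The product's own authors keep whatever is not passed on to dependencies.
    totals[product] = totals.get(product, 0.0) + share * (1.0 - passed_on)
    for dependency, fraction in credit_map.get(product, []):
        transitive_credit(dependency, share * fraction, totals)
    return totals


print({k: round(v, 2) for k, v in transitive_credit("wrapper-tool").items()})
# {'wrapper-tool': 0.25, 'big-code': 0.6, 'numeric-core': 0.15}
```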
Although there is room for improvement, some important progress has been made recently. In general, the community’s awareness of crediting software and its developers is perceived to be increasing. Some institutions are starting to move away from papers as the only research metric; here, making software and data citable is crucial for facilitating citations. However, tracking what you have used is not equivalent to citing it and giving credit. Awareness of data citation policies is growing; perhaps software can ride the same bandwagon? Software Carpentry has played a major role in disseminating best practices in developing (“professionalizing”) and crediting software. The fact that funding bodies are becoming more aware and are rethinking policies and requirements for funded projects with respect to research software will also have an impact. The “reproducibility crisis”, which has increased awareness of sustainable research software, might also strengthen the process of establishing and enforcing a crediting system for research software.
Among the current challenges for research software credit is the balance between implementing ideal best practices and making changes now that already improve the situation. Efforts are underway, for example in the FORCE11 Software Citation Implementation Working Group, to provide methods for implementing software citation that build on the best practices put forward in the Software Citation Principles (Smith et al. 2016). However, researchers and developers of research software also need ways to receive credit through today’s (less optimal) methods, e.g., by writing software metapapers that can be cited traditionally.
Until more optimal solutions have been developed and the culture change around credit for research software has moved forward, the community should adopt the “good enough” practices available to it while also putting effort into developing better practices.
These better practices include, for example: a maximally automated workflow for providing, disseminating, and using software citation metadata; methods to express and retrieve reliable authorship information for software versions and “products”, frameworks, wrappers, and forks; an automatable implementation of calculating transitive credit for, e.g., dependencies; policies that support software credit through, e.g., funding requirements; and metrics to measure software impact.
While the nature of software makes it hard to assign and receive credit for it, the research community must collaborate on solutions and change its culture in order to acknowledge the central role that software plays in today’s research landscape.