Thursday, April 19, 2018

10 simple rules for making research software more robust

Scientific software is often developed by a single person, usually a graduate student or a postdoc. The code might run just fine on their own computer, but what if someone else wants to run it? More often than not, scientific code is poorly documented, works in unexpected ways (or not at all), relies on nonexistent paths or resources, or simply fails to reproduce what was published in the paper. To avoid many of the common problems associated with scientific code, Morgan Taschuk from the Ontario Institute for Cancer Research (Toronto, Ontario, Canada) and Greg Wilson from the Software Carpentry Foundation (Austin, TX) have come up with a list of ten simple rules.


The best practices listed below are extracted from the following paper:

  • M. Taschuk, G. Wilson (2017). Ten simple rules for making research software more robust. PLOS Computational Biology 13(4): e1005412, doi:10.1371/journal.pcbi.1005412.

Rule 1: Use version control

  • Put everything that you write or make into version control as soon as it is created. Common choices include Git and Subversion. If you're new to version control, it is simplest to treat it as a "better Dropbox" and to use it simply to synchronize files between multiple developers and machines.
  • Use a feature branch workflow. A feature branch workflow designates one parallel copy (or "branch") of the repository as the master. You then create a new branch from it each time you want to fix a bug or add a new feature. This allows work on independent changes to proceed in isolation; once the work has been completed and tested, it can be merged into the master branch for release.

Rule 2: Document your code and usage

  • Write a good README file. The README is usually available even before the software is installed, exists to get a new user started, and points them towards more help (see some good examples).
  • Print usage information. The program should also print usage information when launched from the command line. Usage provides the first line of help for both new and experienced users. Terseness is important: usage that extends for multiple screens is difficult to read or refer to on the fly.
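
As a minimal sketch of terse usage output, Python's standard argparse module generates a usage line and a --help screen from the argument definitions. The program name "analyze" and its two arguments are hypothetical, chosen only for illustration.

```python
import argparse


def build_parser():
    """Build a parser whose generated help fits on a single screen."""
    parser = argparse.ArgumentParser(
        prog="analyze",  # hypothetical program name
        description="Summarize the measurements in a data file.",
    )
    # One short help string per argument keeps the usage terse.
    parser.add_argument("infile", help="path to the input data file")
    parser.add_argument(
        "-o", "--outdir", default=".",
        help="directory for results (default: current directory)",
    )
    return parser


def main(argv=None):
    """Parse the command line; argparse prints usage on -h or on errors."""
    args = build_parser().parse_args(argv)
    print(f"reading {args.infile}, writing to {args.outdir}")
```

Calling main(["-h"]) prints the generated help and exits, and calling it without the required infile prints the one-line usage together with an error message.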

Rule 3: Make common operations easy to control

  • Allow the most commonly changed parameters to be configured from the command line. Being able to change parameters on the fly to determine if and how they change the results is important as your software gains more users since it facilitates exploratory analysis and parameter sweeping.
  • Check that all input values are in a reasonable range at startup. Few things are as annoying as having a program announce after running for two hours that it isn’t going to save its results because the requested directory doesn’t exist.
  • Choose reasonable defaults where they exist. You can set reasonable default values as long as any command line arguments override those values.
  • Set no defaults at all when there aren't any reasonable ones.
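
The four points above can be sketched with argparse as follows. This is only one possible arrangement: the option names --threshold and --outdir, the default of 0.05, and the accepted range are assumptions made for the example, not values from the paper.

```python
import argparse
from pathlib import Path


def parse_and_validate(argv):
    """Parse options and reject bad values before any real work starts."""
    parser = argparse.ArgumentParser(prog="analyze")  # hypothetical name
    # A commonly changed parameter with a reasonable default.
    parser.add_argument(
        "--threshold", type=float, default=0.05,
        help="significance threshold (default: 0.05)",
    )
    # No sensible default exists for the output location, so require it.
    parser.add_argument(
        "--outdir", type=Path, required=True,
        help="existing directory for results",
    )
    args = parser.parse_args(argv)

    # Range and existence checks happen at startup, not after two hours.
    if not 0.0 < args.threshold < 1.0:
        parser.error(f"--threshold must be between 0 and 1, got {args.threshold}")
    if not args.outdir.is_dir():
        parser.error(f"--outdir {args.outdir} does not exist")
    return args
```

parser.error prints the usage line plus the message and exits with status 2, so a mistyped directory is reported immediately.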

Rule 4: Version your releases

  • Increment your version number every time you release your software to other people. Semantic versioning is one of the most common types of versioning for open-source software. Version numbers take the form of “MAJOR.MINOR[.PATCH],” e.g., 0.2.6. Changes in the major version number herald significant changes in the software that are not backwards compatible, such as changing or removing features or altering the primary functions of the software. Increasing the minor version represents incremental improvements in the software, like adding new features. The minor version number can be followed by an arbitrary number of project-specific identifiers, including patches, builds, and qualifiers (e.g., alpha, beta, or -RC for release candidates).
  • Make the version of your software easily available by supplying --version or -v on the command line.
  • Include the version number in all of the program's output.
  • Ensure that old released versions continue to be available. Common options include crafting an official release on GitHub, or uploading your code to apt, yum, homebrew, or PyPI.
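
A minimal sketch of the two version-reporting points, assuming a Python program that uses argparse. The program name "analyze" and the output_header helper are hypothetical; the number 0.2.6 is the example from the text above.

```python
import argparse

__version__ = "0.2.6"  # single place where the release number is recorded


def build_parser():
    """Expose the version through -v and --version."""
    parser = argparse.ArgumentParser(prog="analyze")  # hypothetical name
    # The built-in "version" action prints the string and exits with status 0.
    parser.add_argument(
        "-v", "--version", action="version",
        version=f"%(prog)s {__version__}",
    )
    return parser


def output_header(params):
    """Return a comment line recording the version and settings of a run."""
    settings = " ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"# analyze {__version__} {settings}"
```

Writing output_header(...) as the first line of every results file ties each result to the release that produced it.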

Rule 5: Reuse software (within reason)

  • Make sure that you really need the auxiliary program. Dependencies bring their own special pain: all too often, support requests descend into debugging errors produced by the other project due to incompatible libraries, versions, or operating systems.
  • Ensure the appropriate software and version are available. Either allow the user to configure the exact path to the package, distribute the program with the dependent software, or download it during installation using your package manager.
  • Ensure that reused software is robust. Relying on erratic third party libraries or software is a recipe for tears. Prefer software that follows good software development practices, is open for support questions, and is available from a stable location or repository using your package manager.
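
One way to let the user configure the exact path to an auxiliary program, sketched under the assumption that the dependency is an external executable. The find_tool helper is hypothetical; shutil.which is the standard-library lookup on PATH.

```python
import shutil
from pathlib import Path


def find_tool(name, configured_path=None):
    """Return the path to an external program or fail with a clear message."""
    if configured_path is not None:
        # An explicit user setting always wins over a PATH search.
        path = Path(configured_path)
        if not path.is_file():
            raise FileNotFoundError(f"{name}: configured path {path} does not exist")
        return str(path)
    found = shutil.which(name)  # None when the program is not on PATH
    if found is None:
        raise FileNotFoundError(
            f"{name} not found on PATH; install it or pass its location explicitly"
        )
    return found
```

Running this check at startup turns a missing dependency into an immediate, readable error instead of a failure halfway through an analysis.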

Rule 6: Rely on build tools and package managers for installation

  • Document all dependencies in a machine-readable form. For example, it is common for Python projects to include a file called requirements.txt that lists the names of required libraries, along with version ranges:
    requests>=2.0
    pygithub>=1.26,<=1.27
    python-social-auth>=0.2.19,<0.3
    
  • Avoid depending on scripts and tools which are not available as packages. In many cases, a program’s author may not realize that some tool was built locally and doesn’t exist elsewhere. At present, the only sure way to discover such unknown dependencies is to install on a system administered by someone else and see what breaks. A good way to test this is to install your package from scratch on a clean or virtual machine (e.g., via Amazon's Elastic Compute Cloud), or in a Docker container.

Rule 7: Do not require root or other special privileges to install or run

  • Do not require root privileges to set up or use packages. Scientific software packages may not intentionally be malware, but one small bug or over-eager file-matching expression can certainly make them behave as if they were.
  • Allow packages to be installed in an arbitrary location. This goes hand in hand with avoiding root privileges, since some users may want to install the package in their home directory (e.g., ~/packagename).
  • Ask another person to try and build your software before releasing it.

Rule 8: Eliminate hard-coded paths

  • Set the names and locations of input and output files as command-line parameters. If your package is installed on a cluster, for example, the user’s data will almost certainly not be in the same directory as the software, and the folder C:\users\yourname\ will probably not even exist.
  • Do not require users to navigate to a particular directory to do their work.
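
As a small illustration, the output location can be derived entirely from user-supplied values with pathlib, so that nothing depends on where the software is installed. The output_path helper and the ".summary.txt" suffix are hypothetical.

```python
from pathlib import Path


def output_path(infile, outdir, suffix=".summary.txt"):
    """Build the result file path from user-supplied locations only."""
    # Expanding "~" and resolving to absolute paths up front means later
    # steps behave the same no matter which directory the user started in.
    infile = Path(infile).expanduser().resolve()
    outdir = Path(outdir).expanduser().resolve()
    return outdir / (infile.stem + suffix)
```

Here infile and outdir would come from command-line parameters; for an input named run1.csv the result file is run1.summary.txt inside the chosen output directory.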

Rule 9: Include a small test set that can be run to ensure the software is actually working

  • Make the tests easy to find and run. Many build systems will also run unit tests if they are provided at compile time. For users, or if the build system is not amenable to testing, provide a working script in the project’s root directory named runtests.sh or something equally obvious. This lets new users build their analysis from a working script.
  • Make the test script's output easy to interpret. Screens full of correlation coefficients do not qualify: instead, the script’s output should be simple to understand for nonexperts, such as one line per test, with the test’s name and its pass/fail status, followed by a single summary line saying how many tests were run and how many passed or failed.
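
The output format described above (one line per test, then a single summary line) can be sketched in a few lines of Python. The run_tests helper and the two example checks are hypothetical; a real project might rely on an existing test framework instead.

```python
def run_tests(tests):
    """Run (name, function) pairs; return the report lines and failure count."""
    lines, failed = [], 0
    for name, func in tests:
        try:
            func()
            lines.append(f"{name}: pass")
        except Exception as err:  # any exception counts as a failed test
            failed += 1
            lines.append(f"{name}: FAIL ({err})")
    passed = len(tests) - failed
    lines.append(f"{len(tests)} tests run, {passed} passed, {failed} failed")
    return lines, failed


def check_mean():
    assert sum([1, 2, 3]) / 3 == 2


def check_deliberately_broken():
    raise ValueError("expected 1, got 2")
```

Printing the lines returned by run_tests([("mean", check_mean), ("broken", check_deliberately_broken)]) gives two status lines followed by the summary "2 tests run, 1 passed, 1 failed".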

Rule 10: Produce identical results when given identical inputs

  • Echo all parameters and software versions to standard out or a log file alongside the results. The usage message tells users what the program could do. It is equally important for the program to tell users what it actually did.
  • Produce the same results each time the same version of the program is run with the same inputs.
  • Allow the user to optionally provide the random seed as an input parameter. Many applications rely on randomized algorithms to improve performance or runtimes. As a consequence, results can change between runs, even when provided with the same data and parameters. However, most programs use a pseudo-random number generator, which uses a starting seed and an equation to approximate random numbers. Setting the seed to a consistent value removes this variation between runs, rendering the program deterministic for those cases where it matters.
  • Make sure acceptable tolerances are known and detailed in documentation and tests.
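
A minimal sketch of these points, using a Monte Carlo estimate of pi as a stand-in for a randomized analysis. The function names and the tolerance mentioned below are assumptions made for the example.

```python
import math
import random


def estimate_pi(n_samples, seed=None):
    """Monte Carlo estimate of pi; a fixed seed makes the result repeatable."""
    # A private generator avoids interference from other code that uses
    # the module-level random functions.
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples


def report(n_samples, seed):
    """Echo the parameters next to the result so the run can be repeated."""
    value = estimate_pi(n_samples, seed)
    return f"n_samples={n_samples} seed={seed} estimate={value}"
```

Two calls with the same seed return exactly the same value. A test against the true value should compare within a documented tolerance, for example math.isclose(estimate_pi(100000, seed=1), math.pi, abs_tol=0.05), rather than test for equality.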
