Sunday, January 8, 2017

8 best practices to improve your scientific software

Nowadays scientists find themselves spending more and more time building software to support their research. Although time spent programming is often perceived first and foremost as time spent not doing research, most scientists have never been taught how to efficiently write software that is both correct and reusable. That's why the people behind Software Carpentry have come up with a list of best practices to help you improve your scientific code. Because after all, to quote Ralph Johnson, before software can be reusable, it first has to be usable.


The following is a list of best practices extracted from a PLOS Biology paper (which just got published yesterday):

  • G. Wilson, D.A. Aruliah, C.T. Brown, N.P.C. Hong, M. Davis, R.T. Guy, S.H.D. Haddock, K.D. Huff, I.M. Mitchell, M.D. Plumbley, B. Waugh, E.P. White, P. Wilson (2017). Best Practices for Scientific Computing. PLOS Biology 12(1): e1001745. doi:10.1371/journal.pbio.1001745.

1. Write programs for people, not computers

  • 1.1 A program should not require its readers to hold more than a handful of facts in memory at once. Human working memory can hold only a handful of items at a time, where each item is either a single fact or a "chunk" aggregating several facts, so programs should limit the total number of items to be remembered to accomplish a task. For example, a function to calculate the area of a rectangle can be written to take four separate coordinates:
    def rect_area(x1, y1, x2, y2):
       ...
    
    or to take two points:
    def rect_area(point1, point2):
       ...
    
    The latter is not only easier to read, but also much less error-prone. For example, in the first formulation it is easy to confuse the order of arguments and accidentally call the function using surface = rect_area(x1, x2, y1, y2).
  • 1.2 Scientists should make names consistent, distinctive, and meaningful. This should go without saying. For example, using non-descriptive names (e.g., a, foo, etc.), or names that are very similar (e.g., result, result2, etc.) is likely to cause confusion.
  • 1.3 Scientists should make code style and formatting consistent. Same story here. If different parts of a scientific paper used different formatting and capitalization, it would make that paper more difficult to read. Analogously, code that mixes CamelCaseNaming and pothole_case_naming takes longer to read and is more likely to cause readers to make mistakes.
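
To make point 1.1 a bit more concrete, here is a minimal sketch of the two-point version of rect_area. The Point type and its field names are my own assumptions for illustration; the paper only gives the function signature.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """A 2-D point. The name and fields are illustrative assumptions."""
    x: float
    y: float


def rect_area(point1: Point, point2: Point) -> float:
    """Return the area of the axis-aligned rectangle spanned by two opposite corners."""
    width = abs(point2.x - point1.x)
    height = abs(point2.y - point1.y)
    return width * height


lower_left = Point(x=0.0, y=0.0)
upper_right = Point(x=3.0, y=2.0)
surface = rect_area(lower_left, upper_right)
print(surface)  # 6.0
```

Because each corner travels as one named unit, the reader only has to keep track of two things instead of four, and swapping the two corners does not change the result.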

2. Let the computer do the work

  • 2.1 Let the computer repeat tasks. Science often involves repetition of computational tasks such as processing large numbers of data files in the same way or regenerating figures each time new data is added to an existing analysis. Computers were invented to do these kinds of repetitive tasks, so make use of them. Don't type the same command over and over; write a little script to do it for you. This will save you time and avoid mistakes on the 437th occasion, when you start to lose focus on what you're typing.
  • 2.2 Save recent commands in a file for re-use. For example, most command-line tools have a “history” option that lets users display and re-execute recent commands, with minor edits to filenames or parameters.
  • 2.3 Use a build tool to automate workflows. For example, use a Makefile to compile your code. Build files like this let you express dependencies between files, i.e., to say that if A or B has changed, then C needs to be updated using a specific set of commands.
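
As a rough illustration of point 2.1, a small script along these lines can apply the same step to every data file in a directory. The function names, the "*.csv" pattern, and the line-counting "analysis" are all placeholders I made up; swap in whatever your real processing step is.

```python
from pathlib import Path
from typing import Dict


def count_lines(path: Path) -> int:
    """Stand-in for a real analysis step: count the lines in one data file."""
    with path.open() as handle:
        return sum(1 for _ in handle)


def process_all(data_dir: Path, pattern: str = "*.csv") -> Dict[str, int]:
    """Apply the same step to every matching file, in a stable (sorted) order."""
    return {
        path.name: count_lines(path)
        for path in sorted(data_dir.glob(pattern))
    }
```

Calling process_all(Path("data")) then handles the 437th file exactly like the first one, which is the whole point of letting the computer do the repetition.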

3. Make incremental changes

  • 3.1 Work in small steps with frequent feedback and course correction. Rather than planning months and years ahead, work in steps that are sized to be about an hour long, and group these steps in iterations that last roughly a week. This accommodates the cognitive constraints discussed above, and acknowledges the reality that real-world requirements are constantly changing. The goal is to produce working (if incomplete) code after each iteration. While these practices have been around for decades, they gained prominence starting in the late 1990s under the banner of agile development.
  • 3.2 Use a version control system. A version control system stores snapshots of a project's files in a repository (or a set of repositories). Programmers can modify their working copy of the project at will, then commit changes to the repository when they are satisfied with the results to share them with colleagues. Crucially, if several people have edited files simultaneously, the version control system highlights the differences and requires them to resolve any conflicts before accepting the changes. Many good systems are open source and freely available, including Subversion, Git, and Mercurial. Many free hosting services are available as well (e.g., SourceForge, GitHub, and BitBucket). Chances are the best one to use is the one that your colleagues are using already.
  • 3.3 Put everything that has been created manually under version control. This should include programs, original field observations, and the source files for papers. Automated output and intermediate files can be regenerated at need. Binary files (e.g., images and audio clips) may be stored in version control, but it is often more sensible to use an archiving system for them, and store the metadata describing their contents in version control instead.

4. Don't repeat yourself (or others)

  • 4.1 Every piece of data must have a single authoritative representation in the system. Anything that is repeated in two or more places is more difficult to maintain. Every time a change or correction is made, multiple locations must be updated. To avoid this, programmers follow the DRY principle ("don't repeat yourself"), which applies to both data and code. For example, physical constants ought to be defined exactly once to ensure that the entire program is using the same value; raw data files should have a single canonical version; every geographic location from which data has been collected should be given an ID that can be used to look up its latitude and longitude; and so on.
  • 4.2 Modularize code rather than copying and pasting. Avoiding "code clones" has been shown to reduce error rates: when a change is made or a bug is fixed, that change or fix takes effect everywhere, and people’s mental model of the program (i.e., their belief that "this one’s been fixed") remains accurate.
  • 4.3 Re-use code instead of rewriting it. Tens of millions of lines of high-quality open source software are freely available on the web, and at least as much is available commercially. It is typically better to find an established library or package that solves a problem than to attempt to write one’s own routines for well established problems (e.g., numerical integration, matrix inversions, etc.).
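
Here is one way the DRY ideas above might look in practice. Treat it as a sketch: the site IDs and coordinates are made-up example values, and the helper names are my own. The constant is defined once, locations are looked up by ID, and the averaging is delegated to the standard library's statistics module instead of being rewritten by hand.

```python
import statistics

# Defined exactly once; every calculation refers to this name.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by SI definition

# Hypothetical site table: each ID maps to (latitude, longitude).
SITES = {
    "S01": (52.37, 4.90),
    "S02": (48.86, 2.35),
}


def site_coordinates(site_id):
    """Look up a site's (latitude, longitude) from the single site table."""
    return SITES[site_id]


def light_travel_time_s(distance_m):
    """Time in seconds for light to cover distance_m metres in vacuum."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S


def mean_latitude(site_ids):
    """Average latitude of the given sites, reusing statistics.mean."""
    return statistics.mean(site_coordinates(s)[0] for s in site_ids)
```

If a site's coordinates ever need correcting, there is exactly one place to change, and every function that uses the ID picks up the fix.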

5. Plan for mistakes

  • 5.1 Add assertions to programs to check their operation. An assertion is simply a statement that something holds true at a particular point in a program. For example, assert that your input arguments have the expected data type. These assertions serve two purposes. First, they ensure that if something does go wrong, the program will halt immediately, which simplifies debugging. Second, assertions are executable documentation, i.e., they explain the program as well as checking its behavior. This makes them more useful in many cases than comments since the reader can be sure that they are accurate and up to date.
  • 5.2 Use an off-the-shelf unit testing library. These libraries are available for all major programming languages, and allow you to initialize inputs, run tests, and report their results in a uniform way. Tests check to see whether the code matches the researcher’s expectations of its behavior. For example, in scientific computing, tests are often conducted by comparing output to simplified cases, experimental data, or the results of earlier programs that are trusted. Automated tests can check to make sure that a single unit of code is returning correct results (unit tests), that pieces of code work correctly when combined (integration tests), and that the behavior of a program doesn’t change when the details are modified (regression tests).
  • 5.3 Turn bugs into test cases. Another approach for generating tests is to write tests that trigger a bug that has been found in the code and (once fixed) will prevent the bug from reappearing unnoticed.
  • 5.4 Use a symbolic debugger. A better name for this kind of tool would be "interactive program inspector" since a debugger allows users to pause a program at any line (or when some condition is true), inspect the values of variables, and walk up and down active function calls to figure out why things are behaving the way they are. Debuggers are usually more productive than adding and removing print statements or scrolling through hundreds of lines of log output [69], because they allow the user to see exactly how the code is executing rather than just snapshots of the state of the program at a few moments in time.
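
Points 5.1 to 5.3 can be sketched together with Python's built-in unittest library. The mean function and the "empty input" bug are invented for the sake of the example; the pattern is what matters: assertions guard the inputs, one test checks a simple known case, and another pins down a bug so it cannot come back unnoticed.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # Assertions double as executable documentation of what we expect.
    assert len(values) > 0, "values must not be empty"
    assert all(isinstance(v, (int, float)) for v in values), "values must be numeric"
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_simple_case(self):
        # Compare against a case simple enough to verify by hand.
        self.assertAlmostEqual(mean([1, 2, 3]), 2.0)

    def test_empty_input_is_rejected(self):
        # Hypothetical past bug turned into a test: empty input must
        # fail loudly at the assertion, not somewhere further down.
        with self.assertRaises(AssertionError):
            mean([])
```

Saved in a file, these tests can be run with python -m unittest. One caveat worth knowing: Python skips assert statements when started with the -O flag, so assertions are a development aid rather than a substitute for proper input validation.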

6. Optimize software only after it works correctly

  • 6.1 Use a profiler to identify bottlenecks. Today’s computers and software are so complex that even experts find it hard to predict which parts of any particular program will be performance bottlenecks. A profiler can help you identify bottlenecks quickly and accurately. This allows you to focus on optimizing the right code pieces (i.e., the ones that actually need speeding up).
  • 6.2 Write code in the highest-level language possible. Research has confirmed that most programmers write roughly the same number of lines of code per unit time regardless of the language they use. Since faster, lower-level languages require more lines of code to accomplish the same task, scientists are most productive when they write code in the highest-level language possible, and shift to low-level languages like C and Fortran only when they are sure the performance boost is needed.
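
For point 6.1, Python ships with a profiler in its standard library (cProfile, with pstats for reporting). The snippet below is only a sketch: slow_sum_of_squares is a toy workload and profile_call is a helper name I made up.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Toy workload standing in for a real computation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report lists where the time actually went, so you can check your guess about the bottleneck before rewriting anything.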

7. Document design and purpose, not mechanics

  • 7.1 Document interfaces and reasons, not implementations. For example, a clear description like this at the beginning of a function that describes what it does and its inputs and outputs is useful:
    def scan(op, values):
       """Apply a binary operator cumulatively to the
          values given from lowest to highest,
          returning a list of results.
          For example, if `op` is 'add' and `values`
          is [1, 3, 5], the result is [1, 4, 9] (i.e.,
          the running total of the given values).
       """
       ...
    
    In contrast, the comment in the code fragment below does nothing to aid comprehension:
    i = i + 1  # Increment the variable `i` by one
    
  • 7.2 Refactor code in preference to explaining how it works. Rather than write a paragraph to explain a complex piece of code, reorganize the code itself so that it doesn’t need such an explanation. This may not always be possible—some pieces of code simply are intrinsically difficult—but the onus should always be on the author to convince his or her peers of that.
  • 7.3 Embed the documentation for a piece of software in that software. Doing this increases the probability that when programmers change the code, they will update the documentation at the same time. Embedded documentation usually takes the form of specially-formatted and placed comments. Typically, a documentation generator such as Javadoc, Doxygen, or Sphinx extracts these comments and generates well-formatted web pages and other human-friendly documents.
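
To show what embedded documentation buys you, here is one possible body for the scan function from 7.1. This is my own sketch, not the paper's implementation: I assume op is a two-argument callable such as operator.add, and that values are processed in the order given.

```python
import inspect
import operator
from itertools import accumulate


def scan(op, values):
    """Apply a binary operator cumulatively to the values given,
    from first to last, returning a list of results.

    For example, if `op` is `operator.add` and `values` is [1, 3, 5],
    the result is [1, 4, 9] (i.e., the running total of the values).
    """
    return list(accumulate(values, op))


# The docstring lives with the code, so tools can pull it out directly.
print(inspect.getdoc(scan))
print(scan(operator.add, [1, 3, 5]))  # [1, 4, 9]
```

The same docstring is what help(scan) displays and what documentation generators such as Sphinx pick up, so there is only one description to keep up to date.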

8. Collaborate

  • 8.1 Use pre-merge code reviews. In projects with shifting membership, such as most academic labs, code reviews help ensure that critical knowledge isn’t lost when a student or postdoc leaves the lab. Code can be reviewed either before or after it has been committed to a shared version control repository, although before is preferred. Experience shows that if reviews don’t have to be done in order to get code into the repository, they will soon not be done at all.
  • 8.2 Use pair programming when bringing someone new up to speed and when tackling particularly tricky problems. An extreme form of code review is pair programming, in which two developers sit together while writing code. One (the driver) actually writes the code; the other (the navigator) provides real-time feedback and is free to track larger issues of design and consistency. Several studies have found that pair programming improves productivity [64], but many programmers find it intrusive. It is therefore recommended that teams use pair programming when bringing someone new up to speed and when tackling particularly tricky problems.
  • 8.3 Use an issue tracking tool. Once a team grows beyond a certain size, it becomes difficult to keep track of what needs to be reviewed, or of who’s doing what. Teams can avoid a lot of duplicated effort and dropped balls if they use an issue tracking tool to maintain a list of tasks to be performed and bugs to be fixed. Free repository hosting services like GitHub include issue tracking tools, and many good standalone tools exist as well, such as Trac.

Research suggests that the time cost of implementing these kinds of tools and approaches in scientific computing is almost immediately offset by the gains in productivity of the programmers involved. How to implement the recommended practices can be learned from many excellent tutorials available online or through workshops and classes organized by groups like Software Carpentry.

Source:

  • G. Wilson, D.A. Aruliah, C.T. Brown, N.P.C. Hong, M. Davis, R.T. Guy, S.H.D. Haddock, K.D. Huff, I.M. Mitchell, M.D. Plumbley, B. Waugh, E.P. White, P. Wilson (2017). Best Practices for Scientific Computing. PLOS Biology 12(1): e1001745. doi:10.1371/journal.pbio.1001745.