Coding error postmortem

August 10, 2020

By Russ Poldrack, McKenzie Hagen, and Patrick Bissett

When it comes to computer programming, errors are simply a fact of life.  One study of professional software development teams from a number of organizations of varying size found that developers in the best managed organizations had an average of 1.05 defects per 1000 lines of code, whereas the worst managed organizations averaged 7.5 defects per 1000 lines; other estimates place the number much higher, at 15 to 50 errors per 1000 lines.  It’s hard to imagine that error rates will be any lower when the code is written by scientists with minimal training in software engineering.  I have previously discussed an example of a small coding error having large effects on results, and the case of Geoffrey Chang, who had to retract several papers (including three Science papers) due to an error in a homegrown software package, provides a striking example of just how much impact an error can have.  By talking about errors openly, we normalize them and help move from a position of defensiveness to one of learning and process improvement.  In this spirit, below we discuss a recent error in one of our lab’s projects.

We had posted a preprint describing some issues that we had identified with the stop-signal task in the ABCD Study, along with the code used for all of the analyses.  The ABCD stop-signal team performed a detailed review of our code and notified us of an error that resulted in inaccurate estimation of one of the basic behavioral measures on the task (subsequently described in their response to our preprint).  The code in question converted the Byzantine E-Prime output files containing the raw data into a format that was more easily usable for our analyses.  In particular, because E-Prime spreads information across various columns, different trial types required combining multiple columns in different ways.  The code used an overly complex indexing scheme (in particular, a double negative [~ isnull()] rather than the more intuitive positive form [notnull()]), which made the Boolean logic difficult to parse and thus made the error harder to spot.  In addition, the size and complexity of the input data made it difficult to perform the visual spot checks that might otherwise have identified the issue.
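To illustrate the readability problem (using hypothetical column names rather than the actual E-Prime fields), compare a double-negative selection with its positive equivalent:

    # Hypothetical columns for illustration only; not the actual E-Prime field names.
    import pandas as pd

    df = pd.DataFrame({
        "stop_signal_delay": [None, 0.25, None, 0.30],  # present only on stop trials
        "go_response": ["left", None, "right", None],
    })

    # Double negative: the reader has to mentally invert "not null" to recover the intent.
    stop_trials = df[~df["stop_signal_delay"].isnull()]

    # Positive form: the intent ("rows that have a stop-signal delay") is immediate.
    stop_trials = df[df["stop_signal_delay"].notnull()]

Both selections return the same rows; the difference is purely in how easily a reviewer can check that the Boolean logic matches the intent.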

In hindsight, a relatively simple assertion would have identified the problem:

assert df[response_column][trial_type_index].isnull().all()

The trial_type_index referenced above selects a trial type that should not have any values in a specific column that records responses, and an assertion that all of the values in that column are null would have raised a red flag that trials of a different trial type were also being selected by this index.  The toy example below shows how such a check fails loudly when the index is too broad.
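Here is a purely illustrative sketch, with made-up column names and a deliberately broad index:

    import pandas as pd

    # Toy data: successful stop trials should have no recorded response.
    df = pd.DataFrame({
        "trial_type": ["go", "stop_success", "go", "stop_fail"],
        "go_response": ["left", None, "right", "left"],
    })

    correct_index = df["trial_type"] == "stop_success"
    # Buggy selector: "everything that is not a failed stop trial" also sweeps in go trials.
    buggy_index = df["trial_type"] != "stop_fail"

    # Passes: the correctly selected trials carry no responses.
    assert df["go_response"][correct_index].isnull().all()

    # Fails loudly for the buggy index, which is exactly what we want.
    try:
        assert df["go_response"][buggy_index].isnull().all(), \
            "selected trials include responses; the trial-type index is too broad"
    except AssertionError as err:
        print(f"Caught the problem: {err}")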

Once the error in our code had been confirmed, we reanalyzed the data and found that one of the results in the preprint had changed.  Based on this, we uploaded a revised version of our preprint to bioRxiv and notified the action editor at the journal where the paper was under review about the error and the revised manuscript.

We also undertook an internal process to try to learn from the error.  We patterned this after the morbidity and mortality conferences that are a standard practice at major medical centers, in which physicians discuss problematic outcomes in a confidential setting in order to understand what went wrong and how it might have been prevented.  We spent part of our weekly lab meeting discussing in detail exactly how the error had come about, and what we might have done to prevent it.  No one likes talking about errors, but our discussion was focused on process improvement rather than blame, which made it easier for everyone to talk about it.  In this particular case, we identified two global issues that likely contributed to the error:

  1. The individual who had reviewed the analysis code for the project had focused on one script that implemented the majority of the analyses and did not review the preprocessing script, which contained the error.
  2. A more general issue that we flagged was a speed/accuracy tradeoff: because we were pushing to share the work quickly (it suggested solutions to design issues in a large, ongoing longitudinal study, and so was highly time sensitive), our checking was likely less systematic than it would have been had time not been of the essence.

There is also always the possibility of unintentional “bug-hacking” — that is, the degree to which bugs are more likely to be found when they contradict our hypothesis.  In this case, the coding error inflated a value in a way that was in line with our expectations.

We learned a couple of important lessons from this experience:

  1. Time pressure is ubiquitous in science, but it is also pernicious.  The speed/accuracy tradeoff is a fundamental feature of human behavior, whether in experimental tasks or in software development.  In the future we will apply extra quality control in cases where speed is essential (e.g. bringing in one of our team’s expert software developers for an external review).
  2. It’s essential to have a full description of the entire workflow from the primary data to the final results.  This can be especially challenging when different parts of the workflow are run on different computer systems, as is common in big-data settings.  The provenance of any intermediate files needs to be crystal clear; it’s common for data to be treated as “raw” even after they have gone through some preprocessing (e.g. reformatting into a more usable layout), and these preprocessing steps need to be tracked and reviewed just as the analysis code is.  One lightweight way to record such provenance is sketched below.
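As an illustration of the kind of tracking we have in mind (a minimal sketch, not the tooling we actually use; the function name and sidecar format are made up), a preprocessing script can write a small JSON sidecar next to each derived file recording the inputs, the code version, and when the file was produced:

    import hashlib
    import json
    import subprocess
    import sys
    from datetime import datetime, timezone
    from pathlib import Path

    def write_provenance(output_file, input_files):
        """Record how output_file was produced, in a JSON sidecar next to it."""
        record = {
            "output": str(output_file),
            # Checksums of the inputs, so any change to them is detectable later.
            "inputs": {
                str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
                for path in input_files
            },
            # Which script produced the file, and at which commit.
            "script": sys.argv[0],
            "git_commit": subprocess.run(
                ["git", "rev-parse", "HEAD"], capture_output=True, text=True
            ).stdout.strip(),
            "created": datetime.now(timezone.utc).isoformat(),
        }
        sidecar = Path(str(output_file) + ".provenance.json")
        sidecar.write_text(json.dumps(record, indent=2))

A sidecar like this makes it obvious at review time that a file labeled “raw” has in fact already been through a preprocessing step, and exactly which code produced it.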

We have also started a couple of new practices in the lab to help reduce the likelihood of errors in the future.  First, we have started holding lab-wide code review sessions as part of our regular lab meeting.  In these sessions, we walk through code written by a member of the lab in order to identify errors and also talk about ways in which the code could be improved.  There are several books on writing better code that some of us have found very helpful, and that others may find useful as well.

Second, we are working to increase the amount of testing that is applied to our code.  This started with a tutorial on testing with pytest in our lab meeting, and will likely continue with more advanced topics in the future; a toy example of the kind of test we have in mind is sketched at the end of this post.

These practices will not prevent us from making errors (bug-free code is practically impossible), but they will hopefully reduce their prevalence and their impact on our work.  We also hope that others will be willing to openly discuss their errors in order to normalize this discussion and provide insights that could be helpful to others in detecting and preventing errors.
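For concreteness, here is the flavor of test we mean.  This is a toy example: extract_stop_trials is a hypothetical helper, not a function from our actual codebase.

    # test_preprocessing.py -- run with: pytest test_preprocessing.py
    import pandas as pd

    def extract_stop_trials(df):
        """Return only the successful stop trials from a trial-level data frame."""
        return df[df["trial_type"] == "stop_success"]

    def test_stop_trials_have_no_response():
        df = pd.DataFrame({
            "trial_type": ["go", "stop_success", "stop_fail"],
            "go_response": ["left", None, "right"],
        })
        stop_trials = extract_stop_trials(df)
        # Exactly one successful stop trial in the toy data, and it has no response.
        assert len(stop_trials) == 1
        assert stop_trials["go_response"].isnull().all()

Even a handful of small checks like this, run automatically, can catch this kind of indexing problem long before results reach a preprint.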
