I do not think I have any particular authority when it comes to “best practices” in scientific computational research. Rather, after doing this for several years (and learning the consequences of violating essentially every suggestion below), I think I have developed a set of “reasonable practices” that I feel comfortable suggesting. These are designed with two goals in mind. First, I value reproducibility of scientific research, and I think we should strive to organize our research so that another scientist (or, in a recurring theme in this document, your future self) can meaningfully reproduce all of our results should they so desire. Second, we want to organize the code we write, and the way we organize the bits and pieces of our computational workflow, to minimize the amount of stress we have to deal with. Some questions (“Wait, what does this chunk of code do?” “Which version of the program was actually used to generate this data?” “What model parameters did I vary, and over what range?”) should just never come up, and a few extra minutes spent near the beginning of a project can save a lot of headache later.
Below I’ll talk briefly about version control before diving into a reasonable way of organizing your project structure. I’ll then discuss a few thoughts about reasonable practices when it comes to writing custom code (including, notably, the advice not to do it).
Version control
This should go without saying, but I of course think that the important parts of your project — the custom code you’ve written, the bash scripts that run different programs over a range of parameters, the way to turn data into plots — should be backed up and also version controlled. I won’t repeat all of the information I wrote about how git works and how to use it. I’ll just assume that you understand that the kind of project directory structure I suggest should probably be a git repository. Speaking of:
Organizing projects
I very frequently end up having project directories that look schematically like the following (I am, in fact, just directly copying the structure of the project I’m working on right now, changing some file names, and omitting a lot of files):
reactionTimeSymmetryBreaking/
|-- build/
|-- data/
| |-- README.md
| |-- processedData1.nc
| |-- whereIsTheRawData.txt
|-- doc/
| |-- (Doxygen stuff here)
|-- scripts/
| |-- README.md
| |-- primarySimulationLoop.sh
| |-- orderParameterTraceGeneration.sh
| |-- plotsAndVisualizationTools.py
|-- src/
| |-- (A whole lot of code)
| |-- analysisCode/
| | |-- (more code)
|-- README.md
|-- Installation.md
|-- CMakeLists.txt
|-- perturbationMeasurement.cpp
|-- primarySimulationAnalysis.cpp
|-- (other files... .gitignore, etc)
A few comments about each of the components of the structure above. First, the directory itself has a sensible name, allowing me to identify the purpose of the directory from the command line / a file explorer. The custom code for this project lives in the src/ directory. In this case I’m writing code in C++, with main simulation and analysis cpp files in the root directory which get compiled with the help of CMake into executables that live in the build directory. For even moderately substantial pieces of code (say, more than a few thousand lines of code) it is probably worthwhile to have some code documentation system in place. The doc/ directory is a good choice for this.
Next is a dedicated directory for the data associated with the project. You should always save the raw data, but these sometimes extremely large datasets do not always fit nicely into this tidy directory structure (and, for instance, if you’re hosting the repository on GitHub the files might just be too big). Processed and cleaned up data should probably live in the data/ directory, and if necessary I like to put a file there reminding me where the raw data lives. It might not seem like it when you’re deep in the weeds of a project, but when you get the reviews on a paper back a few months after you’ve submitted something, or when you have a new idea to use old data multiple years after you generated it, it is nice to have a reminder of where to look for everything.
Separate from the main chunks of code, I like to have a scripts/ directory that contains, e.g., any bash script I use to repeatedly run a simulation over parameter space, or a script that loops over a program that analyzes the data produced by that first script. This is also a useful place to store preliminary tools that you might use to plot the data or generate visualizations. If the scope of the project is extremely clear (that is, there is no way that the project corresponds to anything other than exactly one paper) I sometimes add a paper/ directory with all of the LaTeX, bibliographic information, and final figures. It is rarely so clear, and I almost always just have a separate directory (and a separate repository) for the paper.
Finally, notice how almost all of the directories have a README.md file. When navigating a repo on GitHub it is nice to have a human-readable summary of what is happening in each (sub)directory. Learning a little bit of GitHub Flavored Markdown can be worthwhile.
Alternative organizational systems
Are there other ways to set up your project? Obviously. For instance, here is a project structure suggested in this Quick guide to organizing a computational biology project:
baseProject/
|-- doc/
| |-- paperAnalysis.html
| |-- paper/
| | |-- manuscript.tex
|-- data/
| |-- YEAR-MONTH-DAY/
| | |-- yeast/
| | | |-- yeast.csv
| | |-- worm/
| | | |-- worm.csv
|-- src/
| |-- CMakeLists.txt
| |-- analysisCode.cpp
|-- bin/
| |-- analysisCode.out
| |-- parseData.py
|-- results/
| |-- YEAR-MONTH-DAY1/
| | |-- runEverything
| | |-- results.txt
| |-- YEAR-MONTH-DAY2/
| | |-- runEverything
| | |-- results.txt
It’s different, and for different kinds of computational research patterns it may be extremely reasonable, as might be the structure suggested in this git repo on “Reproducible Research”. The key thing is to think about the organization ahead of time, and think about how future you will want to understand the structure of your project, the way the data in it was generated, and how to find different bits and pieces you’ll want to look for.
Writing code
At some level, it is ridiculous for me to give advice on writing code for computational research. I have taken no Computer Science classes in my lifetime, and I only learned to write actual code when I was a postdoc. On the other hand, maybe that means that in the course of writing and maintaining a handful of open-source software packages (on the order of tens-but-not-hundreds of thousands of lines of code) I have made so many obvious mistakes that I have something useful to say. Below I outline a handful of key points.
Don’t write code
The first rule of writing code is not to do it. Seriously: do what you can to avoid writing code. Does a robust, well-maintained piece of software already exist that does what you want? Fantastic: go use it. We’re not here to re-invent the wheel, and unless you have some extremely clever ideas it’s unlikely that you’ll end up writing code that is better, or more robust, than the code written by dedicated scientific software teams.
The exception to this is when learning. I don’t want people to re-write efficient, MPI- and CUDA-accelerated molecular dynamics simulators when LAMMPS (and HOOMD, and…) already exist. But we don’t want to use those packages as black boxes. It can be useful to think about how the algorithms are designed, and to code up simple versions of them, so that you’re more comfortable interacting with these mature pieces of software.
Use the highest-level language possible
We don’t win any awards for writing code that sums the values in a short array in assembly. If you do have to write code, use the highest-level language possible. Even when I know that I’ll ultimately have to write something in CUDA/C++, I often code up a simple version first in Python (or Mathematica, only because I learned how to plot in Mathematica as an undergrad and didn’t encounter Python until I was a postdoc).
Premature optimization: please don’t
In some ways a corollary of the above, you should almost always focus on the correctness of your code, and then optimize only when necessary. This is particularly true, I think, of the sort of micro-optimizations — tinkering with the ordering of memory loads, or simplifying intuitive-but-long arithmetical statements into short-but-inscrutable ones — that are always so tempting. They make the code radically harder to read and maintain, and you often don’t even know if you are targeting a real (or important) bottleneck in your program. Starting with efficient choices of your algorithms is almost always much more important, and hence much preferred, compared to spending time trying to make some existing code go faster.
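As a rough illustration of "algorithm first", here is a minimal sketch that counts pairs of points on a line that are closer than some cutoff. The function names and the one-dimensional setting are assumptions made up for this example. The brute-force version is easy to verify by inspection, so it is worth keeping around as a correctness reference; the sorted version changes the algorithm (from quadratic to N log N cost) rather than shaving instructions off the inner loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference implementation: check every pair. O(N^2), but obviously correct.
std::size_t countClosePairsBruteForce(const std::vector<double> &positions, double cutoff) {
    assert(cutoff > 0.0);
    std::size_t closePairs = 0;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        for (std::size_t j = i + 1; j < positions.size(); ++j) {
            if (std::abs(positions[i] - positions[j]) < cutoff) {
                ++closePairs;
            }
        }
    }
    return closePairs;
}

// Better algorithm: sort a copy once, then slide a window over the sorted
// points. For each right-hand point, every point still in the window is
// within the cutoff of it. O(N log N) overall.
std::size_t countClosePairsSorted(std::vector<double> positions, double cutoff) {
    assert(cutoff > 0.0);
    std::sort(positions.begin(), positions.end());
    std::size_t closePairs = 0;
    std::size_t windowStart = 0;
    for (std::size_t windowEnd = 0; windowEnd < positions.size(); ++windowEnd) {
        // Drop points from the left that are too far from the current point.
        while (positions[windowEnd] - positions[windowStart] >= cutoff) {
            ++windowStart;
        }
        closePairs += windowEnd - windowStart;
    }
    return closePairs;
}
```

Comparing the two functions on small random inputs is a cheap way to check the faster one before trusting it in a real run.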
Code style: write programs for people
When you write code, please do not fall into the trap of writing it for the computer. If you’re writing, for instance, in C++, there will already be a compiler that takes your code and converts it to binary machine instructions. So, take the time to instead write your code for your fellow humans.
There are several ways to help accomplish this goal.
Naming conventions
Perhaps the first is to use reasonable names for everything in your program — the functions you will call, the variables you will declare, the classes you will create. Every serious code editor (yes, even relics of the past like vim) has easy-to-use autocompletion that you can access, so there is no reason to use inscrutable names just to try to save on keystrokes. What constitutes a reasonable name? Well-chosen names should, I think, save you from writing additional comments to document your code. For instance, consider this line:
double currentOrderParameter = computeFlockingOrderParameter(vicsekConfiguration);
The names — nouns for variables, verbs for functions — tell you everything you need to know about what this piece of code is trying to do.
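To make that concrete, here is a minimal sketch of what a function with that name might contain. Everything beyond the function name is an assumption for illustration: the Vector2D type, representing the configuration as a plain list of agent velocities, and the choice of the standard polar order parameter (the magnitude of the average heading) are not taken from any actual project code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical type for illustration: the velocity of one agent in 2D.
struct Vector2D {
    double x;
    double y;
};

// Polar order parameter of a flock: the magnitude of the average unit
// heading. It is 1 when every agent moves in the same direction and close
// to 0 when the headings are disordered.
double computeFlockingOrderParameter(const std::vector<Vector2D> &agentVelocities) {
    assert(!agentVelocities.empty());
    double summedHeadingX = 0.0;
    double summedHeadingY = 0.0;
    for (const Vector2D &velocity : agentVelocities) {
        double speed = std::hypot(velocity.x, velocity.y);
        assert(speed > 0.0); // a heading is undefined for a stationary agent
        summedHeadingX += velocity.x / speed;
        summedHeadingY += velocity.y / speed;
    }
    double numberOfAgents = static_cast<double>(agentVelocities.size());
    return std::hypot(summedHeadingX, summedHeadingY) / numberOfAgents;
}
```

Note that the body needs only two short comments: names like summedHeadingX and numberOfAgents carry most of the explanation.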
Code organization
A second useful consideration is to divide up your program into relatively short functions that each do a single thing. Rather than writing a monolithic chunk of code that, for instance, performs a velocity Verlet update in a simulation, write a set of smaller functions (computeForces(), updateVelocities(), updatePositions()) that you can call in sequence. Is the task of computing all of the forces in a molecular dynamics simulation itself a large function? Have that function itself call constructNeighborLists(), evaluatePairwiseDistances(), and calculatePairwiseForces(). This is not just a matter of style, but it actually makes your code easier to reason about. Readers of your code (which will most typically be your future self, but might include others) will typically only be able to hold a handful of facts in their memory at once. By breaking your code up into smaller, easily understood chunks that can be chained together, you can simplify the process of thinking through even large and complex code bases.
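Here is a minimal sketch of that decomposition. To keep it short I assume a toy system: unit-mass particles in one dimension, each tethered to the origin by a harmonic spring. The ParticleSystem struct, performVelocityVerletStep(), and computeTotalEnergy() are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy system: unit-mass particles on a line, each attached to
// the origin by a spring (an assumption chosen only to keep this short).
struct ParticleSystem {
    std::vector<double> positions;
    std::vector<double> velocities;
    std::vector<double> forces;
    double springConstant;
};

void computeForces(ParticleSystem &system) {
    for (std::size_t i = 0; i < system.positions.size(); ++i) {
        system.forces[i] = -system.springConstant * system.positions[i];
    }
}

// Kick: advance velocities using the currently stored forces (mass = 1).
void updateVelocities(ParticleSystem &system, double timeInterval) {
    for (std::size_t i = 0; i < system.velocities.size(); ++i) {
        system.velocities[i] += timeInterval * system.forces[i];
    }
}

// Drift: advance positions using the currently stored velocities.
void updatePositions(ParticleSystem &system, double timeInterval) {
    for (std::size_t i = 0; i < system.positions.size(); ++i) {
        system.positions[i] += timeInterval * system.velocities[i];
    }
}

// One velocity Verlet step: half kick, drift, new forces, half kick.
// Assumes system.forces already matches system.positions on entry.
void performVelocityVerletStep(ParticleSystem &system, double timeStep) {
    assert(system.positions.size() == system.velocities.size());
    assert(system.positions.size() == system.forces.size());
    updateVelocities(system, 0.5 * timeStep);
    updatePositions(system, timeStep);
    computeForces(system);
    updateVelocities(system, 0.5 * timeStep);
}

// Kinetic plus spring potential energy, handy as a sanity check.
double computeTotalEnergy(const ParticleSystem &system) {
    double totalEnergy = 0.0;
    for (std::size_t i = 0; i < system.positions.size(); ++i) {
        totalEnergy += 0.5 * system.velocities[i] * system.velocities[i];
        totalEnergy += 0.5 * system.springConstant * system.positions[i] * system.positions[i];
    }
    return totalEnergy;
}
```

The step function reads almost like the textbook description of the algorithm, and swapping in a more realistic computeForces() would not require touching the integrator at all.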
Code formatting
Finally, choose a consistent formatting style for your code. If different parts of your code use different indentation levels for the braces indicating control flow, or if names use a mix of lowerCamelCase (for some reason, my favorite) and whatever_this_style_is_called, code gets harder and harder to look at and to think about. If you’re working by yourself, just go ahead and use whatever formatting choices strike your fancy. If you’re collaborating, don’t get caught up in enforcing formatting rules: use a tool (like clang-format for C++-like code) that will allow you to declare what the format will be for the code, and then run the tool to enforce that format.
Don’t repeat yourself
The phrase even has its own Wikipedia page. Have you already written a piece of code that does something, and are you about to copy-paste it into a new part of your code? Did you even read the section above titled “Don’t write code”??
What if there is a bug in that section of code? Will you remember all of the places you need to change it? Repeating code like this is a sign that you have a chance to refactor your code, separate out whatever that chunk of code is doing, and only write the code once.
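As one hedged example of what this can look like, consider the minimum image convention for a periodic simulation box, a snippet that tends to get pasted into every function that needs a distance. In the sketch below (all function names are hypothetical, and a square two-dimensional box is assumed) the wrapping logic lives in exactly one place, and everything else calls it.

```cpp
#include <cassert>
#include <cmath>

// Written once: map a coordinate separation onto its nearest periodic image,
// so the result lies within half a box length of zero.
double applyMinimumImage(double separation, double boxLength) {
    assert(boxLength > 0.0);
    return separation - boxLength * std::round(separation / boxLength);
}

// Reuses the helper for each component instead of repeating the formula.
double computePeriodicDistance(double x1, double y1, double x2, double y2, double boxLength) {
    double separationX = applyMinimumImage(x2 - x1, boxLength);
    double separationY = applyMinimumImage(y2 - y1, boxLength);
    return std::hypot(separationX, separationY);
}

// A second caller: still no copy of the wrapping formula.
bool areWithinInteractionRange(double x1, double y1, double x2, double y2,
                               double boxLength, double interactionRange) {
    return computePeriodicDistance(x1, y1, x2, y2, boxLength) < interactionRange;
}
```

If the wrapping turns out to be wrong (or the box stops being square), there is a single function to fix.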
Write with bugs in mind
Perhaps this is more a reflection on my own skills as a programmer, but I am relatively convinced that every sufficiently large piece of code has an infinite number of bugs in it. With that in mind, write your code with these bugs in mind: add assertions to your code to make sure that at important parts of your program things are as you expect (the points are all inside the simulation box, the eigenvalues of the matrix you know should be positive definite are all greater than zero). Think about how you will implement automated testing: As you write new functions, are they simple enough that you can evaluate their correctness by inspection? Do you need some automated unit tests? Or end-to-end tests?
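A minimal sketch of the first kind of assertion mentioned above (all points are inside the simulation box) might look like the following. The Point2D type, the function names, and the convention that the box spans [0, boxLength) in each direction are assumptions for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical type for illustration.
struct Point2D {
    double x;
    double y;
};

// The invariant we want to hold: every point lies in [0, boxLength) x [0, boxLength).
bool allPointsInsideBox(const std::vector<Point2D> &points, double boxLength) {
    for (const Point2D &point : points) {
        if (point.x < 0.0 || point.x >= boxLength || point.y < 0.0 || point.y >= boxLength) {
            return false;
        }
    }
    return true;
}

double wrapCoordinateIntoBox(double coordinate, double boxLength) {
    double wrapped = std::fmod(coordinate, boxLength);
    if (wrapped < 0.0) {
        wrapped += boxLength;
    }
    // For tiny negative inputs the addition above can round to exactly
    // boxLength, which would violate the invariant.
    if (wrapped >= boxLength) {
        wrapped = 0.0;
    }
    return wrapped;
}

void wrapPointsIntoBox(std::vector<Point2D> &points, double boxLength) {
    assert(boxLength > 0.0); // precondition
    for (Point2D &point : points) {
        point.x = wrapCoordinateIntoBox(point.x, boxLength);
        point.y = wrapCoordinateIntoBox(point.y, boxLength);
    }
    assert(allPointsInsideBox(points, boxLength)); // postcondition
}
```

The rounding guard in wrapCoordinateIntoBox() is exactly the sort of edge case that a postcondition assertion tends to surface long before it would show up as a mysterious wrong answer. Standard assert() calls are compiled out when NDEBUG is defined, so they cost nothing in optimized production builds.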
Finally, when bugs come up, think through them systematically and rigorously. Identify an easy-to-reproduce failure case, and then line by line figure out what is wrong. Use good debugging tools, which can range from the old “print-stuff-to-the-screen” standby to symbolic debuggers like GDB to tools like Valgrind. Think about turning these failure cases into test cases for future iterations of the code.
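As a sketch of turning a failure case into a test case, suppose (hypothetically) that a histogram helper crashed whenever a value landed exactly on the upper edge of the range, because the computed bin index was one past the last bin. After fixing it, the minimal reproducing input becomes a permanent regression test. The function names and the bug itself are invented for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical histogram helper: map a value in [lowerEdge, upperEdge] onto
// one of numberOfBins equal-width bins.
std::size_t computeBinIndex(double value, double lowerEdge, double upperEdge,
                            std::size_t numberOfBins) {
    assert(numberOfBins > 0);
    assert(upperEdge > lowerEdge);
    assert(value >= lowerEdge && value <= upperEdge);
    double fractionAcrossRange = (value - lowerEdge) / (upperEdge - lowerEdge);
    std::size_t binIndex = static_cast<std::size_t>(fractionAcrossRange * numberOfBins);
    // The fix: value == upperEdge used to produce binIndex == numberOfBins,
    // one past the last valid bin.
    if (binIndex >= numberOfBins) {
        binIndex = numberOfBins - 1;
    }
    return binIndex;
}

// Regression test: the smallest input that reproduced the original failure.
bool binIndexStaysInRangeAtUpperEdge() {
    return computeBinIndex(1.0, 0.0, 1.0, 10) == 9;
}

// Ordinary unit test: values small enough to check by hand.
bool binIndexMatchesHandComputedValues() {
    return computeBinIndex(0.0, 0.0, 1.0, 10) == 0 &&
           computeBinIndex(0.55, 0.0, 1.0, 10) == 5;
}
```

Running both checks on every build means that this particular bug cannot quietly return in a later refactor.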
Documentation
Finally, sufficiently long programs (i.e., those which cannot fit into your memory all at the same time) should be documented. The most powerful thing you can do is to first document the purpose of your code and the design decisions you have made in solving whatever problem you are working on. Combining this kind of documentation with good documentation of the interfaces to functions and classes you write will go quite a long way to maintaining the readability and understandability of your code. This is especially true if the code itself uses sensible names for its functions, classes, and variables. If you find yourself writing too much documentation on the steps of the code itself, that might be a hint that you have an opportunity to refactor it into something easier to reason about. Yes, sometimes you actually do have to write a complicated piece of code to implement an especially gnarly algorithm. But sometimes you are just making it harder on yourself.
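Since the project layout above sets aside a doc/ directory for Doxygen output, here is a minimal sketch of what interface documentation in that style might look like. The function, its parameter names, and the stated design decision (a plain truncated Lennard-Jones force) are assumptions invented for this example. The comment records the purpose, the design choice, and the interface, and leaves the individual steps to the variable names.

```cpp
#include <cassert>
#include <cmath>

/*!
 * \brief Radial force between two particles interacting through a truncated
 *        Lennard-Jones potential.
 *
 * Design decision: the interaction is simply cut off at cutoffDistance, with
 * no shifting or smoothing, so the force jumps to zero there. That keeps the
 * function simple, at the cost of a small discontinuity at the cutoff.
 *
 * \param separation       distance between the two particles (must be positive)
 * \param wellDepth        energy scale (epsilon) of the potential
 * \param particleDiameter length scale (sigma) of the potential
 * \param cutoffDistance   separation at and beyond which the force is zero
 * \return signed force along the line joining the particles; positive values
 *         are repulsive, negative values attractive
 */
double computeLennardJonesForce(double separation, double wellDepth,
                                double particleDiameter, double cutoffDistance) {
    assert(separation > 0.0);
    if (separation >= cutoffDistance) {
        return 0.0;
    }
    double diameterOverSeparation = particleDiameter / separation;
    double sixthPower = std::pow(diameterOverSeparation, 6);
    double twelfthPower = sixthPower * sixthPower;
    return 24.0 * wellDepth * (2.0 * twelfthPower - sixthPower) / separation;
}
```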