Sussman Lab

Managing computational research

Feel free to skip the preamble and jump directly to the guides below.

Computational research often involves running many programs (executing simulations, analyzing raw data, generating plots, etc.). These programs often have complex interdependencies, and managing the overall workflow can become an interesting challenge in itself, even for quite modestly sized projects. While some computational projects can be done entirely in a single Jupyter notebook, much of what we do involves working with multiple executable chunks of code that we want to manage in complex ways. Perhaps, for instance, we want to simulate a large number of molecular systems (scanning across some parameter space), perform a standard set of analyses on all of the output, and then conditionally perform additional simulations and different analyses based on the outcome of that first set of analyses.
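To make that concrete, here is a minimal bash sketch of that kind of pipeline. The program names (simulate, analyze, refine), their flags, and the "interesting" marker in the summary file are all hypothetical placeholders, not real tools; the point is only the shape of the workflow.

```bash
#!/bin/bash
# Hypothetical sketch of the workflow described above: scan a parameter,
# analyze each run, and conditionally launch a follow-up simulation.
# "simulate", "analyze", and "refine" are placeholder program names.

for temperature in 0.1 0.2 0.5 1.0; do
    outdir="data/T${temperature}"
    mkdir -p "$outdir"

    # run the primary simulation for this point in parameter space
    ./simulate --temperature "$temperature" --output "$outdir/trajectory.dat"

    # run the standard analysis on the raw output
    ./analyze "$outdir/trajectory.dat" > "$outdir/summary.txt"

    # conditionally run a second, more expensive simulation
    # based on the outcome of the first analysis
    if grep -q "interesting" "$outdir/summary.txt"; then
        ./refine --input "$outdir/trajectory.dat" --output "$outdir/refined.dat"
    fi
done
```

Even a loop this simple makes the dependency structure (simulate, then analyze, then conditionally refine) explicit, rather than leaving it to live only in someone's head.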

This can be (and often is!) managed by hand, but before a project becomes so unwieldy that it is unmanageable, it is worth settling on some reasonable default strategies (how will you set up and version control a directory containing all of the programs you will need, for instance?), and it is worth thinking seriously about how you will manage the actual execution of the project over the long term.

Reasonable practices for computational research

In this section I describe a few (hopefully) reasonable practices related to setting up the structure of computational projects, and offer a few thoughts about writing scientific code.
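As a flavor of the kind of default discussed there, a project skeleton along the following lines can be set up and put under version control in a few commands. This is a sketch rather than a prescription, and the directory names are just placeholders.

```bash
# One possible (hypothetical) project skeleton, kept under version control.
mkdir -p myProject/{src,scripts,data,analysis,figures}
cd myProject
git init
# src/      simulation source code
# scripts/  bash / submission scripts that drive the workflow
# data/     raw output (often excluded from version control if large)
# analysis/ analysis code and notebooks
# figures/  generated plots
echo "data/" >> .gitignore
git add .gitignore && git commit -m "initial project skeleton"
```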

From the command line to simple bash scripts

Automating complex tasks can be enormously beneficial: it frees up your time and makes your research much more reproducible. This section starts from working on the command line and builds up to writing straightforward-but-still-quite-powerful bash scripts.
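As a taste of what that looks like, a minimal script of this sort just chains a few command-line steps together; the script and file names here are hypothetical placeholders.

```bash
#!/bin/bash
# Minimal example of turning a sequence of command-line steps into a script.
# The executable and file names are placeholders.
set -e                           # stop immediately if any step fails

./compileAndRun.sh               # build and run the simulation
python analyzeData.py raw/       # analyze the raw output
python makePlots.py results/figure1.pdf
echo "pipeline finished on $(date)"
```

The `set -e` line is the kind of small habit that pays off: if the simulation fails, the analysis and plotting steps are skipped rather than silently run on stale data.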

Workload and workflow management

While the bash scripts introduced in the previous section are extremely useful (indeed, I often use them for relatively simple projects), sometimes it is worth invoking more powerful tools to manage our workloads and workflows. This section introduces a variety of workload management tools (from more complex bash scripts to job schedulers), and discusses one workflow management tool for even more complex computational projects.
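For instance, on a cluster running the SLURM scheduler, a parameter scan like the one sketched above can be handed off to the workload manager as a job array. The #SBATCH directives below are standard SLURM, but the parameter values, paths, and program name are placeholders.

```bash
#!/bin/bash
# Hypothetical SLURM job-array script: the scheduler directives are standard
# SLURM, but the paths and program names are placeholders.
#SBATCH --job-name=paramScan
#SBATCH --array=0-9          # ten tasks, one per parameter value
#SBATCH --time=04:00:00
#SBATCH --ntasks=1

parameters=(0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0)
p=${parameters[$SLURM_ARRAY_TASK_ID]}

mkdir -p "data/T${p}"
./simulate --temperature "$p" --output "data/T${p}/trajectory.dat"
```

Submitting this once with sbatch queues all ten tasks, and the scheduler runs them as resources become available.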

Learn more

If you’re interested in “best practices” in computational research, the computational biology community has a number of good resources, including a pair of articles on “Best practices” and “Good enough practices”. Some of the information in those articles is specialized to the kind and scale of computational research more commonly done in that community, but much of it is still worth reading and thinking about.

If you’re interested in learning more about the Linux command line, LinuxCommand.org is a reasonable place to start, and Ryan’s Tutorials has some more in-depth information (particularly about many of the command line utilities that Linux ships with). If you’re interested in learning more about bash scripting, the bash guide is quite good.

I have yet to find very many good guides to workload and workflow management. If you come across any that you like, send me an email so I can (a) learn and (b) add them here!