Sussman Lab

The flow of computational research can be managed by hand, but I think it is worth (a) getting comfortable working from the command line and (b) acquainting yourself with some of the most common ways of setting up scripts to automate your tasks for you. Below I’ll walk through the first few levels of running programs from the command line (not necessarily “levels” in terms of their value, but in terms of how robust, flexible, and/or reproducible they can be). All of the below assumes a Linux-like environment (which might actually be Linux, or macOS, or Windows Subsystem for Linux), and I’ll focus on the Bash shell – there are more modern shells you can use (for instance, macOS has made zsh its default instead of Bash), and I think in some ways they are better. But Bash is still everywhere, and it’s good to learn the basics.

This is meant to be read in order if you aren’t familiar with most of these concepts, but feel free to jump to any of the main sections:
Level 0: Command line jobs
Level 1: Simple bash scripts
Level 2: More complex bash scripts
Level 3: Setting and forgetting

DISCLAIMER: This guide is still under construction. While I think it is a good indication of how to get started using bash, it does not include best practices for error checking, handling arguments, and so on.

Level 0: Running a job from the command line

The Linux command line is a powerful (and, at first, confusing) place to be. If you’ve only ever used graphical user interfaces, I strongly suggest reading a tutorial or two — there are a number of excellent ones just a google search away — before proceeding.

If you’ve already understood that tutorial-level information, this first section will be an extremely rudimentary reminder: The most basic way to run a job from the command line is to literally run it from the command line in your terminal.

Suppose you are in a directory with an executable program called doSomethingAwesome. You could execute that program by typing ./doSomethingAwesome and hitting return. Throughout this page I’ll visualize this kind of “execute something in the shell” by using a $ to denote your command line prompt and showing something like:

$ ./doSomethingAwesome

Command line flags

Perhaps that program has some command-line flags that can be used to control its behavior – perhaps -n sets the number of particles, -i sets the number of simulation timesteps, -r sets the density, and -m sets the type of boundary conditions to use. You could then run

$ ./doSomethingAwesome -n 10000 -i 1000000 -r 1.2 -m periodicBoundaryConditions 

to simulate ten thousand particles for a million timesteps under periodic boundary conditions at a density of 1.2. Or whatever. You would then have the pleasing experience of watching any output of your program scroll by on the screen.

Summary

Running things directly from the command line like this is very powerful and extremely useful – I do it all the time. But it also has real issues and limitations. Unless you are meticulous you do not have an indefinite record of the commands you actually entered (a problem for scientific reproducibility), and if you have to run many simulations over a range of parameters you will be tied to your keyboard – constantly typing parameter values, hitting enter, waiting, and repeating.

Surely there’s a better way.

Level 1: Creating and executing a simple bash script

Bash is a ubiquitous shell (a “shell” is just a program that takes commands from a keyboard and then hands those commands over to the operating system to perform) that you can access on any modern (and many pre-modern) operating system. A bash script is just a plain text file (by convention they’re given .sh endings, but that is irrelevant) that contains a sequence of commands. Here I’ll go over the absolute basics; there are many tutorials (e.g., here is one of the first google references, and on a skim it seems very reasonable).

The simplest bash script

Let’s create a file, hello.sh, which contains the following two lines.

#!/bin/bash
echo "Hello world!"

The first line tells the system to interpret the rest of the file in bash, and it has to be the very first line of the file, with the #! as its very first characters. If we were to try to execute this right away by typing ./hello.sh and hitting return, the shell would refuse with a “Permission denied” error. It’s just a plain text file, so we need to give ourselves permission to execute it. On the command line, from the directory containing the file, do this: chmod +x hello.sh.

Now, if we type ./hello.sh we should see our delightful message printed on our screen.
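
If you would like to try the whole create, chmod, run cycle without cluttering a real project, here is a minimal sketch. It assumes the mktemp tool is available (it is on most Linux-like systems), and the workdir and greeting variable names are just mine for illustration.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing in your real folders is touched
workdir=$(mktemp -d)
cd "$workdir"

# Create hello.sh containing the two lines shown above
printf '%s\n' '#!/bin/bash' 'echo "Hello world!"' > hello.sh

# Mark the file as executable, then run it and capture what it prints
chmod +x hello.sh
greeting=$(./hello.sh)
echo "$greeting"
```

Running ls -l hello.sh before and after the chmod line is a nice way to see the x permission bits appear.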

A script with multiple commands

The first line in a bash script is the mandatory #!/bin/bash, but after that you can write not just one but any number of commands. For instance, suppose you wanted to write a script that first printed “Hello world!” to screen, then searched all files within the current directory and all sub-directories for occurrences of the word “potato,” created a file named “potatoFile.txt” that contains on each line the file that has this word and the contents of the line of the file with that word, and then opened that new file in the vim editor (I am not responsible for why you might want to do this). To accomplish your aim, you could create the findPotato.sh file containing

#!/bin/bash
echo "Hello world!"
grep -r "potato" . > potatoFile.txt
vim potatoFile.txt

Once you make this executable by chmod-ing it, you can run it and achieve your desires – the first line is the mandatory opener of bash scripts, the second prints “Hello world!” to screen, the third uses the “grep” command line tool to search for instances of “potato” in your files and then redirects the output to a new file with the given name, and the fourth uses the vim program to open that file.
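
To see what the grep-and-redirect step actually writes, here is a small sketch on made-up data. The notes directory and the two text files are hypothetical, created only for this illustration. I search notes rather than the current directory so that the output file is not itself among the files being searched.

```shell
#!/bin/bash
# Build a tiny throwaway directory tree to search
workdir=$(mktemp -d)
cd "$workdir"
mkdir notes
echo "I like potato soup" > notes/dinner.txt
echo "no vegetables here" > notes/lunch.txt

# -r searches recursively; > sends what grep prints into a file instead of the screen
grep -r "potato" notes > potatoFile.txt

# Each line of the result has the form  fileName:matchingLine
cat potatoFile.txt
```

Swapping the > for >> would append to potatoFile.txt rather than overwrite it, which can be handy when collecting results from several searches.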

Again: a bash script is just a sequence of commands, and they will be executed one after the other. To learn more about the most common command line tools, this page is a good place to start. If nothing else, learn commands for copying files, interacting with a remote, creating and moving around directories, and then take a look at “grep” and “sed”… and maybe “awk”.

Running a sequence of programs in a plain script

Since a bash script is just a sequence of commands that will be executed, we can already do something useful. Instead of sitting at the keyboard and periodically typing the next job in a parameter sweep, we can set up a script to go through all of the jobs we want to submit. Suppose I have a pattern in which I run a simulation, use a different program to analyze the data, and then I repeat for a new parameter set. We could make a bash script that looks like:

#!/bin/bash
# Anything after the first line that starts with a hash is a comment, and won't affect the script
# The commands below used versions (or commits) XX and YY of the two programs
./doSomethingAwesome -n 10000 -i 1000000 -r 1.0
./analyzeTheData -n 10000 -i 1000000 -r 1.0
./doSomethingAwesome -n 10000 -i 1000000 -r 1.1
./analyzeTheData -n 10000 -i 1000000 -r 1.1
./doSomethingAwesome -n 10000 -i 1000000 -r 1.3
./analyzeTheData -n 10000 -i 1000000 -r 1.3
./doSomethingAwesome -n 10000 -i 1000000 -r 1.4
./analyzeTheData -n 10000 -i 1000000 -r 1.4
./doSomethingAwesome -n 10000 -i 1000000 -r 1.5
./analyzeTheData -n 10000 -i 1000000 -r 1.5

We run this, and then sit back and watch the results of 10 commands execute one after the other (each taking however long they take). Better yet, we can now leave and grab a coffee while the computer does some work for us. Even better, and more seriously, we now have a file with a record of exactly what code we ran (thanks to the comments we left on the third line of the file) and over what parameters.

Summary

Level 1 is a fantastic step forward in its power and flexibility. It is also kind of tedious – do you really want to write out by hand all of the command line arguments you were going to have to type? What if you want to scan over many different parameters in a grid search? It is also not exactly the most human readable – if you had a typo in one line out of a list of tens or hundreds or thousands of simulations, would you even notice?

Surely there’s a better way.

Level 2: More complex scripts

You start out thinking of bash scripts as containing a simple sequence of commands, but bash scripts can use a variety of programmatic concepts and become quite sophisticated. This section is not meant to be at all exhaustive in describing what a bash script can do, but I just want to give a little flavor.

Command line options and variables

First, we can easily set local variables in our bash script: we just write down the name, an equals sign, and a value. One can then refer to these values in later parts of the bash script by prepending a dollar sign. For instance, we could modify our hello.sh example to be:

#!/bin/bash
v1=Hello
v2=world!
echo $v1 $v2

Executing this would, as you might expect, cause “Hello world!” to be printed to the screen. One has to be a little careful, by the way: bash wants to interpret everything as a string and by default uses spaces to separate words. If we wanted to store a string of several words in a variable, we would have had to use quotation marks to write, e.g., v2="to the world!". That also means that one has to be careful with spaces around the equals sign: v1= Hello wouldn’t have worked.
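
A minimal sketch of the quoting rule just described (the message variable is my own addition for illustration):

```shell
#!/bin/bash
v1=Hello
v2="to the world!"      # the quotes keep all three words in a single variable
message="$v1 $v2"       # quoting here as well preserves the spacing exactly
echo "$message"
```

Getting in the habit of writing "$v1" rather than a bare $v1 when you use a variable saves a lot of grief once values start containing spaces.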

Now, just as many programs can accept command line arguments, so too can we pass command line arguments to a bash script. There are special names for the first nine command line arguments passed to a script, and they are $1, $2,…,$9 (and, of course, there are ways to deal with additional arguments). We could change our hello.sh script now to:

#!/bin/bash
echo $1 $2

Running this with

$ ./hello.sh Hello world!

would cause “Hello world!” to be printed to the screen.
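
As a sketch of the "ways to deal with additional arguments" mentioned above: bash also provides $# (the number of arguments) and "$@" (all of the arguments, each kept as its own word). So that the snippet can be pasted and run as-is, I fake the command line with set --, which is purely an illustration device; in a real script the arguments would come from whatever you typed after the script name.

```shell
#!/bin/bash
# Pretend the script was called as:  ./hello.sh Hello world! again
set -- Hello world! again

echo "Number of arguments: $#"
echo "First two arguments: $1 $2"

# Loop over every argument, however many there are
collected=""
for arg in "$@"
do
    collected="${collected}[${arg}]"
done
echo "$collected"
```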

I typically do not recommend relying on command line arguments for bash scripts that are going to be executing your scientific code / project workflow. This, again, raises serious issues with reproducibility – how will future you know exactly what command you executed? On the other hand, one can very productively use simple bash scripts for all sorts of automated tasks, and this can be handy to know about. For example, suppose you frequently grab data off of some remote computer – scp is a powerful command-line tool, but it’s a bit tiring to always be typing out the boilerplate scp username@remotename:/path/to/dataFileOrDirectory /pathToDesiredSaveLocation, especially when remotename is itself sometimes long. Here’s a scpGetFromSpecificRemote.sh you could write to simplify your life:

#!/bin/bash
remoteBaseName="username@remote.name:~/" #point at your home directory, or modify to point somewhere else
pathToRemoteData=$1
pathToLocalSave=$2
recursivelyCopyDirectoryFlag=$3 # an optional command line argument...if nothing is entered $3 will be an empty string, and the next comparison will be false
if [[ $recursivelyCopyDirectoryFlag == "r" ]]; then
    # quoting the arguments keeps paths containing spaces intact
    scp -r "${remoteBaseName}${pathToRemoteData}" "${pathToLocalSave}"
else
    scp "${remoteBaseName}${pathToRemoteData}" "${pathToLocalSave}"
fi

As you can see, in the script we have hard-coded some information to save us from writing boilerplate parts of a common command, and used some simple control flow to be able to use the same script to either get files or recursively copy whole directories. Now you can run commands like

$ ./scpGetFromSpecificRemote.sh project1/data/coolFile.nc ~/localPathToProject/data

a pattern that kind of mirrors how the cp command works, and is perhaps convenient for us to use. Between control flow, loops (below), and functions (not discussed here), bash scripts can get extremely powerful and complex. Personally, if I find myself writing a particularly complex script I take it as a sign that I should find another way to organize things.

Loops

One can be much more sophisticated (and programmatic) about this, but at this level we should know that bash provides an easy ability to perform loops inside the script. The basic syntax is something like

for variable in item1 item2 item3
do 
    someCommand
done
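
As a concrete and harmless instance of that syntax, here is a sketch that reuses the doSomethingAwesome flags from Level 1. The command is only printed via echo, not executed, so this runs even without the program; the count variable is my own addition.

```shell
#!/bin/bash
count=0
for density in 1.0 1.1 1.2
do
    # Inside the loop body, $density takes each listed value in turn
    echo "would run: ./doSomethingAwesome -n 10000 -i 1000000 -r $density"
    count=$((count + 1))
done
echo "looped over $count densities"
```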

These loops can be nested, and this makes for an easy way to loop over a grid of parameters. For instance, here’s a script I once wrote to loop over four different arrays of parameters:

#!/bin/bash
#(some details removed)
n=2048
e=0.01
t=100000
for f in 0 1 2 3 4 
do
for v in 0.00001 0.0001 0.01
do
for p in 3.5 3.6 3.7 3.725 3.75 3.775 3.8 3.81 3.82 
do
for d in 0.01 0.1 1.0 10.0
do
./main.out "-n" $n "-v" $v "-p" $p "-e" $e "-d" $d "-t" $t "-f" $f
done
done
done
done

You can see, here, that the indentation of the code doesn’t matter. But for readability, I should have indented it.

By the way: every time I write a bash script I test it by replacing the command I actually want to run with an “echo” version of the same: then I can run the script and check the on-screen output to make sure the script will do what I want.
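
One way to set up that kind of echo test is sketched below. The DRYRUN variable name is hypothetical (my choice, not a bash feature), and the parameter values are a trimmed-down version of the grid above.

```shell
#!/bin/bash
# With DRYRUN=echo each command line is printed instead of executed.
# Setting DRYRUN to an empty string would make the loop run the commands for real.
DRYRUN=echo
n=2048
launched=0
for p in 3.5 3.6
do
    for d in 0.01 0.1
    do
        $DRYRUN ./main.out -n "$n" -p "$p" -d "$d"
        launched=$((launched + 1))
    done
done
echo "printed $launched command lines"
```

Reading through the printed lines is usually enough to catch a mistyped flag or a loop over the wrong variable before any compute time is spent.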

What if you want to loop over coupled sets of parameters? Like, you want to simulate for a certain amount of time that depends on the temperature of your simulation, and you want to vary the temperature? There are several ways to accomplish that, and here’s one of them:

#!/bin/bash
p=3.8
temperatures=(0.01 0.008 0.005 0.00385 0.0031 0.0025 0.002)
tauEstimates=(200. 400. 1300. 3000. 5000. 8500. 10000.)
for n in 1024 2048 4096 8192 16384 32768
do
    for i in "${!temperatures[@]}";
    do
        ./voronoi.out -p $p -v "${temperatures[i]}" -i "${tauEstimates[i]}" -n $n
    done
done

Here you see a nested loop over system size (n), along with a loop over an array index which is then used to access a specific element of two different arrays (which are, of course, of the same length). The stuff on the for i... line expands to the list of indices of the array (0, 1, 2, and so on); the length of an array would instead be written ${#temperatures[@]}.
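
The two array expansions look similar and are easy to mix up, so here is a short sketch that prints both on a shortened version of the arrays above. The indices and length variable names are mine.

```shell
#!/bin/bash
temperatures=(0.01 0.008 0.005)
tauEstimates=(200. 400. 1300.)

indices="${!temperatures[*]}"   # the indices of the array, joined by spaces
length="${#temperatures[@]}"    # the number of elements in the array
echo "indices: $indices"
echo "length:  $length"

# The same index picks out matching entries from both arrays
for i in "${!temperatures[@]}"
do
    echo "T=${temperatures[i]} tau=${tauEstimates[i]}"
done
```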

Summary

Level 2 is extraordinary, and even if we stopped here we would have immense power in the palm of our hands. However, we are still fundamentally executing commands one after the other, in serial fashion, and watching the output scroll past our eyes on the screen (and hopefully saving the data somewhere). We often want to run code over independent parameters – the results of one set of calculations do not depend on what happens for another set of parameters – and our computers are multithreaded beasts. Do we really have to only run one thing at a time? If we want parallelization, do we have to hackily open up multiple terminal windows and be running a different shell in each one to cover parameter space?

Surely there’s a better way.

Level 3: Setting and forgetting (nohup and disown)

Background, nohup, and disown on the command line

Let’s spend just a little bit of time learning about “foreground” and “background” jobs. If you were to execute

$ ./someProgram

then that program would run. You would be able to see any output of the program printed to your screen, but other than hitting <ctrl>-c to halt the program you couldn’t really interact with the shell. One thing you could do is to launch the program in the background by appending an ampersand to the command:

$ ./someProgram &

The program is now a background process, so you get your command line prompt again and you could run another command (perhaps even another command also in the background). However, you will notice that that program still outputs to stdout – i.e., even though it is in the background, it will still print output to your screen. You can stop this behavior by redirecting the output from stdout (and, if you like, also stderr) to some other file. For instance:

$ ./someProgram &> outputFile.txt &

Fantastic. Now your program will run in the background, and anything that would have been printed to screen will instead start populating the text file you just specified (note that the &> is the redirection operator that sends both standard output and standard error to the specified file). However, the program is still tied to both the shell it is running in and the terminal the shell is running in. What happens if you close your shell? Or disconnect from the remote server that you are running a job on over ssh? Well, your job will be sent a signal to terminate (the “hangup” signal, SIGHUP), much as if you had hit <ctrl>-c on a job you were running normally from the command line.
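
Here is a small sketch for experimenting with backgrounding and redirection. The someProgram function is a stand-in I made up so that the snippet runs anywhere, and it goes slightly beyond the text by using $! (the process ID of the most recent background job) and wait (pause until that job finishes).

```shell
#!/bin/bash
# Stand-in for ./someProgram: writes one line to stdout and one to stderr
someProgram() {
    echo "result on stdout"
    echo "warning on stderr" >&2
}

logfile=$(mktemp)
someProgram &> "$logfile" &   # run in the background, both streams go to the file
pid=$!                        # remember the background job's process ID
wait "$pid"                   # block until the job has finished

cat "$logfile"                # both lines ended up in the file, not on the screen
```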

What we often want to do is set something up to run – on our own machine, or a remote machine – in such a way that we can just set it and forget it, regardless of whether we accidentally close our terminal, or if our ssh connection times out, or… The way to accomplish our aims here is to combine the nohup and disown commands, like so:

$ nohup ./someProgram &> outputFile.txt & disown

Finally, it might be worthwhile to note that you can sometimes write much more compact sequences of commands that might do the same thing. On some systems the two commands

$ ./someProgram.sh &> logFile.txt &
$ nohup ./someProgram.sh &> logFile.txt &

might actually be operationally equivalent: it depends on the particular shell you are using and how it is configured. In bash, for instance, it depends on the huponexit shell option: if it is off (the default in bash), then the two commands may well do the same thing when you log out, but if it is on then the nohup approach is needed. A similar comment applies to the use of disown: depending on shell settings, nohup-ed jobs may or may not terminate when logging out, and you can disown those jobs to make sure they persist. Depending on the level of job control permissions you have on different systems, you should use these different settings with care.
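
If you are curious how your own bash is configured, the shopt builtin can report the huponexit setting. A minimal sketch (the hupState variable name is mine):

```shell
#!/bin/bash
# shopt -q is silent and reports through its exit status: success means the option is on
if shopt -q huponexit
then
    hupState="on"
else
    hupState="off"
fi
echo "huponexit is $hupState in this shell"
```

Note that a script runs in its own fresh shell, so to learn about your interactive session it is better to type shopt huponexit directly at the prompt.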

Incorporating this into bash scripting

This will be short: a bash script is just a sequence of commands, so anything we can do from the command line we can do in our scripts. For instance, slightly modifying something from above:

#!/bin/bash
temperatures=(0.01 0.008 0.005)
for i in "${!temperatures[@]}";
do
    outfile="T${temperatures[i]}output.txt"
    nohup ./voronoi.out -t "${temperatures[i]}"  &> ${outfile} & disown
done

Running this script will launch all of the jobs in the loop into the background at once, and each will have its on-screen output printed into a different file. What happens if you accidentally launch a job with an infinite while loop, or you made some other mistake and want to cancel your jobs? Become familiar with the top and kill commands!
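
As a safe way to practice cancelling a job, here is a sketch that uses sleep as a stand-in for a runaway program. It relies on $! to get the process ID; for jobs you launched earlier you would instead read the PID off of top or ps.

```shell
#!/bin/bash
sleep 30 &                    # stand-in for a long-running job
pid=$!                        # process ID of the job we just launched
echo "launched job with PID $pid"

kill "$pid"                   # by default kill sends the polite SIGTERM signal

status=0
wait "$pid" 2>/dev/null || status=$?
echo "job ended with status $status"   # non-zero, because the job was killed
```

If a job ignores the default signal, kill -9 followed by the PID sends SIGKILL, which cannot be ignored; I treat that as a last resort since the program gets no chance to clean up.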

Summary

Unbelievable – we can now write scripts that, when executed, can loop over arbitrary launches of various programs, and we can decide whether to run those jobs one after the other or simultaneously launch all of them in the background as detached processes. We practically have the power of the sun at our fingertips. But… doing this makes it kind of easy to melt your computer. You probably only want to run X jobs in parallel at a time, so that your computer can still be used for other things. Unfortunately, with the pattern we just learned, you can easily run far more CPU- and RAM-intensive jobs than your system can manage simultaneously.

Surely there’s a better way.