Sussman Lab

Bash scripting and command line jobs

Computational research often involves running many programs (executing simulations, analyzing raw data, generating plots, etc.), and it is worth getting comfortable doing all of this from the command line. Below we’ll walk through what I think of as various levels of running programs from the command line – not levels in terms of their value, but in terms of how robust, flexible, and/or reproducible they can be. All of the below assumes a Linux-like environment (which might actually be Linux, or MacOS, or the Windows Subsystem for Linux), and I’ll focus on the Bash shell – there are more modern shells you can use (for instance, MacOS has made zsh its default instead of Bash), and I think in some ways they are better. But Bash is still everywhere, and it’s good to learn the basics.

This is meant to be read in order if you aren’t familiar with most of these concepts, but feel free to jump to any of the main sections:
Level 0: Command line jobs
Level 1: Simple bash scripts
Level 2: More complex bash scripts
Level 3: Setting and forgetting
Level 4: Workload management systems
Level 5: Workflow management systems

Level 0: Running a job from the command line

The most basic way to run a job from the command line is to literally type the command into your terminal. Suppose you are in a directory with an executable program called doSomethingAwesome. You could execute that program by typing ./doSomethingAwesome and hitting return. Throughout this page I’ll visualize this kind of “execute something in the shell” by using a $ to denote your command line prompt and showing something like:

$ ./doSomethingAwesome

Command line flags

Perhaps that program has some command-line flags that can be used to control its behavior – perhaps -n sets the number of particles, -i sets the number of simulation timesteps, -r sets the density, and -m sets the type of boundary conditions to use. You could then run

$ ./doSomethingAwesome -n 10000 -i 1000000 -r 1.2 -m periodicBoundaryConditions 

to simulate ten thousand particles for a million timesteps under periodic boundary conditions at a density of 1.2. Or whatever. You would then have the pleasing experience of watching any output of your program scroll by on the screen.

Summary

Running things directly from the command line like this is very powerful and extremely useful – I do it all the time. But it also has real issues and limitations. Unless you are meticulous you will not have a lasting record of the commands you actually entered (your shell history only goes back so far – a problem for scientific reproducibility), and if you have to run many simulations over a range of parameters you will be tied to your keyboard – constantly typing parameter values, hitting enter, waiting, and repeating.

Surely there’s a better way.

Level 1: Creating and executing a simple bash script

Bash is a ubiquitous shell (a “shell” is just a program that takes commands from a keyboard and hands those commands over to the operating system to perform) that you can access on any modern (and many pre-modern) operating systems. A bash script is just a plain text file (by convention they’re given .sh endings, but that is not required) that contains a sequence of commands. Here I’ll go over the absolute basics; there are many tutorials (e.g., here is one of the first Google results, and on a skim it seems very reasonable).

The simplest bash script

Let’s create a file, hello.sh, which contains the following two lines.

#!/bin/bash
echo "Hello world!"

The first line tells the system to interpret the rest of the file in bash, and it must be exactly this, with no spaces, on the very first line of the file. If we were to try to execute this by typing ./hello.sh and hitting return, we would get a “Permission denied” error. It’s just a plain text file, so we need to mark it as something our shell is allowed to execute. On the command line do this: chmod +x hello.sh.

Now, if we type ./hello.sh we should see our delightful message printed on our screen.
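
Putting it all together, the whole interaction should look something like this (assuming hello.sh is in your current working directory):

$ chmod +x hello.sh
$ ./hello.sh
Hello world!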

A script with multiple commands

The first line in a bash script is the mandatory #!/bin/bash, but after that you can write not just one but any number of commands. For instance, suppose you wanted to write a script that first prints “Hello world!” to screen, then searches every file within the current directory and all sub-directories for occurrences of the word “potato,” writes a file named “potatoFile.txt” in which each line lists a file containing that word together with the matching line from that file, and finally opens that new file in the vim editor (I am not responsible for why you might want to do this). To accomplish your aim, you could create a findPotato.sh file containing

#!/bin/bash
echo "Hello world!"
grep -r "potato" . > potatoFile.txt
vim potatoFile.txt

Once you make this executable by chmod-ing it, you can run it and achieve your desires – the first line is the mandatory opener of bash scripts, the second prints “Hello world!” to screen, the third uses the “grep” command line tool to search for instances of “potato” in your files and redirects the output to a new file with the given name, and the fourth uses the vim program to open that file.

Again: a bash script is just a sequence of commands, and they will be executed one after the other. To learn more about the most common command line tools, this page is a good place to start. If nothing else, learn commands for copying files, interacting with a remote, creating and moving around directories, and then take a look at “grep” and “sed”… and maybe “awk”.
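
For a rough flavor, here is a short, non-exhaustive cheat sheet of some of those tools – the file, directory, and machine names are just placeholders:

$ cp data.txt backupOfData.txt                 # copy a file
$ mkdir newDirectory                           # create a directory
$ cd newDirectory                              # move into a directory
$ ls -lh                                       # list the contents of the current directory
$ scp username@remote.name:~/results.txt .     # copy a file from a remote machine to here
$ grep "potato" *.txt                          # print lines of .txt files that contain "potato"
$ sed -i "s/potato/tomato/g" notes.txt         # replace every "potato" with "tomato" in a file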

Running a sequence of programs in a plain script

Since a bash script is just a sequence of commands that will be executed, we can already do something useful. Instead of sitting at the keyboard and periodically typing the next job in a parameter sweep, we can set up a script to go through all of the jobs we want to submit. Suppose I have a pattern in which I run a simulation, use a different program to analyze the data, and then I repeat for a new parameter set. We could make a bash script that looks like:

#!/bin/bash
# Anything after the first line that starts with a hash is a comment, and won't affect the script
# The commands below used versions (or commits) XX and YY of the two programs
./doSomethingAwesome -n 10000 -i 1000000 -r 1.0
./analyzeTheData -n 10000 -i 1000000 -r 1.0
./doSomethingAwesome -n 10000 -i 1000000 -r 1.1
./analyzeTheData -n 10000 -i 1000000 -r 1.1
./doSomethingAwesome -n 10000 -i 1000000 -r 1.3
./analyzeTheData -n 10000 -i 1000000 -r 1.3
./doSomethingAwesome -n 10000 -i 1000000 -r 1.4
./analyzeTheData -n 10000 -i 1000000 -r 1.4
./doSomethingAwesome -n 10000 -i 1000000 -r 1.5
./analyzeTheData -n 10000 -i 1000000 -r 1.5

We run this, and then sit back and watch the results of these 10 commands execute one after the other (each taking however long it takes). Better yet, we can now leave and grab a coffee while the computer does some work for us. Even better, and more seriously, we now have a file with a record of exactly what code we ran (thanks to the comments we left near the top of the file) and over what parameters.

Summary

Level 1 is a fantastic step forward in power and flexibility. It is also kind of tedious – do you really want to write out by hand every command line argument for every job you plan to run? What if you want to scan over many different parameters in a grid search? It is also not exactly the most human readable – if you had a typo in one line out of a list of tens or hundreds or thousands of simulations, would you even notice?

Surely there’s a better way.

Level 2: More complex scripts

You start out thinking of bash scripts as containing a simple sequence of commands, but bash scripts can use a variety of programmatic concepts and become quite sophisticated. This section is not meant to be at all exhaustive in describing what a bash script can do, but I just want to give a little flavor.

Command line options and variables

First, we can easily set local variables in our bash script: we just write down the name, an equals sign, and a value. One can then refer to these values in later parts of the bash script by prepending a dollar sign. For instance, we could modify our hello.sh example to be:

#!/bin/bash
v1=Hello
v2=world!
echo $v1 $v2

Executing this would, as you might expect, cause “Hello world!” to be printed to the screen. One has to be a little careful, by the way: bash wants to interpret everything as a string and by default uses spaces to separate words. If we wanted to store a string of several words in a variable, we would have to use quotation marks and write, e.g., v2="to the world!". That also means you have to be careful with your spaces: v1= Hello wouldn’t have worked.
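
Here is a quick demonstration of those rules (the variable names are arbitrary):

#!/bin/bash
v1=Hello
v2="to the world!"   # quotes are needed because the value contains spaces
echo $v1 $v2         # prints: Hello to the world!
# v3= Hello          # would NOT work: bash would treat "Hello" as a command to run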

Now, just as many programs can accept command line arguments, so too can we pass command line arguments to a bash script. There are special names for the first nine command line arguments passed to a script, and they are $1, $2,…,$9 (and, of course, there are ways to deal with additional arguments). We could change our hello.sh script now to:

#!/bin/bash
echo $1 $2

Running this with

$ ./hello.sh Hello world!

would cause “Hello world!” to be printed to the screen.
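
If you ever do need more than a handful of arguments, bash also provides $# (the number of arguments the script received) and "$@" (all of the arguments at once). A minimal sketch:

#!/bin/bash
echo "This script received $# arguments"
for arg in "$@"
do
    echo "argument: $arg"
done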

I typically do not recommend relying on command line arguments for bash scripts that execute your scientific code / project workflow. This, again, raises serious issues with reproducibility – how will future you know exactly what command you executed? On the other hand, one can very productively use simple bash scripts for all sorts of automated tasks, and this can be handy to know about. For example, suppose you frequently grab data off of some remote computer – scp is a powerful command-line tool, but it’s a bit tiring to always be typing out the boilerplate scp username@remotename:/path/to/dataFileOrDirectory /pathToDesiredSaveLocation, especially when remotename is itself sometimes long. Here’s a scpGetFromSpecificRemote.sh script you could write to simplify your life:

#!/bin/bash
remoteBaseName="username@remote.name:~/" #point at your home directory, or modify to point somewhere else
pathToRemoteData=$1
pathToLocalSave=$2
recursivelyCopyDirectoryFlag=$3 # an optional command line argument...if nothing is entered $3 will be an empty string, and the next comparison will be false
if [[ $recursivelyCopyDirectoryFlag == "r" ]]; then
    scp -r ${remoteBaseName}${pathToRemoteData} ${pathToLocalSave}
else
    scp ${remoteBaseName}${pathToRemoteData} ${pathToLocalSave}
fi

As you can see, in the script we have hard-coded some information to save us from writing boilerplate parts of a common command, and used some simple control flow to be able to use the same script to either get files or recursively copy whole directories. Now you can run commands like

$ ./scpGetFromSpecificRemote.sh project1/data/coolFile.nc ~/localPathToProject/data

a pattern that kind of mirrors how the cp command works, and is perhaps convenient for us to use. Between control flow, loops (below), and functions (only sketched briefly below), bash scripts can get extremely powerful and complex. Personally, if I find myself writing a particularly complex script I take it as a sign that I should find another way to organize things.
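
In case you’re curious, here is a minimal sketch of what a bash function looks like, reusing the hypothetical programs from the start of this page:

#!/bin/bash
# define a function that runs a simulation and then analyzes it for a given particle number and density
runAndAnalyze () {
    ./doSomethingAwesome -n $1 -r $2
    ./analyzeTheData -n $1 -r $2
}
# call it a few times with different arguments
runAndAnalyze 10000 1.2
runAndAnalyze 10000 1.4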

Loops

One can be much more sophisticated (and programmatic) about this, but at this level we should know that bash makes it easy to write loops inside a script. The basic syntax is something like

for variable in item1 item2 item3
do 
    someCommand
done

These loops can be nested, and this makes for an easy way to loop over a grid of parameters. For instance, here’s a script I once wrote to loop over four different arrays of parameters:

#!/bin/bash
#(some details removed)
n=2048
e=0.01
t=100000
for f in 0 1 2 3 4 
do
for v in 0.00001 0.0001 0.01
do
for p in 3.5 3.6 3.7 3.725 3.75 3.775 3.8 3.81 3.82 
do
for d in 0.01 0.1 1.0 10.0
do
./main.out "-n" $n "-v" $v "-p" $p "-e" $e "-d" $d "-t" $t "-f" $f
done
done
done
done

You can see, here, that the indentation of the code doesn’t matter. But for readability, I should have indented it.

By the way: every time I write a bash script I test it by replacing the command I actually want to run with an “echo” version of the same command; then I can run the script and check the on-screen output to make sure the script will do what I want.
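
For instance, to dry-run the grid-search script above you could temporarily change its innermost command to

echo ./main.out "-n" $n "-v" $v "-p" $p "-e" $e "-d" $d "-t" $t "-f" $f

and the script will simply print all 540 commands it would have executed – much cheaper than discovering a typo halfway through a long set of simulations.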

What if you want to loop over coupled sets of parameters? Like, you want to simulate for a certain amount of time that depends on the temperature of your simulation, and you want to vary the temperature? There are several ways to accomplish that, and here’s one of them:

#!/bin/bash
p=3.8
temperatures=(0.01 0.008 0.005 0.00385 0.0031 0.0025 0.002)
tauEstimates=(200. 400. 1300. 3000. 5000. 8500. 10000.)
for n in 1024 2048 4096 8192 16384 32768
do
    for i in "${!temperatures[@]}";
    do
        ./voronoi.out -p $p -v "${temperatures[i]}" -i "${tauEstimates[i]}" -n $n
    done
done

Here you see a nested loop over system size (n), along with a loop over array indices, which are then used to access the corresponding elements of two different arrays (which are, of course, of the same length). The "${!temperatures[@]}" on the for i… line expands to the list of indices of the array – 0, 1, 2, and so on – rather than to its elements.
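
If that indexing syntax looks cryptic, here is a tiny demonstration of what the various expansions give you:

temperatures=(0.01 0.008 0.005)
echo "${!temperatures[@]}"   # prints the indices: 0 1 2
echo "${#temperatures[@]}"   # prints the number of elements: 3
echo "${temperatures[1]}"    # prints the element at index 1: 0.008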

Summary

Level 2 is extraordinary, and even if we stopped here we would have immense power in the palm of our hands. However, we are still fundamentally executing commands one after the other, in serial fashion, and watching the output scroll past our eyes on the screen (and hopefully saving the data somewhere). We often want to run code over independent parameters – the results of one set of calculations do not depend on what happens for another set of parameters – and our computers are multithreaded beasts. Do we really have to only run one thing at a time? If we want parallelization, do we have to hackily open up multiple terminal windows and be running a different shell in each one to cover parameter space?

Surely there’s a better way.

Level 3: Setting and forgetting (nohup and disown)

Background, nohup, and disown on the command line

Let’s spend just a little bit of time learning about “foreground” and “background” jobs. If you were to execute

$ ./someProgram

then that program would run. You would be able to see any output of the program printed to your screen, but other than hitting <ctrl>-c to halt the program you couldn’t really interact with the shell. One thing you could do is to launch the program in the background by appending an ampersand to the command:

$ ./someProgram &

The program is now a background process, so you get your command line prompt again and you could run another command (perhaps even another command also in the background). However, you will notice that that program still outputs to stdout – i.e., even though it is in the background, it will still print output to your screen. You can stop this behavior by redirecting the output from stdout to some other file. For instance:

$ ./someProgram &> outputFile.txt &

Fantastic. Now your program will run in the background, and anything that would have been printed to screen will instead start populating the text file you just specified. However, the program is still tied to both the shell it is running in and the terminal the shell is running in. What happens if you close your shell? Or disconnect from the remote server that you are running a job on over ssh? Well, your job will be sent a “hangup” signal (SIGHUP) telling it to terminate, much as if you had hit <ctrl>-c on a job you were running normally from the command line.

What we often want to do is set something up to run – on our own machine, or a remote machine – in such a way that we can just set it and forget it, regardless of whether we accidentally close our terminal, or our ssh connection times out, or… The way to accomplish our aims here is to combine the nohup and disown commands, like so:

$ nohup ./someProgram &> outputFile.txt & disown

The way we’ve set this up, both stdout and stderr will be sent to the same outputFile – you can of course configure things to send these to different files.
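
For instance, here is a minimal sketch that keeps the two streams separate (the output file names are just placeholders):

$ nohup ./someProgram > outputFile.txt 2> errorFile.txt & disown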

Incorporating this into bash scripting

This will be short: a bash script is just a sequence of commands, so anything we can do from the command line we can do in our scripts. For instance, slightly modifying something from above:

#!/bin/bash
temperatures=(0.01 0.008 0.005)
for i in "${!temperatures[@]}";
do
    outfile="T${temperatures[i]}output.txt"
    nohup ./voronoi.out -t "${temperatures[i]}"  &> ${outfile} & disown
done

Running this script will launch all of the jobs in the loop simultaneously to the background, and each will have its on-screen output printed into a different file. What happens if you accidentally launch a job with an infinite while loop, or you made some other mistake and want to cancel your jobs? Become familiar with the top and kill commands!
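
A minimal sketch of that clean-up workflow (using the program name from the script above; the pid is a placeholder you would replace with one reported by top or pgrep):

$ top                        # see what is running and how much cpu/memory it is using (press q to quit)
$ pgrep -f voronoi.out       # list the process ids (pids) of the jobs we launched
$ kill 12345                 # ask the process with that pid to terminate
$ pkill -f voronoi.out       # or terminate every process whose command line matches that name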

Summary

Unbelievable – we can now write scripts that, when executed, can loop over arbitrary launches of various programs, and we can decide whether to run those jobs one after the other or simultaneously launch all of them in the background as detached processes. We practically have the power of the sun at our fingertips. But… doing this it’s kind of easy to melt your computer. You probably only want to run X jobs in parallel at a time, so that your computer can still be used for other things. Unfortunately, with the pattern we just learned, you can easily run far more cpu- and RAM-intensive jobs than your system can manage simultaneously.

Surely there’s a better way.

Level 4: Workload managers

There exist many tools for automating and managing computational resources – these often involve scripted ways of submitting jobs to a queue, along with a managing process that is in charge of assigning priority to jobs in that queue. When a system resource becomes available – say, a cpu core or a gpu that the manager knows about – it launches the job at the front of the queue.

Bash scripts as a simple workload manager

Modern versions of Bash make it possible to write a script that solves some of these problems. For instance, if you have a large number of jobs that you want to send to the background, and you want no more than X of them running at any one time, you could write something like the following (note that the wait -p option used below requires a reasonably recent version of Bash):

#!/bin/bash
set -m # Black magic. Just kidding: let some things behave as if this was an interactive shell (like sending jobs to the background)

# maximum number of concurrent processes we want
maxProcs=4
# set up an array that we will put jobIDs (pid's) into
declare -A currentJobs=( )

# here we'll pretend that our jobs are just "sleep for i seconds"
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
do
    #if we are over our limit of concurrent jobs, wait for one to finish
    if (( ${#currentJobs[@]} >= maxProcs )); then
        wait -p finishedJob -n
        # we just grabbed the id of the job that finished; remove it from our array
        unset currentJobs[$finishedJob]
    fi
    #Add another job to the list, send it to the background, and nohup it
    nohup sleep $i &> "testOutput${i}.txt" &
    currentPID=$! # "$!" lets us get the pid of the job we just launched
    currentJobs[$currentPID]=1
done
wait
# Output to screen when this finished... should be 40 seconds
currentTime="$(date +"%T")"
echo "jobs finished ${currentTime}"

This script nohup’s a bunch of jobs, each of which has its output redirected to a different file, and is configured so that no more than maxProcs jobs can run at any one time. If this script were called parallelSubmit.sh one could, furthermore, go ahead and do something like

$ nohup ./parallelSubmit.sh &> scriptOutput.txt & disown

to launch this script on a remote computer, or in a terminal on your computer you are about to close, etc. Let me be up front, here: this is at the limit of my knowledge of working with Bash scripting. Is there a better way to do this? Almost certainly. But this is the point at which I start to think tools other than Bash become helpful.

After all, this is a little bit fragile. What happens if a job exits with an error code? What if you want to change the order in which your jobs will run? What if you want to manage jobs based not only on whether a CPU is available, but on whether a separate GPU resource can be used? It’s all a little fussy, and that’s where – rather than rolling our own workload manager with a complicated bash script – we can turn to professional-grade workload managers.

Standard workload managers

You will interact with these workload managers almost any time you use a cluster. The details vary depending on which workload manager is being used – here’s an example of a common one. It often involves writing a submission script – specifying what program you want to run, what resources it needs, etc. – and then submitting that script to the workload manager, perhaps (in the case of the Slurm scheduler) via a command like

$ sbatch submissionScript.submit

In this context it is often helpful to make use of what we learned in bash scripting to create a base submission script together with a bash script that can loop over parameters and submit many jobs to the queue. Here’s an example that takes a base script, copies a new version of it, then uses the sed command line tool to replace instances of specific placeholder words in the copy with the corresponding variables from the bash script (this particular example submits to the HTCondor workload manager via condor_submit).

#!/bin/bash

for n in 512 2048
do
    for p in 3.8 3.9
    do
        #copy the base file to one with a specialized name
        cp brownianSubmit.submit ./script/Submit_n${n}_p${p}.submit
        #replace any instance of {number} or {perimeter} in the new file with a value
        sed -i "s/{number}/$n/g" ./script/Submit_n${n}_p${p}.submit
        sed -i "s/{perimeter}/$p/g" ./script/Submit_n${n}_p${p}.submit
        #submit the new submission file to the workload manager
        condor_submit ./script/Submit_n${n}_p${p}.submit
        echo ./script/Submit_n${n}_p${p}.submit
    done
done
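
To make the sed step concrete: suppose (purely hypothetically) that one line of brownianSubmit.submit reads arguments = -n {number} -p {perimeter}. You can check what the substitutions do on a throwaway copy:

$ echo "arguments = -n {number} -p {perimeter}" > demo.submit
$ sed -i "s/{number}/512/g" demo.submit
$ sed -i "s/{perimeter}/3.8/g" demo.submit
$ cat demo.submit
arguments = -n 512 -p 3.8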

Installing a workload manager locally

I’m about to blow your mind: although you will usually encounter workload managers in the context of working on a cluster, nothing is stopping you from installing and running one on your own machine! In fact, setting them up in the context of a single computer (rather than a network of interconnected computers) is pretty easy. Once you install one and launch it, there will be a scheduler (workload manager) process running in the background, waiting for you to assign it tasks in a queue. You can easily configure them to only ever use a maximum amount of your system’s RAM, or a maximum number of threads, etc.

Summary

The world is now our oyster (is that better or worse than having the power of the sun at our fingertips?). But there is at least one more level we can aspire to. What if we want to create a queue of jobs that are not independent of each other? Say, we want to run analysis scripts only after a corresponding simulation finishes? Or we want to run additional simulations contingent on the outcome of some analysis script we ran on a different simulation’s output data? What if we want something so robust that not only will a job not stop if we disconnect from an ssh session, but it will also automatically pick up where it left off if there is a power outage (or, more plausibly, our IT department forcibly restarts our machines to install a security update)? What if we have our whole project mapped out, and we want the ultimate “set it and forget it” ability?

Surely there is a better way.

Level 5: Workflow management systems

Welcome to the world not of Workload managers, but of Workflow management systems (WMSs). These are able to do all of the things described in the preceding summary, and they are amazing tools. They handle job interdependencies (typically by describing a workflow as a directed acyclic graph, or DAG), frequently have checkpointing and job-restarting capabilities, etc. You’ll see them a lot in the context of truly large-scale computational work – for instance, millions of interdependent jobs, where it is a practical certainty that during the execution of at least one of them a system will crash, or a power outage will occur, or…

Sample WMS and python scripting

A WMS might be a bit of overkill for your specific use case, but once again it is completely possible (and, in some cases, even reasonable) to install such a thing on your own workstation. One that I have used myself for a few projects and that I like is Pegasus, which is conveniently also offered on all ACCESS NSF supercomputing clusters. Pegasus is built around describing a DAG of jobs in python (which is nice – it is much easier to write programmatic control flow in python than in bash!), which it can then submit to the queue of a Workload manager, gracefully handling all of the dependencies, job recoveries, etc. Of course, with such great power comes, well… a bit more complexity in setting everything up and writing the scripts. Here’s a simplified version of a python file I once wrote to do this in the context of a simulation job followed by an analysis job, ensuring that every analysis job would only run after its parent simulation job successfully finished:

#!/usr/bin/env python
import os
import pwd
import sys
import time
from Pegasus.DAX3 import *

USER = pwd.getpwuid(os.getuid())[0]
basedir=os.getcwd()

# Create an abstract DAG
dax = ADAG("voroML")
# Add some workflow-level metadata
dax.metadata("creator", "%s@%s" % (USER, os.uname()[1]))
dax.metadata("created", time.ctime())

# Paths to the two executables I'll use
executableName="/home/sussman/repos/voroML/saveTrajectory.out"
executableName2="/home/sussman/repos/voroML/computePhop.out"
# define the executables
saveTraj = Executable(name="saveTraj", arch="x86_64", installed=False)
saveTraj.addPFN(PFN("/"+executableName,"local"))
pHop = Executable(name="phop", arch="x86_64",installed=False)
pHop.addPFN(PFN("/"+executableName2,"local"))

#tell pegasus not to cluster jobs together
saveTraj.addProfile(Profile(Namespace.PEGASUS,"clusters.size",1))
pHop.addProfile(Profile(Namespace.PEGASUS,"clusters.size",1))
dax.addExecutable(saveTraj)
dax.addExecutable(pHop)

# A janky way to describe the sets of parameters I want to eventually loop over...I removed a lot of stuff here
pvListPairs = []
pvListPairs.append( (3.75,  [0.000016, 0.000033,.000055,.00016,.00033,.00055,.0016,.0033,.0055]) )
pvListPairs.append( (3.8,  [0.0000071, 0.000014,.000024,0.000071, 0.00014,.00024,0.00071, 0.0014,.0024]) )

nn=5000
window=5
tt=5000000
#Do multiple runs for each parameter set
for fidx in range(5):
    for pvList in pvListPairs:
        p0=pvList[0]
        for v0 in pvList[1]:
            # Describe jobs and pass arguments to programs
            stJob=Job(name="saveTraj")
            stJob.addArguments("-f", "%i" % fidx,
                       "-n", "%i" % nn,
                       "-t", "%i" % tt,
                       "-v", "%f" % v0,
                       "-p", "%f" % p0
                       )
            phJob=Job(name="phop")
            phJob.addArguments("-f", "%i" % fidx,
                       "-n", "%i" % nn,
                       "-v", "%f" % v0,
                       "-p", "%f" % p0,
                       "-w", "%i" % window
                       )
            dax.addJob(stJob)
            dax.addJob(phJob)

            #Describe the simplest "A depends on B" relationship in the graph
            dax.depends(parent=stJob,child=phJob)

# Now that the DAG is built, output information in a format pegasus can use to run everything
f = open("baseDax.xml", "w")
dax.writeXML(f)
f.close()

Summary

Pretty cool.