DISCLAIMER: This guide is still in progress. The example scripts below are not very robust, and while they will usually work they do not incorporate best practices for error checking and argument handling. Think of them more as indicative of what can be done than scripts that should be copied verbatim.
Having learned about the basics of bash scripting, we’re ready to think harder about automating some of the more complex, interdependent sets of computational tasks that we encounter in our research. We’ll start with what’s typically called workload management, and then move on to workflow management.
Workload management
There exist many tools for automating and managing computational resources – these often involve scripted ways of submitting jobs to a queue, along with a managing process that is in charge of assigning priority to jobs in that queue. When a system resource becomes available – say, a cpu core or a gpu that the manager knows about – it launches the job at the front of the queue.
Bash scripts as a simple workload manager
NOTE: The functioning of the scripts in this section depends on the version of bash you have, your operating system, etc. (see the comments at the end of this subsection about this all being a little bit fragile!). I’ll present two different scripts that do the same thing; I suggest you test each on your system if you want to use bash as a workload manager in this way!
Modern versions of Bash make it possible to write a script that solves some of these problems. For instance, if you have a large number of jobs that you want to send to the background and you want no more than X of them to run at any one time, you could write something like the following:
#!/bin/bash
set -m # Black magic. Just kidding: let some things behave as if this were an interactive shell (like sending jobs to the background)
maxProcs=4 # maximum number of concurrent processes we want
declare -A currentJobs=( ) # set up an associative array that we will put job IDs (pids) into

# here we'll pretend that our jobs are just "sleep for i seconds"
for i in {1..16}
do
    # if we are at our limit of concurrent jobs, wait for one to finish
    if (( ${#currentJobs[@]} >= maxProcs )); then
        wait -n -p finishedJob
        # we just grabbed the pid of the job that finished; remove it from our array
        unset currentJobs[$finishedJob]
    fi
    # add another job to the list, send it to the background, and nohup it
    nohup sleep $i &> "testOutput${i}.txt" &
    currentPID=$! # "$!" lets us get the pid of the job we just launched
    currentJobs[$currentPID]=1
done
wait

# Output to screen when this finishes... should be 40 seconds
currentTime="$(date +"%T")"
echo "jobs finished ${currentTime}"
This script nohup’s a bunch of jobs, each of which has its output redirected to a different file, and is configured so that no more than maxProcs jobs can run at any one time. If this script were called parallelSubmit.sh, one could furthermore go ahead and do something like
$ nohup parallelSubmit.sh &> scriptOutput.txt & disown
to launch this script on a remote computer, or in a terminal on your own machine that you are about to close, etc. Is there a better way to do this? Almost certainly. For instance, here is a seemingly simpler script that (a) may work on older versions of bash and linux for which the “wait -p” functionality doesn’t behave as we want it to, but (b) might not maintain the level of parallelization you are targeting if two jobs finish at the same time:
#!/bin/bash
# maximum number of concurrent processes we want
maxProcs=4
launchedJobs=0

# here we'll pretend that our jobs are just "sleep for i seconds"
for i in {1..100}
do
    (( launchedJobs >= maxProcs )) && wait -n
    sleep $i > "outputLog${i}.txt" & ((launchedJobs++))
done
wait # don't exit until the last jobs finish
The above launches jobs (here, again, just “sleep”), incrementing the counter each time. Once the number of launched jobs is greater than or equal to the limit, the loop uses the wait command to wait for one of them to finish before launching the next. Let me be up front here: this is at the limit of my knowledge of Bash scripting; it is at this level of workflow complexity that I start to think tools other than Bash become helpful.
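To give a sense of what that can look like, here is a minimal sketch of the same “at most N concurrent jobs” pattern in Python, using only the standard library; the calls to time.sleep are just stand-ins for launching real work (which you might do via subprocess instead):

```python
from concurrent.futures import ThreadPoolExecutor
import time

maxProcs = 4  # maximum number of concurrent jobs, as in the bash scripts

def run_job(i):
    # stand-in for a real command; replace with e.g. subprocess.run([...])
    time.sleep(0.01 * i)
    return i

# the executor maintains the pool for us: at most maxProcs jobs run at once,
# and a new one is started as soon as any running job finishes
with ThreadPoolExecutor(max_workers=maxProcs) as pool:
    finished = list(pool.map(run_job, range(1, 17)))

print("jobs finished:", finished)
```

Note that pool.map returns results in submission order, so no bookkeeping of pids or counters is needed.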
After all, rolling our own workload manager this way is a little bit fragile. What happens if a job exits with an error code? What if you want to change the order in which your jobs will run? What if you want to manage jobs based not only on whether a CPU is available, but on whether a separate GPU resource can be used? It’s all a little fussy, and that’s where – rather than rolling our own workload manager with an ever-more-complicated bash script – we can turn to professional-grade workload managers.
Standard workload managers
You will interact with these workload managers almost any time you use a cluster. The details vary depending on which workload manager is being used. Typically you write a submission script – specifying what program you want to run, what resources it needs, etc. – and then submit that script to the workload manager, perhaps (in the case of Slurm, a common one) via a command like
$ sbatch submissionScript.submit
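To give a flavor of what such a submission script contains, here is a minimal sketch of a Slurm script; the job name, resource requests, and the program being run are all placeholder values, not anything from a real project:

```shell
#!/bin/bash
#SBATCH --job-name=exampleJob    # a human-readable name for the job
#SBATCH --ntasks=1               # run a single task...
#SBATCH --cpus-per-task=4        # ...that gets four cores
#SBATCH --mem=4G                 # total memory requested
#SBATCH --time=01:00:00          # wall-clock limit, HH:MM:SS
#SBATCH --output=slurm_%j.txt    # %j expands to the numeric job id

./myProgram.out -n 512 -p 3.8    # hypothetical program and arguments
```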
In this context it is often helpful to use what we learned in bash scripting to create a base submission script together with a bash script that loops over parameters and submits many jobs to the queue. Here’s an example that takes a base script, copies a new version of it, and then uses the sed command-line tool to replace instances of specific placeholder words in the new file with the corresponding variable in the bash script.
#!/bin/bash
for n in 512 2048
do
    for p in 3.8 3.9
    do
        # copy the base file to one with a specialized name
        cp brownianSubmit.submit ./script/Submit_n${n}_p${p}.submit
        # replace any instance of number or perimeter in the new file with a value
        sed -i "s/{number}/$n/g" ./script/Submit_n${n}_p${p}.submit
        sed -i "s/{perimeter}/$p/g" ./script/Submit_n${n}_p${p}.submit
        # submit the new submission file to the workload manager
        condor_submit ./script/Submit_n${n}_p${p}.submit
        echo ./script/Submit_n${n}_p${p}.submit
    done
done
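For this loop to work, the base brownianSubmit.submit file must contain the literal placeholders {number} and {perimeter}. Since the loop submits with condor_submit, the base file is an HTCondor submit description; a minimal sketch might look like the following, where the executable name and output-file naming scheme are guesses for illustration, not the original project’s files:

```
executable   = brownianSimulation.out
arguments    = -n {number} -p {perimeter}
output       = brownian_n{number}_p{perimeter}.out
error        = brownian_n{number}_p{perimeter}.err
log          = brownian_n{number}_p{perimeter}.log
request_cpus = 1
queue
```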
Installing a workload manager locally
I’m about to blow your mind: although you will usually encounter workload managers in the context of working on a cluster, nothing is stopping you from installing and running one on your own machine! In fact, setting one up on a single computer (rather than a network of interconnected computers) is pretty easy. Once you install and launch one, a scheduler (workload manager) process will run in the background, waiting for you to assign it tasks in a queue. You can easily configure it to use at most a certain amount of your system’s RAM, a maximum number of threads, and so on.
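For instance, with HTCondor installed and running locally, interacting with your personal queue uses exactly the same commands as on a cluster (the submit-file name and job number below are just illustrative):

```
$ condor_submit mySubmitFile.submit   # add a job to the queue
$ condor_q                            # view the state of the queue
$ condor_rm 123                       # remove job 123 from the queue
```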
Summary
The world is now our oyster (is that better or worse than having the power of the sun at our fingertips?). But there is at least one more level we can aspire to. What if we want to create a queue of jobs that are not independent of each other? Say, we want to run analysis scripts only after a corresponding simulation finishes? Or we want to run additional simulations contingent on the outcome of some analysis script we ran on a different simulation’s output data? What if we want something so robust that not only will a job not stop if we disconnect from an ssh session, but it will also automatically pick up where it left off if there is a power outage (or, more plausibly, our IT department forcibly restarts our machines to install a security update)? What if we have our whole project mapped out, and we want the ultimate “set it and forget it” ability?
Surely there is a better way.
Workflow management systems
Welcome to the world not of workload managers but of workflow management systems (WMSs). These can do all of the things described in the preceding summary, and they are amazing tools. They handle job interdependencies (typically by describing a workflow as a directed acyclic graph, or DAG), frequently have checkpointing and job-restarting capabilities, etc. You’ll see them a lot in the context of truly large-scale computational work – for instance, millions of interdependent jobs, where it is a practical certainty that during the execution of at least one of them a system will crash, or a power outage will occur, or…
Sample WMS and python scripting
A WMS might be a bit of overkill for your specific use case, but once again it is completely possible (and, in some cases, even reasonable) to install such a thing on your own workstation. One that I have used myself for a few projects, and that I like, is Pegasus, which is conveniently also offered on all NSF ACCESS supercomputing clusters. Pegasus is built around describing a DAG of jobs in python (which is nice – it is much easier to write programmatic control flow in python than in bash!); it can then submit the resulting workflow to the queue of a workload manager, gracefully handling all of the dependencies, job recoveries, etc. Of course, with such great power comes, well… a bit more complexity in setting everything up and writing the scripts. Here’s a simplified version of a python file I once wrote to do this in the context of a simulation job followed by an analysis job, ensuring that every analysis job would run only after its parent simulation job successfully finished:
#!/usr/bin/env python
import os
import pwd
import sys
import time
from Pegasus.DAX3 import *

USER = pwd.getpwuid(os.getuid())[0]
basedir = os.getcwd()

# Create an abstract DAG
dax = ADAG("voroML")

# Add some workflow-level metadata
dax.metadata("creator", "%s@%s" % (USER, os.uname()[1]))
dax.metadata("created", time.ctime())

# Paths to the two executables I'll use
executableName = "/home/sussman/repos/voroML/saveTrajectory.out"
executableName2 = "/home/sussman/repos/voroML/computePhop.out"

# define the executables
saveTraj = Executable(name="saveTraj", arch="x86_64", installed=False)
saveTraj.addPFN(PFN(executableName, "local"))
pHop = Executable(name="phop", arch="x86_64", installed=False)
pHop.addPFN(PFN(executableName2, "local"))

# tell pegasus not to cluster jobs together
saveTraj.addProfile(Profile(Namespace.PEGASUS, "clusters.size", 1))
pHop.addProfile(Profile(Namespace.PEGASUS, "clusters.size", 1))
dax.addExecutable(saveTraj)
dax.addExecutable(pHop)

# A janky way to describe the sets of parameters I want to eventually loop over... I removed a lot of stuff here
pvListPairs = []
pvListPairs.append((3.75, [0.000016, 0.000033, .000055, .00016, .00033, .00055, .0016, .0033, .0055]))
pvListPairs.append((3.8, [0.0000071, 0.000014, .000024, 0.000071, 0.00014, .00024, 0.00071, 0.0014, .0024]))
nn = 5000
window = 5
tt = 5000000

# Do multiple runs for each parameter set
for fidx in range(5):
    for pvList in pvListPairs:
        p0 = pvList[0]
        for v0 in pvList[1]:
            # Describe jobs and pass arguments to programs
            stJob = Job(name="saveTraj")
            stJob.addArguments("-f", "%i" % fidx,
                               "-n", "%i" % nn,
                               "-t", "%i" % tt,
                               "-v", "%f" % v0,
                               "-p", "%f" % p0
                               )
            phJob = Job(name="phop")
            phJob.addArguments("-f", "%i" % fidx,
                               "-n", "%i" % nn,
                               "-v", "%f" % v0,
                               "-p", "%f" % p0,
                               "-w", "%i" % window
                               )
            dax.addJob(stJob)
            dax.addJob(phJob)
            # Describe the simplest "A depends on B" relationship in the graph
            dax.depends(parent=stJob, child=phJob)

# Now that the DAG is built, output information in a format pegasus can use to run everything
with open("baseDax.xml", "w") as f:
    dax.writeXML(f)
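Generating baseDax.xml is only half the story: the DAX then has to be planned and handed off to the workload manager. With the DAX3-era API this looked roughly like the following, though the exact flags depend entirely on how your Pegasus site and replica catalogs are configured, so treat this as a sketch rather than a copy-paste recipe (makeDax.py is just a stand-in name for the script above):

```
$ python makeDax.py                       # writes baseDax.xml
$ pegasus-plan --dax baseDax.xml --submit # plan the workflow and submit it
```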
Summary
Pretty cool.