BlueBEAR Job Submission

If you have any problems using the cluster, please open a Service Desk call and mention 'BlueBEAR' in the description, giving as much detail as possible about the problem. See also the knowledge base article How to submit a good fault report.

Jobs on the cluster are under the control of a scheduling system (Slurm) to optimise the throughput of the service. This means that some classes of work, for example work that requires repeated access to specified nodes, are not appropriate for this service. IT Services may be able to offer alternative facilities for this type of work; contact the IT Service Desk in the first instance to discuss any such requirements. The scheduling system is configured to offer an equitable distribution of resources over time to all users. The key means by which this is achieved are:

  • Jobs are scheduled according to the QOS (Quality of Service) and the resources that are requested. See the Job Script Options section of this help page for more details.
  • Jobs are not necessarily run in the order in which they are submitted. The scheduling system runs a fair-share policy, whereby users associated with projects that have not used the system for some time are given a higher priority than those associated with projects that have recently made heavy use of the system. This means that no single project can block the system simply by submitting multiple jobs.
  • Jobs requiring a large number of cores and/or long walltime will have to queue until the requested resources become available. The system will run smaller jobs until all of the resources that have been requested for the job become available - this is known as backfill - and hence it is beneficial to specify a realistic walltime for a job.

There are limits on the resources available to individual jobs and to the overall resources in use at any time by a single user. See the Resource Limits section of this page for details.

Note:

  • All commands on this page are indicated by the prompt symbol $. The prompt symbol does not need to be copied when copying a command into your terminal.
  • If you were a user of BlueBEAR prior to June 2017, you are probably familiar with the old scheduler MOAB. See the Migrating Old Job Scripts section for assistance on converting your old job scripts to Slurm.

Job Handling

Submitting a job

The command to submit a job is sbatch. This reads its input from a file. The job is submitted to the scheduling system, using the requested resources, and will run on the first available node(s) that are able to provide the resources requested. For example, to submit the set of commands contained in the file myscript, use the command:

$ sbatch myscript

The system will return a job number, for example:

Submitted batch job 55260

Slurm is aware of your current working directory when submitting the job so there is no need to manually specify it.

Upon completion of the job, there will be a single output file in the directory from which you submitted the job, named following the pattern slurm-<jobID>.out.

So, for the job above the output file would be:

slurm-55260.out

This file contains both standard output and standard error. You can direct standard error to a separate file using the --error option in your job script (see Job Script Options for more details about how to add job options). For example, adding the following line to your job script would produce the error file slurm-55260.err:

#SBATCH --error=slurm-%j.err

%j is automatically converted to the Slurm job ID.
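
If you also want to rename the standard output file, the --output option works in the same way. For example, the following pair of lines (the filename prefix here is illustrative) would write standard output and standard error to separate files named after the job ID:

#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err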

Cancelling a job

To cancel a queued or running job use the scancel command and supply it with the job ID that is to be cancelled. For example, to cancel the previous job:

$ scancel 55260
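
scancel can also cancel jobs in bulk. For example, to cancel all of your own queued and running jobs (use this with care):

$ scancel --user=${USER}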

Job Script Options

A collection of example job scripts is provided on BlueBEAR, located in the directory given by the following environment variable:

${BB_EXAMPLES}

Instructions on how to interact with the examples can be found in the README file within the above directory.

Application-specific examples can be found on the BlueBEAR Applications page.

An example job script (${BB_EXAMPLES}/matlab/sbatch.sh)

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 5:0
#SBATCH --qos bbshort
#SBATCH --mail-type ALL

set -e

module purge; module load bluebear # this line is required
module load apps/matlab/r2017a

matlab -nodisplay -r cvxtest

An explanation of the above job script

  • #!/bin/bash - run the job using GNU Bourne Again Shell (the same shell as the logon nodes).
  • #SBATCH --ntasks 1 - run the job with one core.
  • #SBATCH --time 5:0 - run the job with a walltime of 5 minutes.
  • #SBATCH --qos bbshort - run the job in the bbshort QOS. See the Job QOS section for more details.
  • #SBATCH --mail-type ALL - Notify a user about all events regarding the job.
  • set -e - This will make your script fail on the first error. This is recommended for debugging.
  • module purge; module load bluebear - This line is required for reproducibility of results because it makes sure that only the default BlueBEAR modules are loaded at this point in the script. This will ignore any modules loaded in files like .bashrc. Include this line before any other module load statements in job scripts.
  • module load apps/matlab/r2017a - Loads the Matlab 2017a module into the environment.
  • matlab -nodisplay -r cvxtest - The command to run the Matlab example.

If you wish to run this example, follow these instructions:

$ cp -r ${BB_EXAMPLES}/matlab ~/matlab-test
$ cd ~/matlab-test
$ sbatch sbatch.sh

The above is a simple example; the options and commands can be as complex as necessary.

All of the options that can be set can be viewed on Slurm's documentation for sbatch.

You can add any of these options as command line arguments but it is recommended that you add all of your options into your job script for ease of reproducibility.

Job QOS

Job QOS is similar to the concept of 'Job Queues'.

There are four QOS to which you can submit jobs. These are:

  • bbdefault
  • bbshort
  • bblargemem
  • bbgpu

You can specify which QOS you want by adding the following line to your job script:

#SBATCH --qos bbshort

All QOS (with the exception of bbshort which is 10 minutes) have a maximum walltime of 10 days.
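
Walltime is requested with the --time option; Slurm accepts formats such as minutes, hours:minutes:seconds, or days-hours:minutes:seconds. For example, to request a walltime of three days:

#SBATCH --time 3-0:0:0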

bbdefault

This is the default QOS if no --qos is provided in the job script. This QOS is made up of different types of node such as:

  • 120GB RAM, 24 cores
  • 120GB RAM, 20 cores
  • 32GB RAM, 16 cores

You can request the amount of memory you need using either --mem (memory per node) or --mem-per-cpu (memory per core). If neither of these options is included in your job then --mem-per-cpu is set to 4096MB by default.

You can specify these in megabytes (M), gigabytes (G) or terabytes (T) with the default unit being M if none is given. For example:

#SBATCH --mem 10G # 10GB
or
#SBATCH --mem 500 # 500MB
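
Memory can alternatively be requested per core using --mem-per-cpu, which scales with the number of tasks requested; for example (the value here is illustrative):

#SBATCH --mem-per-cpu 8G # 8GB per requested core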

bbshort

This QOS contains all nodes in the cluster and is the fastest way to get your job to run. The maximum walltime is 10 minutes.

bblargemem

This QOS contains a mixture of large memory nodes. Some of these nodes have a maximum of 249GB of RAM and others have a maximum of 498GB of RAM. It is important to request only the memory that you need.
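
For example, a job needing more memory than the standard nodes provide might include the following lines (the memory value is illustrative; request only what you need):

#SBATCH --qos bblargemem
#SBATCH --mem 200G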

If you see the following error, it is because you must be registered to use bblargemem:

sbatch: error: Batch job submission failed: Invalid qos specification

bbgpu

This QOS contains the following GPUs, which can be specified using the --gres option. The Tesla P100 architecture is compute-focused, which makes it more suitable for machine learning or network-based applications, while the Quadro architecture is more appropriate for rendering tasks.

Add one of the following #SBATCH lines to your job script:

  • Nvidia Tesla P100 (#SBATCH --gres gpu:p100:1)
  • Nvidia Quadro 5000 (#SBATCH --gres gpu:q5000:1)
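
For example, a minimal pair of headers for a job requesting a single Tesla P100 would combine the bbgpu QOS with the GPU request:

#SBATCH --qos bbgpu
#SBATCH --gres gpu:p100:1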

If you see the following error, it is because you must be registered to use bbgpu:

sbatch: error: Batch job submission failed: Invalid qos specification

Dedicated Resources

Some research groups have dedicated resources in BlueBEAR. To see what jobs are running in your dedicated QOS run:

        view_qos <name>

Replace <name> with the name of your QOS. This can only be run if you are a member of that QOS.

CaStLeS Resources

The University has invested substantial funds in an initiative called CaStLeS. As part of this there is a dedicated QOS for CaStLeS. These resources are reserved exclusively for the use of research groups carrying out research in the life sciences, and are governed by academics through the CaStLeS Executive and Strategic Oversight Group.

To access the large memory CaStLeS node use:

        #SBATCH --mem 500G

To access the Nvidia Tesla P100 node use:

        #SBATCH --gres gpu:p100:1

Monitoring Your Jobs

There are a number of ways to monitor the current status of your job. You can view what's going on by issuing any one of the following commands:

squeue

$ squeue -j 76621

squeue is Slurm's command for viewing the status of your jobs. This shows information such as the job's ID and name, the QOS used (the "Partition"), the user that submitted the job, time elapsed and the number of nodes being used.
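
To list all of your own jobs rather than a single job, pass your username instead of a job ID, for example:

$ squeue --user=${USER}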

showq

$ showq

If you were a user of BlueBEAR before June 2017 then showq should be familiar already. It is now a wrapper around squeue, with the output converted to match that of the old showq command. This shows active, eligible and other jobs: active jobs are currently running, eligible jobs are waiting to run, and other jobs are jobs that have recently completed or been cancelled.

sacct

$ sacct -j 55620

sacct shows accounting data on the job pulled from Slurm's accounting database. This can be used to view information on historical jobs for which commands like squeue or scontrol would not show any information.
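
The fields displayed can be controlled with sacct's --format option. For example, to show a compact summary of a job:

$ sacct -j 55620 --format=JobID,JobName,Elapsed,State,MaxRSS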

scontrol

$ scontrol show jobid=55620

scontrol is a powerful interface that provides an advanced amount of detail regarding the status of your job. The show command within scontrol can be used to view details regarding a specific job.

Personal preference and circumstances will dictate which one you need to use.

Use Local Disk Space

If a job uses significant I/O (Input/Output) then files should be created using the local disk space and only written back to the final directory when the job is completed. This is particularly important if a job makes heavy use of disk for scratch or work files. Heavy I/O to GPFS filestore such as the personal filestore can cause poor interactive response on the cluster for all users.

There is a directory /scratch which is local to each node that can be used for temporary files that are associated with a job.

For jobs that are running on a single node, this filestore can also be used for input and output files for a job. Since /scratch is not shared across nodes it cannot be used for parallel jobs that use multiple nodes where all of the processes need to be able to read from or write to a shared filestore; it can be used for multi-core jobs on a single node.

To use scratch space, include the following lines at the start of your job script (after the #SBATCH headers):

BB_WORKDIR=$(mktemp -d /scratch/${USER}_${SLURM_JOBID}.XXXXXX)
export TMPDIR=${BB_WORKDIR}

These two lines create a directory named using your username (USER) and the ID of the running job (SLURM_JOBID), followed by a period and six random characters; the use of random characters is recommended for security reasons. The directory path is then exported to the environment as TMPDIR, which many applications use for temporary storage while they are running.

If you want to copy your files back to the current directory at the end, insert this line:

test -d ${BB_WORKDIR} && /bin/cp -r ${BB_WORKDIR} ./

This checks that the directory still exists (using test -d) then copies it and everything in it (using -r, for recursive copy) to the current directory (.) if it does.

Then, at the end of your job script, insert the line:

test -d ${BB_WORKDIR} && /bin/rm -rf ${BB_WORKDIR}

This checks that the directory still exists (using test -d) then removes it if it does.
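
Putting these pieces together, the following is a minimal sketch of a single-core job script that stages temporary files through local scratch (the application name, input file and walltime are purely illustrative):

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 2:0:0

set -e
module purge; module load bluebear

# Create a per-job working directory on the node-local /scratch disk
BB_WORKDIR=$(mktemp -d /scratch/${USER}_${SLURM_JOBID}.XXXXXX)
export TMPDIR=${BB_WORKDIR}

# Run the application; anything it writes to TMPDIR stays on local disk
./my_application input.dat

# Copy the scratch directory back to the submission directory and tidy up
test -d ${BB_WORKDIR} && /bin/cp -r ${BB_WORKDIR} ./
test -d ${BB_WORKDIR} && /bin/rm -rf ${BB_WORKDIR}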

Associate Jobs with Projects

Every job has to be associated with a project to ensure the equitable distribution of the limited resources on the cluster. Project owners and members will have been issued a project code for each registered project, and only usernames authorised by the project owner will be able to run jobs using that project code.

In most cases a username on the cluster is associated with only one project; any jobs submitted by that user will be associated with their sole project, and no project code needs to be specified in the job script.

If a user is registered on more than one project then the project should be specified using the --account option followed by the project code (Slurm refers to projects as accounts). For example, if your project is project01 then add the following line to your job script:

#SBATCH --account=project01

If a job is submitted using an invalid project, either because the project does not exist or the username is not authorised to use that project, then the job will be rejected with the following error:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

To view the project that the job is running under, issue the command (assuming a job ID of 55620):

$ sacct -j 55620

Array Jobs

Array jobs are an efficient way of submitting many similar jobs that perform the same work using the same script but on different data sets. Sub-jobs are the jobs created by an array job and are identified by an array job ID and an index. For example, in the identifier 55620_1, the number 55620 is the array job ID and 1 is the index of the sub-job.

In Slurm, there are different environment variables that can be used to dynamically keep track of these identifiers.

  • SLURM_ARRAY_JOB_ID will be set to the first job ID of the array.
  • SLURM_ARRAY_TASK_ID will be set to the job array index value.
  • SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array.
  • SLURM_ARRAY_TASK_MAX will be set to the highest job array index value.
  • SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value.

Example array job (${BB_EXAMPLES}/array)

#!/bin/bash
#SBATCH --time 5:0
#SBATCH --qos bbshort
#SBATCH --array 2-500

set -e
module purge; module load bluebear
module load apps/python/2.7.11
cp ${BB_EXAMPLES}/array/analyse.py .
echo "${SLURM_JOB_ID}: Job ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX} in the array"
python analyse.py ${SLURM_ARRAY_TASK_ID}

  • #SBATCH --array 2-500 tells Slurm that this job is an array job and that it should run 499 sub jobs (with IDs 2, 3, 4, ..., 498, 499, 500).
  • SLURM_ARRAY_TASK_COUNT will be set to 499.
  • SLURM_ARRAY_TASK_ID will be set, in each sub-job, to that sub-job's index (2 for the first sub-job, 3 for the next, and so on up to 500).
  • SLURM_ARRAY_TASK_MIN will be set to 2.
  • SLURM_ARRAY_TASK_MAX will be set to 500.
  • SLURM_ARRAY_JOB_ID will be set to the job ID provided by running the sbatch command.

The above job is similar to the one that can be found in the Job Script Options section. Visit that section for a detailed explanation of the workings of a script similar to the one above.

Visit the Job Array Support section of the Slurm documentation for more details on how to carry out an array job.
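
A common pattern is to use the task index to select a different input file for each sub-job, so that every sub-job processes its own piece of data. For example (the file naming scheme and program name below are purely illustrative):

# Each sub-job reads the input file matching its array index
./my_program input_${SLURM_ARRAY_TASK_ID}.dat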

Multi-core Jobs

There are several ways to run a multi-core job using Slurm but the methods we recommend are to use the options --ntasks and --nodes.

For most people, it is enough to specify only --ntasks. This is because Slurm is sophisticated enough to take your job and spread it across as many nodes as necessary to meet the number of cores that the job requires. This is practical for OpenMPI jobs and means that the cluster is used more efficiently (as multiple users can share nodes). For example, adding the following line will request a job with 20 cores on a dynamic number of nodes:

#SBATCH --ntasks 20

The default setting --cpus-per-task=1 means that the above will request 20 cores.

For jobs that require multiple cores but must remain on a certain number of nodes (e.g. OpenMP jobs, which can only run on a single node), the option --nodes can be specified with a minimum and maximum number of nodes. For example, adding the following line will ensure that only one node is requested:

#SBATCH --nodes 1

Whilst this will request a minimum of one and a maximum of two nodes:

#SBATCH --nodes 1-2

The environment variable SLURM_NTASKS is set to the number of tasks requested.  For single node, multi-core jobs (e.g. threaded applications) this can be used directly to make the software fit the resources assigned to the job.  For example:

./my_application --threads=$SLURM_NTASKS
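
For threaded OpenMP applications running on a single node, the same variable can be used to set the standard OMP_NUM_THREADS environment variable (the application name here is illustrative):

export OMP_NUM_THREADS=${SLURM_NTASKS}
./my_openmp_application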

It is important to note that not all jobs scale well over multiple cores; it is not always the case that doubling the cores doubles the speed. Software must be written to utilise extra cores, so make sure the application you wish to run can make use of the cores you request. If you are not certain, submit some short test runs of your job and check whether they perform as expected.

Note that requesting more cores will likely mean that you need to queue for longer before the job can start. Also, using the --nodes option may mean needing to queue for longer to find the desired resources compared to not using it.

An efficient way to parallelise your work without needing to know whether your application supports multi-threading is to use array jobs. See the Array Jobs section for more details on this.

Memory-bound Jobs

Some applications benefit from having exclusive access to all of the memory on a node while only using some cores (e.g. VASP). The way to ask for this is to use a combination of --nodes, --mem and --exclusive.

For example, if you want exclusive use of a single node with 100GB of memory and 10 cores, then include the following lines in your job script:

#SBATCH --nodes 1
#SBATCH --mem 100gb
#SBATCH --ntasks 10
#SBATCH --exclusive

Slurm will allocate the resources requested and prevent any other jobs from sharing that node while the job is running. This should only be used if it is necessary for your application as widespread exclusive use of nodes will increase queue time for all users.

Processor Types

BlueBEAR is made up of nodes with different processors.

Q) Which processor type will my job use?
A) By default, your job will be allocated to use any of the available processor types.

Q) Will my job be split across multiple different processor types?
A) No. Your job will run on nodes that all have the same processor type.

Q) Should I choose a processor type?
A) It is not normally necessary to choose a processor type. But if your job has particular performance or processor instruction needs (see below) then you may wish to do so.

These nodes have different characteristics as follows:

Haswell/Broadwell

In 2017 we launched our water-cooled nodes with Haswell and Broadwell CPUs. Haswell and Broadwell use the same micro-architecture and so are both available as "haswell" in BlueBEAR:

#SBATCH --constraint haswell

Sandybridge

Also available are nodes from the previous generation of BlueBEAR with Sandybridge CPUs, to provide extra compute power:

#SBATCH --constraint sandybridge

As the Sandybridge nodes only have 64GB of RAM, we recommend setting the following to make efficient use of these resources:

#SBATCH --mem-per-cpu 3700

Resource Limits

All QOS have a maximum walltime of 10 days per job (with the exception of bbshort which is 10 minutes).

Additionally, as of November 2018, each user is limited to using the following (per shared QOS) at any one time:

  • 300 cores
  • 3TB of memory

This is summed across all running jobs. There are no additional limits on the sizes of individual jobs. If any limit is exceeded, any future jobs (for that QOS) will remain queued until the usage falls below these limits again.

Different limits may be set on QOS relating to user-owned resources, but these are generally unlimited.

Migrating Old Job Scripts

If you have used BlueBEAR before, then you will have written your job scripts for the MOAB scheduler. To ease the transition to Slurm we have provided a script called moabtoslurm, which takes a script with MOAB headers (lines that begin with #MOAB) and writes the script to stdout with the #MOAB lines replaced by the equivalent #SBATCH lines; any other messages are written to stderr. The script is used in the following way:

$ moabtoslurm job.moab > job.slurm

There is a guidance page available with information on migrating old job scripts to Slurm and the equivalent SBATCH headers for common MOAB headers. That guidance page and moabtoslurm will be deprecated in the coming months.

It is also possible to continue using msub to submit these jobs on BlueBEAR. Just log in as normal and run msub as before; this will run a wrapper script that submits the job to Slurm. However, we advise converting your job scripts to Slurm for full functionality.

 


Last modified: 21 January 2019