BlueBEAR Job Submission

Jobs on the cluster are under the control of a scheduling system (Slurm). The scheduling system is configured to offer an equitable distribution of resources over time to all users. The key means by which this is achieved are:

  • Jobs are scheduled according to the QOS (Quality of Service) and the resources that are requested. Information on how to request resources for your job is given below.
  • Jobs are not necessarily run in the order in which they are submitted.
  • Jobs requiring a large number of cores and/or a long walltime will have to queue until the requested resources become available. While the resources for a larger job are being freed up, the system runs smaller jobs that can fit in the available gaps; this is known as backfill. It is therefore beneficial to specify a realistic walltime for a job so that it can be fitted into these gaps.

Job Handling

Submitting a job

The command to submit a job is sbatch. This reads its input from a file. The job is submitted to the scheduling system, using the requested resources, and will run on the first available node(s) that are able to provide the resources requested. For example, to submit the set of commands contained in the file myscript.sh, use the command:

sbatch myscript.sh

The system will return a job number, for example:

Submitted batch job 55260

Slurm is aware of your current working directory when submitting the job so there is no need to manually specify it in the script.

Upon completion of the job, there will be two output files in the directory from which you submitted the job. These files, for job id 55260, are slurm-55260.out and slurm-55260.stats. The first of these contains the standard out and standard error output; the second contains information about the job from Slurm.
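
When a script needs the job number later (for example to pass it to scancel or squeue), it can be taken from the reply that sbatch prints. The following is a minimal sketch: submit_job is a hypothetical helper name, not part of Slurm or BlueBEAR, and it assumes the reply has the form shown above, with the job number as the last word.

```shell
# submit_job is a hypothetical helper, not part of Slurm or BlueBEAR.
# It passes all of its arguments to sbatch and prints only the job number.
submit_job() {
    local reply
    reply=$(sbatch "$@") || return 1
    # The reply looks like "Submitted batch job 55260"; strip everything
    # up to and including the last space to keep the final word.
    echo "${reply##* }"
}

# Example use (on the cluster):
#   JOB_ID=$(submit_job myscript.sh)
#   echo "Output will be written to slurm-${JOB_ID}.out"
```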

Cancelling a job

To cancel a queued or running job use the scancel command and supply it with the job ID that is to be cancelled. For example, to cancel the previous job:

scancel 55260

Monitoring Your Job

There are a number of ways to monitor the current status of your job. You can view what's going on by issuing any one of the following commands:

squeue

squeue -j 55260

squeue is Slurm's command for viewing the status of your jobs. This shows information such as the job's ID and name, the QOS used (the "partition", which will tell you the node type), the user that submitted the job, time elapsed and the number of nodes being used.

scontrol

scontrol show job 55260

scontrol is a powerful interface that provides detailed information about the status of your job. Its show command displays the details of a specific job.
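
If a script needs to pause until a job has left the queue, squeue can be polled at intervals. The following is a rough sketch only: wait_for_job is a hypothetical helper name, and the default 60-second interval is an assumption chosen to avoid querying the scheduler too often.

```shell
# wait_for_job is a hypothetical helper, not a Slurm command.
# Usage: wait_for_job JOB_ID [SECONDS_BETWEEN_CHECKS]
wait_for_job() {
    local job_id=$1
    local interval=${2:-60}
    # "squeue -h" omits the header line, so empty output means the job is
    # no longer queued or running.
    while [ -n "$(squeue -h -j "${job_id}" 2>/dev/null)" ]; do
        sleep "${interval}"
    done
    echo "Job ${job_id} is no longer in the queue"
}

# Example use (on the cluster):
#   wait_for_job 55260
```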

Job QOS

Job QOS is similar to the concept of 'job queues'. By default, there are two QOS to which you can submit jobs: bbdefault and bbshort. You may also have access to bblargemem or bbgpu. All shared QOS have a maximum job length (walltime) of 10 days, with the exception of bbshort, where it is 10 minutes. You can specify which QOS you want by adding the following line to your job script:

#SBATCH --qos bbshort

bbdefault

This is the default QOS if no --qos is provided in the job script. This QOS is made up of different types of node such as:

  • 40 cores with 190GB memory
  • 24 cores with 120GB memory
  • 20 cores with 120GB memory

You can request the amount of memory you want using either --mem or --mem-per-cpu. If neither of these options is specified in your job then --mem-per-cpu is set to 4096MB by default. You can specify these in megabytes (M), gigabytes (G) or terabytes (T), with the default unit being M if none is given.
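
As a worked illustration of the default, the sketch below estimates the total memory a job receives when neither option is given. It assumes one core per task, so that the number of cores equals the value of --ntasks; the variable names and the figure of 8 tasks are illustrative only.

```shell
NTASKS=8                  # value given to --ntasks (example figure)
DEFAULT_MEM_PER_CPU=4096  # default --mem-per-cpu in MB, as stated above

# Total memory is the per-core default multiplied by the number of cores.
TOTAL_MB=$(( NTASKS * DEFAULT_MEM_PER_CPU ))
TOTAL_GB=$(( TOTAL_MB / 1024 ))
echo "${NTASKS} cores x ${DEFAULT_MEM_PER_CPU}MB = ${TOTAL_MB}MB (${TOTAL_GB}GB)"
```

For a job that runs on a single node, the same total could be requested explicitly with a line such as #SBATCH --mem 32G.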

bbshort

This QOS contains all nodes in the cluster and is the fastest way to get your job to run. The maximum walltime is 10 minutes.

bblargemem

This QOS contains a mixture of large memory nodes which are available if your job requires a larger amount of memory on one node. Please see the Large Memory Service page for more details on these nodes.

bbgpu

This QOS contains a mixture of GPU nodes which are available if your job requires a GPU. Please see the GPU Service page for more details on these nodes.

Dedicated Resources

Some research groups have dedicated resources in BlueBEAR. Those users who can submit jobs to a dedicated QOS can see what jobs are running in the dedicated QOS by using the following command with <name> replaced with the name of your QOS:

        view_qos <name>

CaStLeS Resources

The University has invested substantial funds in an initiative called CaStLeS. As part of this there is a dedicated QOS for CaStLeS. These resources are reserved exclusively for the use of research groups carrying out research in the life sciences and governed by academics through the CaStLeS Executive and Strategic Oversight Group.

Resource Limits

For bbdefault, bblargemem, and bbgpu the maximum walltime is 10 days per job. For bbshort the maximum walltime is 10 minutes. Each user is limited to 400 cores or 4TB of memory per shared QOS. This is summed across all running jobs. There are no additional limits on the sizes of individual jobs. If any limit is exceeded, any future jobs (for that QOS) will remain queued until the usage falls below these limits again.

Different limits may be set on QOS relating to user-owned resources, but these are generally unlimited.

Job Script Options


An example job script

This example is saved as job.sh:

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 5:0
#SBATCH --qos bbshort
#SBATCH --mail-type ALL

set -e

module purge; module load bluebear
module load apps/matlab/2020a

matlab -nodisplay -r cvxtest

  • #!/bin/bash - run the job using GNU Bourne Again Shell (the same shell as the logon nodes).
  • #SBATCH --ntasks 1 - run the job with one core.
  • #SBATCH --time 5:0 - run the job with a walltime of 5 minutes.
  • #SBATCH --qos bbshort - run the job in the bbshort QOS. See the Job QOS section for more details.
  • #SBATCH --mail-type ALL - Notify a user about all events regarding the job.
  • set -e - This will make your script fail on the first error. This is recommended as early errors can easily be missed. Note that "module load" commands will not trigger a failure.
  • module purge; module load bluebear - This line is required for reproducibility of results because it makes sure that only the default BlueBEAR modules are loaded at this point in the script. This will ignore any modules loaded in files like .bashrc. Include this line before any other module load statements in job scripts.
  • module load apps/matlab/2020a - Loads the Matlab 2020a module into the environment.
  • matlab -nodisplay -r cvxtest - The command to run the Matlab example.

If you wish to run this example, submit it with:

sbatch job.sh

The above is a simple example; the options and commands can be as complex as necessary. All of the options that can be set are listed in Slurm's documentation for sbatch. You can supply any of these options as command line arguments, but it is recommended that you put all of your options in your job script for ease of reproducibility.

Associate Jobs with Projects

Every job has to be associated with a project to ensure the equitable distribution of resources. Project owners and members will have been issued a project code for each registered project, and only usernames authorised by the project owner will be able to run jobs using that project code. You can see what projects you are a member of by running the command:

my_bluebear

If you are registered on more than one project then the project to use should be specified using the --account option followed by the project code. For example, if your project is project-name then add the following line to your job script:

#SBATCH --account=project-name

If a job is submitted using an invalid project, either because the project does not exist, the username is not authorised to use that project, or the project does not have access to the requested QOS, then the job will be rejected with the following error:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Processor Types

BlueBEAR is made up of nodes with different processors. By default, your job will be allocated to use any of the available processor types, but a single job will not be split across multiple different processor types. It is not normally necessary to choose a processor type, but if your job has particular performance or processor instruction needs (see below) then you may wish to do so.

Cascade Lake

In 2019 we launched water-cooled nodes with Cascade Lake CPUs:

#SBATCH --constraint cascadelake

Haswell/Broadwell

In 2017 we launched our water-cooled nodes with Haswell and Broadwell CPUs. Haswell and Broadwell use the same micro-architecture and so are both available as "haswell" in BlueBEAR:

#SBATCH --constraint haswell

Use Local Disk Space

If a job uses significant I/O (Input/Output) then files should be created using the local disk space and only written back to the final directory when the job is completed. This is particularly important if a job makes heavy use of disk for scratch or work files. Heavy I/O to the network filestores such as the home directories or /rds can cause problems for other users. There is a directory /scratch which is local to each node that can be used for temporary files that are associated with a job.

For jobs that are running on a single node, this filestore can also be used for input and output files for a job. Since /scratch is not shared across nodes it cannot be used for parallel jobs that use multiple nodes where all of the processes need to be able to read from or write to a shared filestore; but it can be used for multi-core jobs on a single node.

To use scratch space, include the following lines at the start of your job script (after the #SBATCH headers):

BB_WORKDIR=$(mktemp -d /scratch/${USER}_${SLURM_JOBID}.XXXXXX)
export TMPDIR=${BB_WORKDIR}

These two lines create a directory named after your username (USER) and the ID of the job you are running (SLURM_JOBID), followed by a period and six random characters. The random suffix is recommended for security reasons. The directory is then exported to the environment as TMPDIR, which many applications use for temporary storage while they are running. If you want to copy your files back to the current directory at the end, insert this line:

test -d ${BB_WORKDIR} && /bin/cp -r ${BB_WORKDIR} .

This checks that the directory still exists (using test -d) then copies it and everything in it (using -r, for recursive copy) to the current directory (.) if it does. Then, at the end of your job script, insert the line:

test -d ${BB_WORKDIR} && /bin/rm -rf ${BB_WORKDIR}

This checks that the directory still exists (using test -d) then removes it if it does.
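
The pieces above can be combined into one job script. The following is a sketch under stated assumptions rather than the documented method: it uses a trap so that the clean-up also runs if the script stops early under set -e, it falls back to /tmp when /scratch is missing (only so that the sketch can be tried away from the cluster), and the file output.txt stands in for the real work. SCRATCH_ROOT and cleanup are illustrative names.

```shell
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 5:0
#SBATCH --qos bbshort

set -e

# Assumption: fall back to /tmp when /scratch is not available, purely so
# that this sketch can be tried away from the cluster.
SCRATCH_ROOT=/scratch
if [ ! -d "${SCRATCH_ROOT}" ] || [ ! -w "${SCRATCH_ROOT}" ]; then
    SCRATCH_ROOT=/tmp
fi

# Create the per-job working directory and use it for temporary files.
BB_WORKDIR=$(mktemp -d "${SCRATCH_ROOT}/${USER}_${SLURM_JOBID}.XXXXXX")
export TMPDIR=${BB_WORKDIR}

# Remove the working directory whenever the script exits, including when
# "set -e" stops it early.
cleanup() {
    if [ -d "${BB_WORKDIR}" ]; then
        /bin/rm -rf "${BB_WORKDIR}"
    fi
}
trap cleanup EXIT

# Placeholder for the real work (hypothetical): write a file to scratch.
echo "example result" > "${BB_WORKDIR}/output.txt"

# Copy the results back to the directory the job was submitted from.
test -d "${BB_WORKDIR}" && /bin/cp -r "${BB_WORKDIR}" .
```

Registering the clean-up with trap means the removal step does not depend on the script reaching its last line.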

Multi-core Jobs

There are several ways to run a multi-core job using Slurm but the methods we recommend are to use the options --ntasks and --nodes. For most people, it is enough to specify only --ntasks. This is because Slurm is sophisticated enough to take your job and spread it across as many nodes as necessary to meet the number of cores that the job requires. This is practical for MPI jobs and means that the cluster is used more efficiently (as multiple users can share nodes). For example, adding the following line will request a job with 20 cores on an undefined number of nodes:

#SBATCH --ntasks 20

For jobs that require multiple cores but must remain on a certain number of nodes (e.g. OpenMP jobs which can only run on a single node), the option --nodes can be specified with a minimum and maximum number of nodes. For example, the first example here specifies that between 3 and 5 nodes should be used; and the second example that a single node should be used:

#SBATCH --nodes 3-5
#SBATCH --nodes 1

The environment variable SLURM_NTASKS is set to the number of tasks requested. For single node, multi-core jobs this can be used directly to make the software fit the resources assigned to the job. For example:

./my_application --my-threads=${SLURM_NTASKS}
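
For OpenMP programs, a common pattern is to pass the same value through the standard OMP_NUM_THREADS environment variable. This is a minimal sketch, assuming the application honours OMP_NUM_THREADS; the THREADS variable and the fallback of 1 are assumptions added so that the lines also work when SLURM_NTASKS is not set.

```shell
# Use the number of tasks Slurm allocated, or 1 if SLURM_NTASKS is unset
# (for example when testing the script outside a job).
THREADS=${SLURM_NTASKS:-1}
export OMP_NUM_THREADS=${THREADS}
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```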

It is important to note that not all jobs scale well over multiple cores: doubling the number of cores will often not double the speed. Software must be written to use extra cores, so make sure the application you wish to run can make use of the cores you request. If you are not certain, submit some short runs of your job and check whether they perform as well as expected.

Note that requesting more cores will probably mean that you need to queue for longer before the job can start. Also, using the --nodes option will often lengthen the queue time.

An efficient way to parallelise your work without needing to know whether your application supports multi-threading is to use array jobs. See the following section for more details on this.

Array Jobs

Array jobs are an efficient way of submitting many similar jobs that perform the same work using the same script but on different data. Sub-jobs are the jobs created by an array job and are identified by an array job ID and an index. For example, in the identifier 55620_1, the number 55620 is the array job ID and 1 is the sub-job index.

Example array job

#!/bin/bash
#SBATCH --time 5:0
#SBATCH --qos bbshort
#SBATCH --array 2-5

set -e
module purge; module load bluebear
echo "${SLURM_JOB_ID}: Job ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX} in the array"

In Slurm, there are different environment variables that can be used to dynamically keep track of these identifiers.

  • #SBATCH --array 2-5 tells Slurm that this job is an array job and that it should run 4 sub-jobs (with indices 2, 3, 4 and 5). You can specify up to 4,096 array tasks in a single job (e.g. --array 1-4096).
  • SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array, so in the example this will be 4.
  • SLURM_ARRAY_TASK_ID will be set to the job array index value, so in the example there will be 4 sub-jobs, each with a different value (from 2 to 5).
  • SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value, so in the example this will be 2.
  • SLURM_ARRAY_TASK_MAX will be set to the highest job array index value, so in the example this will be 5.
  • SLURM_ARRAY_JOB_ID will be set to the job ID provided by running the sbatch command.

The above job is similar to the one that can be found in the Job Script Options section above. Visit that section for a detailed explanation of the workings of a script similar to the one above. Visit the Job Array Support section of the Slurm documentation for more details on how to carry out an array job.
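
A typical use of SLURM_ARRAY_TASK_ID is to select a different input for each sub-job. The sketch below shows one possible approach, assuming a text file that lists one input file per line. The helper pick_input, the file inputs.txt and the program my_application are hypothetical names used for illustration.

```shell
# pick_input is a hypothetical helper: print line INDEX of LIST_FILE.
# Usage: pick_input LIST_FILE INDEX
pick_input() {
    sed -n "${2}p" "${1}"
}

# Example use inside an array job script, assuming a file inputs.txt
# (hypothetical) that lists one input file per line:
#   INPUT=$(pick_input inputs.txt "${SLURM_ARRAY_TASK_ID}")
#   ./my_application "${INPUT}"
```

With --array 2-5 the sub-jobs would read lines 2 to 5 of the list, so an array starting at 1 is usually the more natural choice for this pattern.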
