Jobs on the cluster are under the control of a scheduling system (Slurm). The scheduling system is configured to offer an equitable distribution of resources over time to all users. The key means by which this is achieved are:
- Jobs are scheduled according to the QOS (Quality of Service) and the resources that are requested. Information on how to request resources for your job is detailed below.
- Jobs are not necessarily run in the order in which they are submitted.
- Jobs requiring a large number of cores and/or a long walltime will have to queue until the requested resources become available. The system will run smaller jobs that can fit in the available gaps until all of the resources requested for the larger job become available - this is known as backfill. It is therefore beneficial to specify a realistic walltime for a job so that it can be fitted into these gaps.
Job Handling
Submitting a job
The command to submit a job is sbatch. This reads its input from a file. The job is submitted to the scheduling system, using the requested resources, and will run on the first available node(s) able to provide those resources. For example, to submit the set of commands contained in the file myscript.sh, use the command:
sbatch myscript.sh
The system will return a job number, for example:
Submitted batch job 55260
Slurm is aware of your current working directory when submitting the job so there is no need to manually specify it in the script. There is an example job script further down this page.
Upon completion of the job, there will be two output files in the directory from which you submitted the job. These files, for job id 55260, are slurm-55260.out and slurm-55260.stats. The first of these contains the standard out and standard error output; the second contains information about the job from Slurm.
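For example, once the job above has finished you could inspect these files with something like the following (the job ID will differ for your own jobs):
cat slurm-55260.out     # standard output and standard error from the job
cat slurm-55260.stats   # job information reported by Slurm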
Cancelling a job
To cancel a queued or running job use the scancel command and supply it with the job ID that is to be cancelled. For example, to cancel the previous job:
scancel 55260
Monitoring Your Job
There are a number of ways to monitor the current status of your job. You can view what's going on by issuing any one of the following commands:
squeue
squeue -j 55260
squeue is Slurm's command for viewing the status of your jobs. This shows information such as the job's ID and name, the QOS used (the "partition", which will tell you the node type), the user that submitted the job, the time elapsed and the number of nodes being used.
scontrol
scontrol show job 55260
scontrol is a powerful interface that provides a detailed view of the status of your job. The show command within scontrol can be used to view details of a specific job.
Job QOS
Job QOS is similar to the concept of 'job queues'. By default, there are two QOS to which you can submit jobs. These are: bbdefault and bbshort. You may also have access to bblargemem or bbgpu. All shared QOS have a maximum job length (walltime) of 10 days, with the exception of bbshort where it is 10 minutes. You can specify which QOS you want by adding the following line to your job script:
#SBATCH --qos bbshort
bbdefault
This is the default QOS if no --qos option is provided in the job script. This QOS is made up of different types of node, such as:
- 72 cores with 500GB memory (IceLake)
- 40 cores with 190GB memory (CascadeLake)
- 20 cores with 120GB memory (Broadwell)
You can request the amount of memory you want using either --mem or --mem-per-cpu. If neither of these two options is specified in your job then --mem-per-cpu is set to 4096MB by default. You can specify these in megabytes (M), gigabytes (G) or terabytes (T), with the default unit being M if none is given.
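For example, either of the following lines could be added to a job script; use one option or the other, not both in the same script, and treat the values as purely illustrative rather than recommendations:
# Request a total amount of memory for the whole job (illustrative value):
#SBATCH --mem 16G
# ...or request an amount of memory per requested core instead:
#SBATCH --mem-per-cpu 4G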
bbshort
This QOS contains all nodes in the cluster and is the fastest way to get your job to run. The maximum walltime is 10 minutes.
bblargemem
This QOS, pre-2022, contained a mixture of large memory nodes that were available if your job required a larger amount of memory on one node. When the Intel Ice Lake nodes were added the bblargemem QOS was retired. Please see the Large Memory Service page for more details on requesting more memory for a job.
bbgpu
This QOS contains a mixture of GPU nodes which are available if your job requires a GPU. Please see the GPU Service page for more details on these nodes.
Dedicated Resources
Some research groups have dedicated resources in BlueBEAR. Those users who can submit jobs to a dedicated QOS can see what jobs are running in that QOS by using the following command, with <name> replaced with the name of your QOS:
view_qos <name>
CaStLeS Resources
The University has invested substantial funds in an initiative called CaStLeS. As part of this there is a dedicated QOS for CaStLeS. These resources are reserved exclusively for the use of research groups carrying out research in the life sciences and governed by academics through the CaStLeS Executive and Strategic Oversight Group.
Resource Limits
For bbdefault and bbgpu the maximum walltime is 10 days per job. For bbshort the maximum walltime is 10 minutes. Each user is limited to 576 cores or 4TB of memory per shared QOS. This is summed across all running jobs. There are no additional limits on the sizes of individual jobs. If any limit is exceeded, any future jobs (for that QOS) will remain queued until the usage falls below these limits again.
Different limits may be set on QOS relating to user-owned resources, but these are generally unlimited.
Job Script Options
An example job script
This example script is saved as job.sh:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 5:0
#SBATCH --qos bbshort
#SBATCH --mail-type ALL
set -e
module purge; module load bluebear
module load MATLAB/2020a
matlab -nodisplay -r cvxtest
#!/bin/bash
- run the job using GNU Bourne Again Shell (the same shell as the logon nodes).
#SBATCH --ntasks 1
- run the job with one core.
#SBATCH --time 5:0
- run the job with a walltime of 5 minutes.
#SBATCH --qos bbshort
- run the job in the bbshort QOS. See the Job QOS section for more details.
#SBATCH --mail-type ALL
- Send the user email notifications for all events regarding the job (such as when it begins, ends or fails).
set -e
- This will make your script fail on the first error. This is recommended as early errors can easily be missed.
module purge; module load bluebear
- This line is required for reproducibility of results because it makes sure that only the default BlueBEAR modules are loaded at this point in the script. This will ignore any modules loaded in files like .bashrc. Include this line before any other module load statements in job scripts.
module load MATLAB/2020a
- Loads the MATLAB 2020a module into the environment.
matlab -nodisplay -r cvxtest
- The command to run the MATLAB example.
If you wish to run this example, save the script above as job.sh and submit it with:
sbatch job.sh
The above is a simple example; the options and commands can be as complex as necessary. All of the options that can be set are described in Slurm's documentation for sbatch. You can add any of these options as command line arguments, but it is recommended that you put all of your options in your job script for ease of reproducibility.
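For example, the same options as in the script above could be passed on the command line instead, although keeping them in the script makes the job easier to reproduce later:
sbatch --qos bbshort --time 5:0 --ntasks 1 job.sh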
Modules
Software on BlueBEAR is available through modules. In a job it is necessary to load the modules for an application before that application is available. In the MATLAB example above, this can be seen with the MATLAB module being loaded first (module load MATLAB/2020a) and then MATLAB being run (matlab -nodisplay -r cvxtest).
If an attempt to load conflicting modules is made then you will receive an error message of:
Lmod has detected the following error:
The module load command you have run has failed, as it would result in an incompatible mix of modules.
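If you see this error, a reasonable approach is to check what is currently loaded and start again from a clean environment before loading the modules your job needs (the MATLAB module below is just the example used earlier on this page):
module list                         # show the modules currently loaded
module purge; module load bluebear  # return to the default BlueBEAR environment
module load MATLAB/2020a            # then load the modules your job needs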
Associate Jobs with Projects
Every job has to be associated with a project to ensure the equitable distribution of resources. Project owners and members will have been issued a project code for each registered project, and only usernames authorised by the project owner will be able to run jobs using that project code. You can see what projects you are a member of by running the command:
my_bluebear
If a user is registered on more than one project then the project to use should be specified using the --account option followed by the project code. For example, if your project is project-name then add the following line to your job script:
#SBATCH --account=project-name
If a job is submitted using an invalid project, either because the project does not exist, the username is not authorised to use that project, or the project does not have access to the requested QOS, then the job will be rejected with the following error:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
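Putting this together, the top of a job script for a user on more than one project might look like the following sketch, where project-name is a placeholder for one of the codes reported by my_bluebear and the other values are illustrative:
#!/bin/bash
#SBATCH --account=project-name
#SBATCH --qos bbdefault
#SBATCH --ntasks 1
#SBATCH --time 30:0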
Processor Types
BlueBEAR is made up of nodes with different processors. By default, your job will be allocated to any of the available processor types, but a single job will not be split across multiple different processor types. It is not normally necessary to choose a processor type, but if your job has particular performance or processor instruction needs (see below) then you may wish to do so.
Ice Lake
In 2021 we launched water-cooled nodes with Ice Lake CPUs:
#SBATCH --constraint=icelake
Each of these nodes has 2x 36 core Intel® Xeon® Platinum 8360Y and 500GB memory available for jobs.
Cascade Lake
In 2019 we launched water-cooled nodes with Cascade Lake CPUs:
#SBATCH --constraint=cascadelake
Each of these nodes has 2x 20 core Intel® Xeon® Gold 6248 and 190GB memory available for jobs.
Broadwell
In 2017 we launched our water-cooled nodes with Broadwell CPUs. Broadwell uses the same micro-architecture as Haswell and so both are available as "haswell" in BlueBEAR:
#SBATCH --constraint=haswell
Each of the Broadwell nodes has 2x 10 core Intel® Xeon® Processor E5-2640 v4 and 120GB memory available for jobs.
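Slurm also allows features to be combined in a single --constraint option. For example, the following line (a sketch using the feature names listed above; check the sbatch documentation for the full constraint syntax) restricts a job to either of the two newer processor types:
#SBATCH --constraint="icelake|cascadelake"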
Use Local Disk Space
If a job uses significant I/O (Input/Output) then files should be created using the local disk space and only written back to the final directory when the job is completed. This is particularly important if a job makes heavy use of disk for scratch or work files. Heavy I/O to network filestores such as the home directories or /rds can cause problems for other users. There is a directory /scratch, local to each node, that can be used for temporary files associated with a job.
For jobs that are running on a single node, this filestore can also be used for input and output files. Since /scratch is not shared across nodes it cannot be used for parallel jobs that use multiple nodes, where all of the processes need to be able to read from or write to a shared filestore; but it can be used for multi-core jobs on a single node.
To use scratch space, include the following lines at the start of your job script (after the #SBATCH headers):
BB_WORKDIR=$(mktemp -d /scratch/${USER}_${SLURM_JOBID}.XXXXXX)
export TMPDIR=${BB_WORKDIR}
These two lines will create a directory for you named after your username (USER) and the ID of the job you are running (SLURM_JOBID), followed by a period and six random characters. The use of random characters is recommended for security reasons. This path is then exported to the environment as TMPDIR. Many applications use TMPDIR for temporary storage while they are running. If you want to copy your files back to the current directory at the end of the job, insert this line:
test -d ${BB_WORKDIR} && /bin/cp -r ${BB_WORKDIR} .
This checks that the directory still exists (using test -d) and, if it does, copies it and everything in it (using -r, for recursive copy) to the current directory (.). Then, at the end of your job script, insert the line:
test -d ${BB_WORKDIR} && /bin/rm -rf ${BB_WORKDIR}
This checks that the directory still exists (using test -d) and removes it if it does.
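Putting these pieces together, a job script that stages its temporary files on local scratch might look like the following sketch (the application command is a placeholder and the resource requests are illustrative):
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 60:0
#SBATCH --qos bbdefault
set -e
module purge; module load bluebear

# Create a per-job directory on the node-local scratch disk
BB_WORKDIR=$(mktemp -d /scratch/${USER}_${SLURM_JOBID}.XXXXXX)
export TMPDIR=${BB_WORKDIR}

# Run your application here; it can write its work files to ${TMPDIR}
# ./my_application --input data.in    # placeholder command

# Copy any results back to the submission directory, then tidy up
test -d ${BB_WORKDIR} && /bin/cp -r ${BB_WORKDIR} .
test -d ${BB_WORKDIR} && /bin/rm -rf ${BB_WORKDIR}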
Multi-core Jobs
There are several ways to run a multi-core job using Slurm, but the methods we recommend are the options --ntasks and --nodes. For most people it is enough to specify only --ntasks, because Slurm is sophisticated enough to spread your job across as many nodes as are needed to provide the number of cores requested. This is practical for MPI jobs and means that the cluster is used more efficiently (as multiple users can share nodes). For example, adding the following line will request a job with 20 cores on an undefined number of nodes:
#SBATCH --ntasks 20
For jobs that require multiple cores but must remain on a certain number of nodes (e.g. OpenMP jobs, which can only run on a single node), the option --nodes can be specified with a minimum and maximum number of nodes. Of the two examples below, the first specifies that between 3 and 5 nodes should be used and the second that a single node should be used:
#SBATCH --nodes 3-5
#SBATCH --nodes 1
The environment variable SLURM_NTASKS is set to the number of tasks requested. For single-node, multi-core jobs this can be used directly to make the software fit the resources assigned to the job. For example:
./my_application --my-threads=${SLURM_NTASKS}
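For instance, a single-node OpenMP program is commonly pointed at the allocated cores like this (a sketch; my_openmp_program is a placeholder for your own application):
export OMP_NUM_THREADS=${SLURM_NTASKS}
./my_openmp_program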
It is important to note that not all jobs scale well over multiple cores: doubling the number of cores will often not double the speed. Software must be written to use extra cores, so make sure the application you wish to run can make use of the cores you request. If you are not certain, submit some short runs of your jobs and check whether they perform as well as they should.
Note that requesting more cores will probably mean that you need to queue for longer before the job can start. Using the --nodes option will also often lengthen the queue time.
An efficient way to parallelise your work without needing to know whether your application supports multi-threading is to use array jobs. See the following section for more details on this.
Array Jobs
Array jobs are an efficient way of submitting many similar jobs that perform the same work using the same script but on different data. Sub-jobs are the jobs created by an array job and are identified by an array job ID and an index. For example, in the identifier 55620_1, the number 55620 is the job array ID and 1 is the index of the sub-job.
Example array job
#!/bin/bash
#SBATCH --time 5:0
#SBATCH --qos bbshort
#SBATCH --array 2-5
set -e
module purge; module load bluebear
echo "${SLURM_JOB_ID}: Job ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX} in the array"
In Slurm, there are different environment variables that can be used to dynamically keep track of these identifiers.
#SBATCH --array 2-5
tells Slurm that this job is an array job and that it should run 4 sub-jobs (with IDs 2, 3, 4 and 5). You can specify up to 4,096 array tasks in a single job (e.g. --array 1-4096).
SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array, so in the example this will be 4.
SLURM_ARRAY_TASK_ID will be set to the job array index value, so in the example there will be 4 sub-jobs, each with a different value (from 2 to 5).
SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value, so in the example this will be 2.
SLURM_ARRAY_TASK_MAX will be set to the highest job array index value, so in the example this will be 5.
SLURM_ARRAY_JOB_ID will be set to the job ID provided by running the sbatch command.
The above job is similar to the one that can be found in the Job Script Options section above. Visit that section for a detailed explanation of the workings of a script similar to the one above. Visit the Job Array Support section of the Slurm documentation for more details on how to carry out an array job.
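As a sketch of how the index might be used to work on different data, the lines below (with hypothetical file names and a placeholder application) could replace the echo command in the example script so that each sub-job processes its own input file:
# Hypothetical input files named data_2.txt to data_5.txt, one per sub-job
INPUT_FILE="data_${SLURM_ARRAY_TASK_ID}.txt"
./my_application "${INPUT_FILE}"    # placeholder command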