Rocket is a cluster of computers that belongs to Newcastle University.
It is meant to be used for computations that are too big, too long or too numerous to be done on a single machine.
It is shared between many users and runs on Linux.
There are also some smaller HPC facilities, such as SAgE and Computer Science.
Setup
Registering for access
Access to Rocket is by membership of a project
- online project registration
- must be done by a member of staff
- access to shared project space and files
Access uses your NUIT username and password
- something like c1231212 and MyPassword01012000
Logging in
You will need to log in using a secure shell (ssh) connection via one of the methods listed below. The first time you log in to Rocket by any method, you will see a message asking if you are sure about connecting to this server, similar to this:
The authenticity of host 'rocket.hpc.ncl.ac.uk (128.240.216.45)' can't be established.
RSA key fingerprint is SHA256:b0dhlXfhFhh+vjkhS4lYg+06KjDyM6qe6jlwGh7vBzk.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
You can agree to continue the connection by typing yes.
This message should only appear the first time you log in.
Logging in from main campus
Windows:
- Terminal ssh:
  - Open a PowerShell terminal
  - Type ssh <user>@rocket.hpc.ncl.ac.uk, where user is your IT Service user name (e.g. nab789, b1234567)
  - Type your IT Service password when prompted
- Putty:
  - Launch Putty
  - Put rocket.hpc.ncl.ac.uk in the Host Name (or IP address) box, leaving all the other default options unchanged
  - Click on Open
  - Enter your IT Service user name (e.g. nab789, b1234567) and password when prompted
- MobaXterm:
  - Click on Session
  - In the Session window, select SSH
  - In the Remote host field, type rocket.hpc.ncl.ac.uk
  - Click on OK
Linux and Mac:
- Open a terminal
- Type ssh <user>@rocket.hpc.ncl.ac.uk, where user is your IT Service user name (e.g. nab789, b1234567)
- Type your IT Service password when prompted
Logging in from outside the main campus
Rocket is only directly accessible from campus machines.
From outside the campus, connect first to unix.ncl.ac.uk in the same ways as above, or by using RAS.
You can then connect to Rocket from the unix server as described above.
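If your local machine has a reasonably recent OpenSSH client, you can also do this in a single step with the -J (ProxyJump) option. This is a convenience suggestion rather than part of the official instructions:
# Jump through unix.ncl.ac.uk to reach Rocket in one command
# (requires OpenSSH 7.3 or later on your local machine)
ssh -J <user>@unix.ncl.ac.uk <user>@rocket.hpc.ncl.ac.uk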
Transferring files
From command line
The tool used to transfer files is called scp.
It should be present on all modern operating systems.
It can upload or download files to/from Rocket.
# Upload <local file> to Rocket, at the path specified by <remote file>.
# Relative paths on Rocket start at your home directory by default
scp <local file> <user>@rocket.hpc.ncl.ac.uk:<remote file>
# Download <remote file> from Rocket to your local machine at the path specified by <local file>.
# Relative paths on Rocket start at your home directory by default
scp <user>@rocket.hpc.ncl.ac.uk:<remote file> <local file>
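Typing the full host name every time can get tedious. As an optional convenience (not part of the official setup), you can define an SSH host alias on your local machine, assuming an OpenSSH-style client; the alias name rocket below is just an example:
# Contents to add to ~/.ssh/config on your local machine (create the file if it does not exist)
Host rocket
    HostName rocket.hpc.ncl.ac.uk
    User <user>
With this in place, ssh rocket and scp <local file> rocket:<remote file> will both work.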
WinSCP
On Windows, you can download and use the program WinSCP for an easier user experience.
Set the host name to rocket.hpc.ncl.ac.uk, then enter your username and password.
You can then type Ctrl-P, or click on New session in PUTTY on the top menu.
In the black pop-up window, enter your username and password again.
You will be presented with an interface that supports drag and drop of files from and to both your local machine and Rocket.
Logging out
To log out from an ssh session, simply use the exit command.
Generic information
Linux shell
The operating system on Rocket’s login and compute nodes is CentOS 7, which is a Linux distribution. Hence, the shell you will be shown when connected is a Linux shell; bash, to be precise.
If you have little or no experience with terminals, this could be intimidating. While a full course on the subject is outside the scope of this text, what follows is a brief introduction:
Command basics
There are some basic features that apply to most Linux commands.
- Linux is case-sensitive, so ABC is not the same as Abc. This applies to all of Linux, including filenames, commands, usernames and passwords.
- Commands are usually in lower case.
- After typing a command, press the Enter key to execute it.
- Many commands have options to extend or change the results. These are usually invoked with a hyphen (see also the example after this list), in the form:
command -option1 -option2
- An asterisk (*) will act as a wildcard, e.g. for filenames
- Use the tab key to auto-complete command names and file/directory names. Start typing the command until the remainder is likely to be unique, and then press the tab key to attempt to auto-complete the command.
- Use the up-arrow to return to a previous command.
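As a quick illustration of the points above, the following combines a command, an option and a wildcard (the .txt files are hypothetical examples):
# Long listing (-l option) of every file whose name ends in .txt
ls -l *.txt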
Most common commands
# Show the current path
# e.g. pwd => /mnt/nfs/home/<username>
pwd
# List the contents of your current directory
# e.g. ls => file1 file2
ls
# Create a subdirectory
# e.g. mkdir my_folder
mkdir <folder_name>
# Navigate to the indicated directory.
# If the path starts with a /, it is called absolute and starts from the root of the filesystem
# Otherwise it will be relative to the current directory
# The symbol ~ can be used to indicate the path /mnt/nfs/home/<username>
# e.g. cd /absolute/path/new/directory
# e.g. cd relative/path/new/directory
# e.g. cd ~/my_directory
cd <path>
# Show the content of a file all at once
# e.g. cat file1
cat <file>
# Show the content of a file with pagination. Useful for bigger files
# e.g. more file1
more <file>
Text editor
Rocket includes several terminal text editors out of the box, such as vi, vim, emacs and nano.
The last one, nano, is probably the easiest to use for a beginner.
To open a file in the editor, type:
nano <file>
To exit from nano, saving your changes, type Ctrl-X followed by Y.
Further options are shown at the bottom of the nano screen, where ^ indicates the Ctrl key.
If you need to know more details about nano or any other Linux command, try using the man command to look at the man pages (manual pages).
man nano
File space
- Home directory: /home/c1231212
  - 40GB quota
- Lustre storage: /nobackup
  - 500TB quota
  - Shared between all users, with no individual quota enforced
  - Files deleted after 3 months of inactivity
- Temp storage: /scratch
  - Accessible via the environment variable $TEMPDIR
  - A subdirectory will be created on /scratch for each job (see the example after the warning below)
  - Located on storage local to the node running the job, giving fast access and better performance for file read and write operations
Warning
Keep in mind that data is never backed up, regardless of where it is stored on Rocket.
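To take advantage of the fast per-job /scratch space, a common pattern is to copy input data into $TEMPDIR at the start of a job, work there, and copy the results back before the job ends. The sketch below assumes $TEMPDIR is set as described above; input.dat, my_program and the /nobackup results path are hypothetical placeholders:
#!/bin/bash
#SBATCH -A training
#SBATCH -t 00:30:00
# Copy the input into the fast per-job scratch directory
cp ~/input.dat $TEMPDIR/
cd $TEMPDIR
# Run the computation here so reads and writes hit node-local storage
my_program input.dat > output.dat
# Copy the results somewhere persistent before the job ends,
# since the per-job scratch area should not be relied on afterwards
cp output.dat /nobackup/<username>/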
Code of conduct
Since most resources are shared between all users, there are some rules to follow:
- Use storage considerately
- Use the login nodes considerately
- Use only the compute node types you need
- Construct your jobs carefully
- Be mindful of the software licenses
Modules
Since nobody is given privileged permissions, the way to access the software you want is to use the module system.
Rocket has an extensive set of applications available to users.
Compilers and most specialist applications on Rocket are installed as modules, which must be loaded before they can be used.
Use of modules helps avoid conflicts between different versions and applications and helps you to reproduce computations knowing that you are using exactly the same application.
# Full list of modules
module avail
# Search for a specific module
# e.g. module avail matlab
module avail <module>
# For more detailed information about a module
# e.g. module spider matlab
module spider <module>
# If you want to refine your search, maybe because there are too many results, you may use grep
# e.g. module --redirect avail R | grep '^ R'
module --redirect avail <module> | grep '^ <module>'
Once you have located the module you are interested in, you have to load it.
Be careful: as with most things on Linux, module names are case-sensitive.
Loading a module will also load all of its dependencies automatically.
If everything went well, you will have access to the software.
# Load the module
# e.g. module load R
module load <module>
# Show all the currently loaded modules
module list
# Remove a module from the loaded ones
# e.g. module unload R
module unload <module>
# Remove all loaded modules
module purge
SLURM
SLURM is the software in charge of scheduling and handling jobs on the whole Rocket cluster. Computation that require significant resources (CPU, memory, I/O) should not be run on the login nodes, but instead submitted to SLURM. SLURM adds them to the queue of jobs and runs them on a compute node when resources become available. SLURM is configured so that it aims to make access fair for all users.
Standard job
The most common way to create a SLURM job is to write a bash script (e.g. job.sh) and submit it to the scheduler. The following is a very simple script that can be run on Rocket.
#!/bin/bash
#
#SBATCH -A training
#SBATCH -t 00:05:00
#
module load Python
python --version
date
sleep 60
The first line tells the runner which shell to use (e.g. /bin/bash).
In general, lines starting with # are interpreted as comments and are ignored by the runner.
You can append any text you want on the same line, with one exception: lines starting with #SBATCH, like lines 3 and 4, are interpreted as SLURM directives.
All SLURM directives must come before the first executable command.
Any that follow will be treated as normal comments.
In this example the options are:
- #SBATCH -A <projectcode>: the id of the project this job belongs to
- #SBATCH -t <dd-hh:mm:ss>: the maximum time this job will be allowed to run for
The last four lines are a series of commands the job will execute. Namely:
- load the Python module
- print the version of python
- print the date
- sleep for 60 seconds
To submit a job to SLURM so that it runs on a compute node:
# Submit a job to SLURM.
# e.g. sbatch job.sh
sbatch <job>
Once it has been put on the queue, it is possible to query the status of the job with
# Show the status of all your jobs that have been enqueued today
sacct
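The default sacct output is fairly terse. If you want specific columns, you can request them explicitly; the field names below are standard sacct format fields and the job ID is only an example:
# Show selected columns for your recent jobs
sacct --format=JobID,JobName,Partition,State,Elapsed
# Restrict the output to a single job
# e.g. sacct -j 123456 --format=JobID,State,Elapsed
sacct -j <jobID> --format=JobID,State,Elapsed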
If you ran the executable commands from job.sh at the Rocket command prompt, any output would be displayed on the screen like this:
Python 3.6.3
Tue Jan 22 13:46:47 GMT 2019
When you submit a script using sbatch, it will run completely separately from your interactive login.
Any output that would normally be displayed on the screen will be written to a file instead, by default called slurm-<jobID>.out.
If a job fails, the reasons can often be found in this file.
When your job has completed, check the contents of the SLURM output file:
# List all the files that start with 'slurm'
ls slurm*
# Read the content of the file
more slurm-<jobID>.out
Arrays
SLURM allows you to submit a job array, which is a set of jobs that are very similar to each other.
Each is given a different index, accessible inside the script as the environment variable SLURM_ARRAY_TASK_ID.
#!/bin/bash
#
#SBATCH -A training
#SBATCH -t 00:05:00
#
# Submit an array of 10 jobs, with indices from 1 to 10 (inclusive)
#SBATCH --array=1-10
echo This is element ${SLURM_ARRAY_TASK_ID}
mkdir test_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
The --array option is used to specify the range of indices. It supports several forms:
# The array has indices from 5 to 15 (inclusive), limited to running 3 concurrent jobs
#SBATCH --array=5-15%3
# The array will contain 4 jobs with indices 1, 39, 47 and 94
#SBATCH --array=1,39,47,94
It is possible to have up to 10,000 separate jobs or array jobs with a total of 10,000 elements in the system. Array jobs can run on up to 528 cores at once.
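A common use of job arrays is to process a set of input files, using the array index to select which file each element works on. The sketch below is an illustration only: the data directory, the file naming scheme and the process_data program are hypothetical.
#!/bin/bash
#SBATCH -A training
#SBATCH -t 00:05:00
#SBATCH --array=1-10
# Each array element processes one input file, e.g. data/input_3.csv for index 3
INPUT=data/input_${SLURM_ARRAY_TASK_ID}.csv
process_data ${INPUT} > results_${SLURM_ARRAY_TASK_ID}.txt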
Message Passing Interface (MPI)
The Message Passing Interface (MPI) allows the creation of very large computations, sometimes spread over many nodes.
MPI applications run multiple instances of the same program, each with a separate memory allocation.
Each instance executes its own portion of the overall calculation and communicates with other instances using MPI messages.
For codes that use MPI only and do not use OpenMP, it is usual to run one MPI instance or task per core.
MPI programs are normally run using the mpirun command.
SLURM has an equivalent command, srun, that you should use on Rocket as it recognizes your SLURM resource allocation.
The following script will create 88 instances of the program, each with a separate memory allocation.
#!/bin/bash
#
# simple MPI job script for SLURM
#
#SBATCH -A training
#SBATCH -t 00:05:00
#SBATCH --ntasks=88
#SBATCH --ntasks-per-node=44
#
module load OpenMPI
srun mpi_example
Note
SLURM's default scheduling can cause MPI jobs to be spread across many nodes, which is not efficient.
Use the options above to make sure that this does not happen. For large jobs that fill entire nodes, like the one above, you could also consider the option --exclusive.
This ensures that the job does not share nodes with any other jobs, which can improve performance and ease scheduling for other large jobs.
When the job has finished, it is possible to use sacct to see how long the job took to execute.
Remember that SLURM reports on the job’s total elapsed time, including the overheads of starting and finishing the job.
Some codes use a combination of OpenMP and MPI. For example, an application might be designed to run 1 MPI task on each node, and use 44 OpenMP threads within each MPI task.
If your code has this type of hybrid parallelization, make sure that your job script sets OMP_NUM_THREADS even if you are not using the OpenMP features of your code.
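As an illustration of this kind of hybrid setup, the sketch below runs one MPI task per node with 44 OpenMP threads per task; hybrid_example is a hypothetical program and the task counts are only indicative:
#!/bin/bash
#
# hybrid MPI + OpenMP job sketch
#
#SBATCH -A training
#SBATCH -t 00:05:00
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=44
#
module load OpenMPI
# Tell OpenMP how many threads each MPI task may use
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun hybrid_example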
Interactive jobs
The login nodes can be used for light interactive work, but the login nodes are shared by many users and it is easy for one user to cause problems for others, e.g. if they run memory- or CPU-intensive jobs. If you do need to run something intensive interactively, SLURM allows you to run an interactive session on a compute node. The job can be specified on the command line with the normal SLURM options, and will start only if there is space for it.
# Launch an interactive job on a compute node
# e.g. srun -A training -t 00:05:00 -c 4 --pty bash
srun -A <projectid> -t <time> -c <cores> <executable>
It is also possible to start an interactive shell on a compute node. This can be useful for development work and testing.
When you have finished with an interactive shell, make sure you type exit to release the resources for other users.
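Putting this together, an interactive session might look something like the following; the project code and module are taken from the earlier examples and my_script.py is a hypothetical script:
# Request an interactive shell on a compute node
srun -A training -t 00:30:00 -c 4 --pty bash
# ...then, once the shell starts on the compute node:
module load Python
python my_script.py
# When you are done, release the resources
exit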
Snippets
Common SLURM options
# Set the project id
# e.g. #SBATCH -A training
#SBATCH -A <projectid>
# Type of nodes to use:
# NAME TYPE NODES MAX TIME DEFAULT TIME MAX MEMORY
# defq standard 528 cores 2 days 2 days 2.5 GB
# bigmem medium,large,XL 2 nodes 2 days(*) 2 days 11 GB
# short all 2 nodes 10 minutes 1 minute 2.5 GB
# long standard 2 nodes 30 days 5 days 2.5 GB
# power power 1 node 2 days 2 days 2.5 GB
# interactive all
# e.g. #SBATCH -p defq
#SBATCH -p <partition>
# Set the memory per node
# e.g. #SBATCH --mem=2G
#SBATCH --mem=<memory>
# Memory per core
# e.g. #SBATCH --mem-per-cpu=2G
#SBATCH --mem-per-cpu=<memory>
# Time limit in the format dd-hh:mm:ss
# e.g. #SBATCH -t 1-00:00:00
#SBATCH -t <time>
# Type of mail to send to the user when the status of the job changes
# TYPES: NONE, BEGIN, END, FAIL, REQUEUE, ALL
# e.g. #SBATCH --mail-type=ALL
#SBATCH --mail-type=<type>
# Number of cores per task
# e.g. #SBATCH -c 4
#SBATCH -c <cores>
# Number of tasks
# e.g. #SBATCH -n 4
#SBATCH -n <tasks>
# Number of tasks per node
# e.g. #SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-node=<tasks>
# Whether the job requires exclusive access to the node
#SBATCH --exclusive
# Array of jobs
# e.g. #SBATCH --array=1-10
#SBATCH --array=<start>-<end>
Bash commands
# Load python using miniconda
module load Miniconda3/4.9.2
# Find all files over a certain size (c = bytes, k = kilobytes, M = megabytes, G = gigabytes)
# e.g. find . -size +80c
find <path> -size +<size><unit>
# Combine multiple csv files into a single file, keeping only the header of the first file
# e.g. awk '(NR == 1) || (FNR > 1)' *.csv > results.csv
awk '(NR == 1) || (FNR > 1)' *.csv > results_$(date -u +%Y%m%d_%H%M%S).csv
# Check how many files are in a directory.
# e.g. ls smt2 -1 | wc -l
ls <path> -1 | wc -l
# Check the status of your queued and running jobs
# e.g. squeue -u c1231212
squeue -u <user>
# Cancel all your jobs
# e.g. scancel -u c1231212
scancel -u <user>
# Compress a directory into an archive, so it can be downloaded in one go
# e.g. tar -czvf results.tar.gz csv/*
tar -czvf <archive>.tar.gz <directory>/*