The HEC login node acts as the interface between you and the HEC proper (the cluster of compute nodes). Compute-intensive applications must not be run directly on the login node; instead they must be submitted from the login node to the SGE (Son of Grid Engine) job scheduling system as jobs. The two basic types of job are Batch and Interactive, submitted via the qsub and qlogin commands described below. Computationally intensive and/or large-memory jobs must NOT be run on the login node itself: it has limited resources and should be used only for job submission, pre- and post-processing of job data, and compilation.

If you need to test such jobs before submitting them, please use either the test queue or an interactive job session.




 Batch jobs

A batch job is one which can be run without user intervention (i.e. it does not require any input from the keyboard and does not send any output to the user's screen). Typically a batch job will read any input it needs from a pre-written file and send its output to files in the user's directory. The exact method of doing this depends on the application.


Batch jobs are run on the HEC by creating a batch job script (or command file) and submitting it to the system using the command qsub. For example:

qsub my_program.com


Assuming that there is at least one job-slot free on the cluster, the job scheduler will select a compute node on which to run your job. This ensures that the combined load of all users' jobs is spread evenly over the entire cluster without overloading any one resource. If no suitable job slot is available at the time then the job will be held "queued and waiting" until one becomes free. To see how busy the HEC is, use the qslots command, which reports the number of available job slots.

At present, the system uses a Fair Share scheduling strategy: users may submit any number of jobs, but jobs over a certain number will be held waiting, with priority given to users who are currently running fewer jobs. Please check the login node's message of the day for changes to scheduling.

The majority of compute nodes on the HEC have 64 gigabytes of memory, and can run 16 single-core (serial) jobs simultaneously. However, if your job requires more than 1/2 gigabyte of memory, you are required to submit your jobs with a memory resource request to allow the scheduler to assign jobs to compute nodes without the risk of memory oversubscription — see directions on running large memory jobs.
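
For example, a job expecting to need up to 8 gigabytes could include a memory request line in its job script (job scripts are described below). The h_vmem resource name shown here is the common Grid Engine convention and is only an illustration; check the directions on running large memory jobs for the exact resource name and syntax to use on the HEC.

#$ -l h_vmem=8G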


 Example of a batch job script
#$ -S /bin/bash

#$ -q serial
#$ -N myjobname

source /etc/profile

echo Job running on compute node `uname -n` 

Explanation

Batch job scripts are simply standard shell scripts with extra lines (beginning with #$) containing instructions for the scheduler. The first line:

#$ -S /bin/bash

Instructs the scheduler to run the job using the bash shell. This is strongly recommended as best practice — all job templates and examples on these pages are written using bash.

The next line:

#$ -q serial

Directs the job to the serial queue. This queue is intended for running single-core jobs, and should be the default queue for most jobs. Different queues exist to support more advanced job types, such as parallel jobs or those which require specific node types. These are covered in the Advanced Job Submissions section on the main HEC Help page.

The next line:

#$ -N myjobname

Sets a name for your job, so that you can easily identify it while it's running. The name will also be used to create the job output files (see below).

The final job setup line reads:

source /etc/profile

This will set up the bash shell environment of the job so that it matches the functionality you see on the login node.

Once the batch job environment has been specified, subsequent lines should contain the commands needed to run your job. The job will effectively run as a shell script, processing any of the usual commands permitted by the specified shell. The example command above is:

echo Job running on compute node `uname -n`

which simply prints a short message to say which compute node the job was run on. See the Software section of these web pages for templates of job scripts for popular packages.
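
As a slightly fuller sketch, the script below runs a user's own program rather than just echoing a message; my_program, input.dat and results.txt are placeholder names, and the paths should be adjusted to suit your own job.

#$ -S /bin/bash
#$ -q serial
#$ -N myanalysis

source /etc/profile

# Run the program, reading a pre-written input file and sending
# the results to an output file (all names are placeholders).
./my_program < input.dat > results.txt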




 Job submission

A batch job script is submitted for running by the qsub command. The script above could be run by typing:

qsub my_program.com


Once the job is submitted, you should see a response like this displayed on the screen:

Your job 154 ("myjobname") has been submitted


The number given is the job number — a unique ID to allow you to identify your job among the hundreds running on the cluster. The progress of your job(s) can be monitored with the qstat command. For more details, see Monitoring jobs on the HEC.
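
For example, after submitting the job above you could check its status with:

qstat

or ask for full details of a single job by quoting its job number (154 in the example above):

qstat -j 154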

Once the job completes, the output will be placed in two files in the same directory from which the job was submitted. The files are named after the job name (the -N directive), followed by o (for standard output) or e (for standard error) and the job number. For example, the job script above will create two output files, myjobname.o154 and myjobname.e154.
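
For example, once job 154 has finished, its output can be viewed from the submission directory with standard shell commands:

cat myjobname.o154
cat myjobname.e154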

If for any reason you wish to cancel a job, perhaps because it is giving the wrong output or because you submitted it by mistake, you can do so with the command qdel. It takes as its argument the job ID provided when you first submitted the job (which is also displayed by qstat). So to kill the job submitted in the above example, with job ID 154, you would enter:

qdel 154




 Interactive jobs

While batch jobs are the most efficient type of job to submit, some applications require regular user input, making them unsuitable for batch submission. In such cases, jobs can be submitted interactively, giving you a command-line shell on a compute node with sufficient free resources to run your application. You can submit an interactive job with the following command:

qlogin


If the interactive job request can be satisfied you will receive a response like this:

wayland% qlogin
Your job 6220611 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 6220611 has been successfully scheduled.
Establishing /usr/shared_apps/packages/sge-8.1.8-1/default/qlogin_wrapper session to host comp06-03.private.dns.zone ...
Last login: Mon Aug 10 14:39:45 2015 from wayland.private.dns.zone
comp06-03>


The prompt comp06-03> indicates that you now have a login session on a compute node (in this example comp06-03).

Don't forget to log out from your interactive session when you have finished your tasks: your job slot and any resources it reserves are not available to anyone else until you do so.
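
To end the session, simply exit the shell in the usual way and you will be returned to the login node:

comp06-03> exit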

If the cluster is busy and no free slots are available that meet your requirements, you will see the following error message:

waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4 Your "qlogin" request could not be scheduled, try again later.


If you wish your interactive job request to wait until a free slot becomes available, add the argument -now no to the qlogin command. You can cancel this at any time by pressing Control-C.
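
For example:

qlogin -now no

This form of the request will remain queued until a free slot becomes available, rather than failing after a few seconds.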




 The test queue

The test queue exists to allow quick-turnaround testing of jobs during normal business hours, when the cluster is otherwise busy, by dedicating a single compute node to this purpose. It can be frustrating to wait hours for a job to launch on a busy cluster only to have it fail immediately because of a typo in the job submission script. The test queue is recommended for a quick sanity check of a new or altered job submission script, or for trying out some small jobs to get the hang of the job submission system.

To use this queue, simply add -q test to your qsub job submission command to divert the job to the test queue. This queue is usually lightly loaded, and should give very fast turnaround.
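
For example, to send the job script from earlier to the test queue:

qsub -q test my_program.com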

To ensure fast turnaround, jobs submitted to the test queue are limited to a maximum of 5 minutes run time. Jobs running for more than 5 minutes in this queue will be automatically terminated.

The test queue is available only on a single dedicated compute node which has 16 cores, 64G of memory and node_type 10Geth64G.




 The night queue

Outside of normal business hours, the compute node dedicated to the test queue also offers a further queue. The night queue has been set up to offer reasonable turnaround for short-duration pilot or test jobs: jobs of up to 30 minutes' run time can be submitted to it. To prevent undue delays to users of the test queue, the night queue is active only between 18:00 and 08:00; jobs submitted to it outside these hours will wait until the queue next becomes active.

To submit jobs to the night queue, simply add -q night to your qsub job submission command.
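
For example:

qsub -q night my_program.com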


