Submitting Jobs using PBS on Linux Clusters

On the Linux clusters, job submission is handled through PBS. This page covers the following topics:

  • Submitting a Batch Job
  • Submitting Multiple Dependent Jobs
  • Interactive Parallel Sessions
  • Useful PBS Commands

List of useful PBS directives and their meanings (a minimal example header follows the list):

  • #PBS -q queuename: Submit job to the queuename queue.
    • Allowed values for queuename: single, workq, checkpt.
    • Depending on the cluster, additional values such as gpu, lasigma, mwfa and bigmem may also be allowed.
  • #PBS -A allocationname: Charge jobs to your allocation named allocationname.
  • #PBS -l walltime=hh:mm:ss: Request resources to run job for hh hours, mm minutes and ss seconds.
  • #PBS -l nodes=m:ppn=n: Request m nodes with n processors per node.
  • #PBS -N jobname: Give your job the name jobname so that it can be identified when monitoring it with the qstat command.
  • #PBS -o filename.out: Write PBS standard output to file filename.out.
  • #PBS -e filename.err: Write PBS standard error to file filename.err.
  • #PBS -j oe: Combine PBS standard output and standard error into the same file. Note that with this directive you need either the #PBS -o or the #PBS -e directive, not both.
  • #PBS -m status: Send an email when the job reaches status status. Allowed values for status are
    • a: when job aborts
    • b: when job begins
    • e: when job ends
    • The values can be combined; e.g., abe will send email when the job begins and again when it aborts or ends.
  • #PBS -M your_email_address: Address to which email is sent when the -m directive above is triggered.
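
The two full example scripts later on this page show these directives in context. For quick reference, a minimal job-script header combining several of them might look like the following sketch (the queue, allocation code, paths and email address are placeholders):

 #!/bin/bash
 #PBS -q checkpt
 #PBS -A your_allocation_code
 #PBS -l nodes=1:ppn=4
 #PBS -l walltime=02:00:00
 #PBS -N MyTestJob
 #PBS -o /scratch/myName/MyTestJob.out
 #PBS -j oe
 #PBS -m abe
 #PBS -M myName@example.edu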

List of useful PBS environment variables and their meanings (a short usage sketch follows the list):

  • PBS_O_WORKDIR: Directory where the qsub command was executed
  • PBS_NODEFILE: Name of the file that contains a list of the HOSTS provided for the job
  • PBS_JOBID: Job ID number given to this job
  • PBS_QUEUE: Queue job is running in
  • PBS_WALLTIME: Requested walltime, in seconds
  • PBS_JOBNAME: Name of the job. This can be set using the -N option in the PBS script
  • PBS_ENVIRONMENT: Indicates job type, PBS_BATCH or PBS_INTERACTIVE
  • PBS_O_SHELL: Value of the SHELL variable in the environment in which qsub was executed
  • PBS_O_HOME: Home directory of the user running qsub
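
As a short sketch of how these variables are typically used inside a job script:

 # Run from the directory the job was submitted from.
 cd $PBS_O_WORKDIR

 # Count the processors assigned to this job from the node file.
 NPROCS=$(wc -l < $PBS_NODEFILE)

 # Record some job information in the output for later reference.
 echo "Job $PBS_JOBID ($PBS_JOBNAME) running in queue $PBS_QUEUE with $NPROCS processes"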

Submitting a batch job

The current batch job manager on the Dell Linux clusters is PBS. To send a batch job to PBS, users need to write a script, readable by PBS, that specifies their needs. A PBS script is basically a shell script that contains embedded information for PBS. The PBS information takes the form of special comment lines which start with #PBS and continue with PBS-specific options.

Two example scripts, with comments, illustrate how this is done. To set the context, we'll assume the user name is myName and the script file is named myJob.

1. A Serial Job Script (One Process)

To run a serial job with PBS, you might create a bash shell script named myJob with the following contents:

 #!/bin/bash
 #
 # All PBS instructions must come at the beginning of the script, before
 # any executable commands occur.
 #
 # Start by selecting the "single" queue, and providing an allocation code.
 #
 #PBS -q single
 #PBS -A your_allocation_code
 #
 # To run a serial job, a single node with one process is required.
 #
 #PBS -l nodes=1:ppn=1
 # 
 # We then indicate how long the job should be allowed to run in terms of
 # wall-clock time. The job will be killed if it tries to run longer than this.
 #
 #PBS -l walltime=00:10:00
 # 
 # Tell PBS the name of a file to write standard output to, and that standard
 # error should be merged into standard output.
 #
 #PBS -o /scratch/myName/serial/output
 #PBS -j oe
 #
 # Give the job a name so it can be found readily with qstat.
 #
 #PBS -N MySerialJob
 #
 # That is it for PBS instructions. The rest of the file is a shell script.
 # 
 # PLEASE ADOPT THE EXECUTION SCHEME USED HERE IN YOUR OWN PBS SCRIPTS:
 #
 #   1. Copy the necessary files from your home directory to your scratch directory.
 #   2. Execute in your scratch directory.
 #   3. Copy any necessary files back to your home directory.

 # Let's mark the time things get started with a date-time stamp.

 date

 # Set some handy environment variables.

 export HOME_DIR=/home/myName/serial
 export WORK_DIR=/scratch/myName/serial
 
 # Make sure the WORK_DIR exists:

 mkdir -p $WORK_DIR

 # Copy files, jump to WORK_DIR, and execute a program called "demo"

 cp $HOME_DIR/demo $WORK_DIR
 cd $WORK_DIR
 ./demo

 # Mark the time it finishes.

 date

 # And we're out'a here!

 exit 0

Once the contents of myJob meet your requirements, the job can be submitted with the qsub command like so:

qsub myJob
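
If the submission succeeds, qsub prints the ID assigned to the job (the ID shown below is hypothetical), which can then be used to monitor the job with qstat, described under Useful PBS Commands:

 $ qsub myJob
 12345.tezpur2
 $ qstat 12345.tezpur2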

Back to Top

2. A Parallel Job Script (Multiple Processes)

To run a parallel job, you would follow much the same process as the previous example. This time the contents of your file myJob would contain:

 #!/bin/bash
 #
 # Use "workq" as the job queue, and specify the allocation code.
 #
 #PBS -q workq
 #PBS -A your_allocation_code
 # 
 # Assuming you want to run 16 processes, and each node supports 4 processes, 
 # you need to ask for a total of 4 nodes. The number of processes per node 
 # will vary from machine to machine, so double-check that you have the right
 # values before submitting the job.
 #
 #PBS -l nodes=4:ppn=4
 # 
 # Set the maximum wall-clock time. In this case, 10 minutes.
 #
 #PBS -l walltime=00:10:00
 # 
 # Specify the name of a file which will receive all standard output,
 # and merge standard error with standard output.
 #
 #PBS -o /scratch/myName/parallel/output
 #PBS -j oe
 # 
 # Give the job a name so it can be easily tracked with qstat.
 #
 #PBS -N MyParJob
 #
 # That is it for PBS instructions. The rest of the file is a shell script.
 # 
 # PLEASE ADOPT THE EXECUTION SCHEME USED HERE IN YOUR OWN PBS SCRIPTS:
 #
 #   1. Copy the necessary files from your home directory to your scratch directory.
 #   2. Execute in your scratch directory.
 #   3. Copy any necessary files back to your home directory.

 # Let's mark the time things get started.

 date

 # Set some handy environment variables.

 export HOME_DIR=/home/myName/parallel
 export WORK_DIR=/scratch/myName/parallel
 
 # Set a variable that will be used to tell MPI how many processes will be run.
 # This makes sure MPI gets the same information provided to PBS above.

 export NPROCS=`wc -l $PBS_NODEFILE | gawk '//{print $1}'`
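 # (An equivalent, simpler alternative would be: export NPROCS=$(wc -l < $PBS_NODEFILE))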

 # Copy the files, jump to WORK_DIR, and execute! The program is named "hydro".

 cp $HOME_DIR/hydro $WORK_DIR
 cd $WORK_DIR
 mpirun -machinefile $PBS_NODEFILE -np $NPROCS $WORK_DIR/hydro

 # Mark the time processing ends.

 date
 
 # And we're out'a here!

 exit 0

Once the file myJob contains all the information for the desired parallel process, it can be submitted with qsub, just as before:

 qsub myJob

Back to Top

3. Shell Environment Variables

Users with more experience writing shell scripts can take advantage of additional shell environment variables which are set by PBS when the job begins to execute. Those interested are directed to the qsub man page for a list and descriptions.

Back to Top

4. Last line issue in PBS job script

Due to a PBS scheduler issue, please always make sure there is a newline at the end of the job script; otherwise the last command line of the script may be ignored by the scheduler. For example, the line myjob.exe in the job script below will be ignored by the PBS scheduler.

 #!/bin/bash
 #PBS -l nodes=1:ppn=20
 #PBS -l walltime=1:00:00
 #PBS -q workq
 #PBS -A allocation_name
 
 myjob.exe(END_OF_FILE)
 

Adding a newline at the end of the file resolves the issue:

 #!/bin/bash
 #PBS -l nodes=1:ppn=20
 #PBS -l walltime=1:00:00
 #PBS -q workq
 #PBS -A allocation_name
 
 myjob.exe
 (END_OF_FILE)
 

Back to Top


Users may direct questions to sys-help@loni.org.

PBS Job Chains and Dependencies

Quite often, a single simulation requires multiple long runs which must be processed in sequence. One method for creating such a sequence of batch jobs is to have each job execute the "qsub" or "llsubmit" command to submit its successor. We strongly discourage such recursive, or "self-submitting," scripts: if a job hits its time limit, the batch system kills it, and the command that would submit the subsequent job is never executed.

In PBS, you can use the "qsub -W depend=..." option to create dependencies between jobs.

qsub -W depend=afterok:<Job-ID> <QSUB SCRIPT>

Here, the batch script <QSUB SCRIPT> is queued immediately, but it will not become eligible to run until the job <Job-ID> has completed successfully. Useful options to "depend=..." are:

  • afterok:<Job-ID> Job is scheduled if the Job <Job-ID> exits without errors or is successfully completed.
  • afternotok:<Job-ID> Job is scheduled if the Job <Job-ID> exited with errors.
  • afterany:<Job-ID> Job is scheduled if the Job <Job-ID> exits with or without errors.

One way to simplify this process is to write multiple batch scripts, e.g. job1.pbs, job2.pbs, job3.pbs, and submit them using the following script:

#!/bin/bash

FIRST=$(qsub job1.pbs)
echo $FIRST
SECOND=$(qsub -W depend=afterany:$FIRST job2.pbs)
echo $SECOND
THIRD=$(qsub -W depend=afterany:$SECOND job3.pbs)
echo $THIRD

Modify the script according to the number of chained jobs required. The job <$FIRST> will be placed in the queue, while the jobs <$SECOND> and <$THIRD> will be placed in the queue with the "Not Queued" (NQ) flag in Batch Hold. When <$FIRST> completes, the NQ flag is replaced with the "Queued" (Q) flag and the next job is moved to the active queue.
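
To confirm that a dependency has been registered, you can inspect the held job's full status with qstat -f (a sketch; the exact attribute formatting may vary with the PBS/Torque version):

 qstat -f $SECOND | grep -i depend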

A few words of caution: if you list the dependency as "afterok" and the job exits with errors (or "afternotok" and it exits without errors), the subsequent jobs will be killed because the dependency was not met.


Users may direct questions to sys-help@loni.org.

Interactive Parallel Sessions

An interactive session is a set of compute nodes that allows you to interact manually (via a shell, etc.) with your programs while taking advantage of dedicated multiple processors/nodes. This is useful for development, debugging, running long sequential jobs, and testing. The following is meant to be a quick guide on how to obtain such a session on the various LSU/LONI resources:

Note 1: these methods should work for all the Linux clusters on LONI/LSU, but the host names (e.g., tezpur.hpc.lsu.edu is used as the host name in the following) will need to reflect the machine that is being used. This is also the case with the ppn= (processors per node) keyword value (e.g., QB2 would be ppn=20).

Note 2: the commands below conform to the bash shell syntax. Your mileage may differ if you use a different shell.

Note 3: this method will require opening 2 terminal windows.

1. Interactive Method

1. In the terminal 1 window, login to the head node of the desired x86 Linux cluster:

 ssh -XY username@tezpur.hpc.lsu.edu

2. Once logged onto the head node, the next step is to reserve a set of nodes for interactive use. This is done by issuing a qsub command similar to the following:

 $ qsub -I -A allocation_account -V -l walltime=HH:MM:SS,nodes=NUM_NODEs:ppn=4 
  • HH:MM:SS - length of time you wish to use the nodes (resource availability applies as usual).
  • NUM_NODEs - the number of nodes you wish to have.
  • ppn - must match the number of cores available per node (system dependent).
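
For example, to request one node with four processors for one hour (the allocation name is a placeholder):

 $ qsub -I -A my_allocation -V -l walltime=01:00:00,nodes=1:ppn=4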

You will likely have to wait a bit for a node, and you will see a "waiting for job to start" message in the meantime. Once a prompt appears, the job has started.

3. After the job has started, the next step is to determine which nodes have been reserved for you. To do this, find the node list file whose path is set for you in the PBS_NODEFILE environment variable by the PBS system. One way to do this, and an example result, is:

 
$ printenv PBS_NODEFILE
/var/spool/torque/aux/xyz.tezpur2 

Here xyz is a string of digits representing the job number on tezpur.
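
To see the actual list of hosts assigned to the job, print the contents of that file; it contains one line per assigned processor slot (the host names shown are hypothetical):

 $ cat $PBS_NODEFILE
 tezpur101
 tezpur101
 tezpur101
 tezpur101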

4. Your terminal 1 session is now connected to the rank 0, or primary, compute node. You should now determine its host name:

 $ hostname
 tezpurIJK

Where IJK is a 3 digit number.

5. To actually begin using the node, repeat step 1 in a second terminal, terminal 2. Once logged onto the head node, connect from there to the node determined in step 4. The two steps would look like:

 On your client:   $ ssh -XY username@tezpur.hpc.lsu.edu
 On the head node: $ ssh -XY tezpurIJK

You have two ways to approach the rest of this process, depending on which terminal window you want to enter commands in.

Back to Top

2. Using Terminal 2

6. In terminal 2, set the environment variable PBS_NODEFILE to match what you found in step 3:

 $ export PBS_NODEFILE=/var/spool/torque/aux/xyz.tezpur2

7. Now you are set to run any programs you wish, using terminal 2 for your interactive session. All X11 windows will be forwarded from the main compute node to your client PC for viewing.

8. The "terminal 2" session can be terminated and re-established, as needed, so long as the PBS job is still running. Once the PBS job runs out of time, or the "terminal 1" session exits, the reserved nodes will be released, and the process must be repeated from step 1 to start another session.

Back to Top

3. Using Terminal 1

6. In terminal 2, determine the value of the environment variable DISPLAY, like so:

 $ printenv DISPLAY
 localhost:IJ.0

Here IJ is some set of digits.

7. Now, in terminal 1, set the environment variable DISPLAY to match:

 $ export DISPLAY=localhost:IJ.0

8. At this point, use terminal 1 for your interactive session commands; all X11 windows will be forwarded from the main compute node to the client PC.

9. The "terminal 2" session can be terminated and re-established, as needed, so long as the PBS job is still running. Once the PBS job runs out of time, or the "terminal 1" session exits, the reserved nodes will be released, and the process must be repeated from step 1 to start another session.

Back to Top

4. The Batch Method

Sometimes an interactive session is not sufficient. In this case, it is possible to latch on to a batch job submitted in the traditional sense. This example shows how to reserve a set of nodes via the batch scheduler. Interactive access to the machine, with a properly set environment, is accomplished by taking the following steps.

Note: this method only requires 1 terminal.

1. Login to the head node of the desired x86 Linux cluster:

 $ ssh -XY username@tezpur.hpc.lsu.edu

2. Once on the head node, create a job script, calling it something like interactive.pbs, containing the following. This is a job that simply sleeps in a loop to stay alive:

#!/bin/sh
#PBS -A allocation_account
echo "Changing to directory from which script was submitted."
cd $PBS_O_WORKDIR
# create bash/sh environment source file
H=`hostname`
# -- add host name as top line
echo "# main node: $H"  > ${PBS_JOBID}.env.sh
# -- dump raw env
env | grep PBS         >> ${PBS_JOBID}.env.sh
# -- cp raw to be used for csh/tcsh resource file
cp ${PBS_JOBID}.env.sh ${PBS_JOBID}.env.csh
# -- convert *.sh to sh/bash resource file
perl -pi -e 's/^PBS/export PBS/g' ${PBS_JOBID}.env.sh
# -- convert *.csh to csh/tcsh resource file
perl -pi -e 's/^PBS/setenv PBS/g' ${PBS_JOBID}.env.csh
perl -pi -e 's/=/ /g' ${PBS_JOBID}.env.csh
# -- entering into idle loop to keep job alive
while [ 1 ]; do
  sleep 10 # in seconds
  echo hi... > /dev/null
done

3. Submit the script saved in step #2:

 $ qsub -V -l walltime=00:30:00,nodes=1:ppn=4 interactive.pbs

4. You can check when the job starts using qstat; when it does, the following happens:

  • 2 files are created in the current directory that contain the required environmental variables:
    • <jobid>.env.sh
    • <jobid>.env.csh
  • the job is kept alive by the idle while loop

5. Determine the main compute node being used by the job by inspecting the top line of either of the two environment files:

 $ head -n 1 <jobid>.env.sh
 # main node: tezpurIJK

Where IJK is some set of digits.

6. Log in to the host identified in step 5, and be sure to note the directory from which the job was submitted:

 $ ssh -XY tezpurIJK

7. Source the proper shell environment:

 $ . /path/to/<jobid>.env.sh

8. Ensure that all the PBS_* environment variables are set. For example:

 $ env | grep PBS 
 PBS_JOBNAME=dumpenv.pbs
 PBS_ENVIRONMENT=PBS_BATCH
 PBS_O_WORKDIR=/home/estrabd/xterm
 PBS_TASKNUM=1
 PBS_O_HOME=/home/estrabd
 PBS_MOMPORT=15003
 PBS_O_QUEUE=workq
 PBS_O_LOGNAME=estrabd
 PBS_O_LANG=en_US.UTF-8
 PBS_JOBCOOKIE=B413DC38832A165BA0E8C5D2EC572F05
 PBS_NODENUM=0
 PBS_O_SHELL=/bin/bash
 PBS_JOBID=9771.tezpur2
 PBS_O_HOST=tezpur2
 PBS_VNODENUM=0
 PBS_QUEUE=workq
 PBS_O_MAIL=/var/spool/mail/estrabd
 PBS_NODEFILE=/var/spool/torque/aux//9771.tezpur2
 PBS_O_PATH=... # not shown due to length

9. Now this terminal can be used for interactive commands; all X11 windows will be forwarded from the main compute node to the client PC.

Back to Top

The methods outlined above are particularly useful with the debugging tutorial.

Back to Top


Users may direct questions to sys-help@loni.org.

Useful PBS Commands

1. qsub for submitting jobs

The command qsub is used to send a batch job to PBS. The basic usage is

qsub pbs.script 

where pbs.script is the script users write to specify their needs. qsub also accepts command-line arguments, which override those specified in the script. For example, the following command

qsub -A my_LONI_allocation2 myscript

will direct the system to charge SUs (service units) to the allocation my_LONI_allocation2 instead of the allocation specified in myscript.

Back to Top

2. qstat for checking job status

The command qstat is used to check the status of PBS jobs. The simplest usage is

qstat

which gives information similar to the following:

Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
2572.eric2          s13pic           cott            00:00:00 R checkpt
2573.eric2          s13pib           cott            00:00:00 R checkpt
2574.eric2          BHNS02_singleB   palenzuela             0 Q checkpt
2575.eric2          BHNS02_singleC   palenzuela      00:00:00 R checkpt
2576.eric2          BHNS02_singleE   palenzuela      00:00:00 R checkpt
2577.eric2          BHNS02_singleF   palenzuela      00:00:00 R checkpt
2578.eric2          BHNS02_singleD   palenzuela      00:00:00 R checkpt
2580.eric2          s13pia           cott                   0 Q workq

The first through sixth columns show the ID of each job, its name, its owner, the CPU time used, its status (R means running, Q means queued), and the queue it is in. qstat also accepts command-line arguments; for instance, the following usage gives more detailed information about jobs.

[ou@eric2 ~]$ qstat -a
eric2:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
2572.eric2           cott     checkpt  s13pic      28632     6   1    --  48:00 R 24:51
2573.eric2           cott     checkpt  s13pib      13753     6   1    --  48:00 R 15:29
2574.eric2           palenzue checkpt  BHNS02_sin    --      8   1    --  48:00 Q   --
2575.eric2           palenzue checkpt  BHNS02_sin  10735     8   1    --  48:00 R 08:04
2576.eric2           palenzue checkpt  BHNS02_sin  30726     8   1    --  48:00 R 07:52
2577.eric2           palenzue checkpt  BHNS02_sin  24719     8   1    --  48:00 R 07:51
2578.eric2           palenzue checkpt  BHNS02_sin  23981     8   1    --  48:00 R 07:31
2580.eric2           cott     workq    s13pia        --      6   1    --  48:00 Q   --

Back to Top

3. qdel for cancelling a job

To cancel a PBS job, enter the following command.

qdel job_id [job_id] ...
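
For example, to cancel job 2572.eric2 from the qstat listing shown above:

 qdel 2572.eric2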

Back to Top

4. qfree to query free nodes in PBS

One useful command for users to schedule their jobs in an optimal way is "qfree", which shows free nodes in each queue. For example,

[ou@eric2 ~]$ qfree
PBS total nodes: 128,  free: 14,  busy: 111 *3,  down: 3,  use: 86%
PBS checkpt nodes: 128,  free: 14,  busy: 98
PBS workq nodes: 64,  free: 14,  busy: 10
PBS single nodes: 16,  free: 14,  busy: 1
(Highest priority job on queue workq will start in 6:47:09)

shows that there are 14 free nodes in total, and that they are available to all three queues: checkpt, workq and single.

Back to Top

5. showstart for estimating the starting time for a job

The command showstart can be used to get a rough estimate of when your job will start. The basic usage is

showstart job_id

The following shows a simple example:

[ou@eric2 ~]$ showstart 2928.eric2
job 2928 requires 16 procs for 1:00:00:00 
Estimated Rsv based start in                 7:28:18 on Wed Jun 27 16:46:21
Estimated Rsv based completion in         1:07:28:18 on Thu Jun 28 16:46:21 
Best Partition: base

Back to Top


Users may direct questions to sys-help@loni.org.

Job Queuing Priority

The queuing system schedules jobs based on a job priority that takes into account several factors. Jobs with a higher priority are scheduled ahead of jobs with a lower priority. The scheduler also has a backfill capability for jobs that are short in duration or require only a small number of nodes: it runs such small jobs while waiting for the start time of a large job that requires many nodes. In determining which jobs to run first, Moab uses the following formula to calculate job priority:

Job priority = credential priority + fairshare priority + resource priority + service priority

(1) Credential Priority Subcomponent:

credential priority = credweight * (userweight * job.user.priority)
                    = 100 * (10 * 100) = 100000 (a constant)

(2) Fairshare Priority Subcomponent:

fairshare priority = fsweight * min(fscap, (fsuserweight * DeltaUserFSUsage))
                   = 100 * (10 * DeltaUserFSUsage)

A user's fair share usage is the sum over seven days of the daily processor-seconds used, each day weighted by the daily decay factor, divided by the sum over seven days of the total daily processor-seconds used, weighted the same way. The decay factor is 0.9. DeltaUserFSUsage is the fair share target percentage for each user (20 percent) minus the calculated fair share usage percentage; in other words, the target percentage minus the percentage actually used. For a user who has not used the cluster for a week:

fairshare priority = 100 * (10 * 20) = 20000

(3) Resource Priority Subcomponent:

resource priority = resweight * min(rescap, procweight * TotalProcessorsRequested)
                  = 30 * min(3840, 10 * TotalProcessorsRequested)

For instance, for a 32 processor job:

resource priority = 30 * 10 * 32 = 9600

(4) Service Priority Subcomponent:

service priority = serviceweight * (queuetimeweight * QUEUETIME + xfactorweight * XFACTOR)
                 = 2 * (2 * QUEUETIME + 20 * XFACTOR)

QUEUETIME is the time the job has been queued, in minutes, and XFACTOR = 1 + QUEUETIME / WALLTIMELIMIT.

For a one hour job in the queue for one day:

service priority = 2 * (2 * 1440 + 20 * (1 + 1440 / 60))
                 = 2 * (2880 + 500) = 6760
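
Putting the sample subcomponents above together for a hypothetical job that matches all three examples (a 32-processor, one-hour job from a user who has not run in a week and that has waited in the queue for one day), the total would be:

Job priority = 100000 + 20000 + 9600 + 6760 = 136360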

These factors are adjusted as needed to make jobs of all sizes start fairly.