Job Execution

Overview

When a batch job is submitted from a work server, scheduling is handled by LSF (Platform Load Sharing Facility, developed by IBM), which dispatches and executes the batch job on a calculation server. When submitting a job, you must choose an appropriate queue.

※See here for information on batch queues for workgroups.

Job Submission

Note

Info

If you submit more than 100 jobs whose run time is very short (less than 30 seconds), merge them into a shell script and submit that script to LSF. (If you submit the jobs to the s queue, you can use LSF Session Scheduler.) When many submitted jobs have a run time shorter than the processing interval of the LSF manager service, they have a significant impact on system performance.
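As a minimal sketch of the merging approach, the following helper runs every command listed in a text file one after another, so that many short commands become a single LSF job. The function name run_all and the file names run_all.sh and commands.txt are assumptions for illustration; they are not part of KEKCC or LSF.

```shell
# run_all FILE: run each non-empty line of FILE as a command, in order.
# (Hypothetical helper; the name and the file layout are illustrative.)
run_all() {
    local list=$1 line
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue
        # Keep the command from consuming the rest of the list via stdin.
        sh -c "$line" < /dev/null
    done < "$list"
}

# If this function is saved in a script (e.g. run_all.sh) that ends with
#   run_all "$1"
# the whole list can be submitted once instead of once per command:
#   bsub -q s ./run_all.sh commands.txt
```

Running the commands sequentially keeps the job within one slot; commands that fail do not stop the remaining ones in this sketch.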

Info

When memory usage on a server exceeds 90%, processes that use a large amount of memory may be forcibly terminated to protect the system. We appreciate your understanding.

Info

If the free space in your home directory is less than 100MB, you temporarily cannot submit jobs; this prevents capacity depletion and job failure. You can submit jobs again once the free space in your home directory exceeds 100MB. Please check the status of your home directory with the Disk Quotas command.

Submit a Job

To submit a job into LSF, run the following command.

$> bsub [-q queue name -m "hostname" ] job name↓

［Example］

$> bsub -q s testjobs↓

Normally, it is not necessary to specify a host name (i.e. the host where the job is executed) when submitting a job. LSF manages the selection of appropriate calculation servers.
You cannot submit a job from a calculation server.

This system also supports jobs that use MPI. See Executing MPI Jobs.

Sending completion notice by email

Info

If a large number of notification emails is detected, the system administrator will delete them.

To receive a completion notification from a job by email, specify the options as shown below.
You must also specify the -o option, which writes the results to a file, when you want to get notifications.
For details, see Standard Output/Standard Error.

$> bsub -q s -u "email address" -o "filename or directoryname" -N jobname↓

［Example］

$> bsub -q s -u testuser01@post.kek.jp -o lsflog/result.log -N testjobs↓

Standard Output/Standard Error

The HSM area is not suitable for job output.
If you do either of the following:

Specify output to the HSM area
Execute bsub in the HSM area

bsub will fail with the following messages:

--------------------------------------------------------- 
HSM area is not suitable to output logfiles from jobs.
It may cause a system crash.
Would you check your jobs and job-submission environment?
See: http://kekcc.kek.jp/service/kekcc/html/Eng/JobExecution.html#sa89ac76
Thank you for your understanding and cooperation.
---------------------------------------------------------

Default

By default, the output file is “Job-ID.out” under the directory .lsf in your home directory.

Example： Group: CE, User: testuser1, Job ID: 111
/home/ce/testuser1/.lsf/111.out

File Specification

You can specify a file name or a directory for the output file with the "-o" option. When only a directory is specified, the output file name is "Job-ID.out".

 $> bsub -o "filename or directoryname" jobname↓

［Example］

$> bsub -o lsflog/ tetestjobs↓

※ The output file is created under lsflog, which is a sub-directory of the current directory. The file name is "Job-ID.out".

［Example2］

$> bsub -o lsflog/result.log tetestjobs2↓

※ The output file is created under lsflog, which is a sub-directory of the current directory. The file name is "result.log".

In case the specified output file already exists

If the same output file name is specified for several different jobs, the results are appended to the end of the existing file.
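If appending is not wanted, one option is to put the job ID into the file name. The sketch below relies on the %J placeholder of bsub, which LSF replaces with the job ID. The helper name submit_with_log and the DRY_RUN switch are hypothetical, and creating the directory beforehand is a precaution rather than a documented requirement of this system.

```shell
# submit_with_log JOB [LOGDIR]: give each job its own output file.
# With DRY_RUN set, the bsub command is printed instead of executed.
# (Hypothetical helper for illustration.)
submit_with_log() {
    local job=$1 logdir=${2:-lsflog}
    mkdir -p "$logdir"
    # %J is expanded by LSF to the job ID, e.g. lsflog/111.out
    ${DRY_RUN:+echo} bsub -o "$logdir/%J.out" "$job"
}

# Usage:
#   DRY_RUN=1 submit_with_log testjobs   # only shows the command
#   submit_with_log testjobs             # submits the job
```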

Limitation of output file size

With both Default and File Specification, the system limits the output file size to 200MB.
To avoid this limit, write the output from within your program or use shell redirection.

[Example] with csh

$> bsub "testjob >& lsflog/result.log"↓

[Example] with bash

$> bsub "testjob > lsflog/result.log 2>&1"↓

Job Status

You can check the job status with the bjobs command.

$> bjobs jobID↓

[Example]

$> bjobs 111↓

Once your jobs have started, you can check their status with the command "/usr/local/bin/chk_runjob".
※This information is updated once per minute.
※Multi-core/multi-thread jobs, such as MPI/OpenMP, are not supported.

$> chk_runjob
 JOBS  SLOT    CPUTIME    RUNTIME  CPU/RUN
 1500  1500  125649261  126740225    0.991

You can list your jobs in order of STARTTIME with the "-l NUM" option.
NUM is the number of jobs to list (default: show all).

$> chk_runjob -l 5
 JOBS  SLOT    CPUTIME    RUNTIME  CPU/RUN
 1500  1500  125649261  126740225    0.991

RUN JOB SUMMARY:
JOBID     QUEUE      EXEC_HOST    STARTTIME       CPUTIME    RUNTIME     UTIL    EXEC_JOB
30351511  l          cb241        03/31-09:22:21  023:22:00  023:31:09   99.35%  java
30351564  l          cb241        03/31-09:22:42  023:21:36  023:30:48   99.35%  java
30351774  l          cb241        03/31-09:23:50  023:18:59  023:29:40   99.24%  java
30351978  l          cb241        03/31-09:25:03  023:19:30  023:28:27   99.36%  java
30352009  l          cb241        03/31-09:25:23  023:16:06  023:28:07   99.15%  java

You can get details of your jobs with the "bjobs -l JOBID" command.

In the following cases, execution efficiency becomes very low and CPU/RUN also falls.

Mass or continuous submission of extremely short jobs (10-30 seconds or less).
Submission of jobs that cannot run successfully (e.g. a mistake in the command).
Too many jobs write to one directory.
Too many jobs read one file concurrently.
Too many files in $HOME/.lsf/ .
Home directory is full. You can check this with the "rquota", "bquota" (only for Belle users) or "hquota" command.

If CPU/RUN is below 0.6, please investigate the cause and improve the utilization efficiency.
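The CPU/RUN value is CPUTIME divided by RUNTIME. As a rough sketch, the hypothetical function below recomputes the ratio from chk_runjob output and marks it against the 0.6 threshold. It assumes the column layout of the sample output above (header on the first line, totals on the second); the function name check_efficiency is an assumption for illustration.

```shell
# check_efficiency: read chk_runjob output on stdin and print the
# CPU/RUN ratio followed by OK or LOW (LOW means below 0.6).
# Assumes line 2 holds: JOBS SLOT CPUTIME RUNTIME CPU/RUN
check_efficiency() {
    awk 'NR == 2 && $4 > 0 {
        ratio = $3 / $4
        status = (ratio < 0.6) ? "LOW" : "OK"
        printf "%.3f %s\n", ratio, status
    }'
}

# Usage:
#   chk_runjob | check_efficiency
```

For the sample above, 125649261 / 126740225 gives 0.991, so the function prints "0.991 OK".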

Also, you can monitor the standard output during the job execution with the bpeek command.

$> bpeek jobID↓

［Example］

$> bpeek 111↓

You can get information on jobs with the "bhist -l" command after the job has finished.
If you get the message "No matching job found", you can add the option "-n 0" to look in all the previous LSF logs.

$> bhist -l jobID↓

［Example］

$> bhist -l 111↓

You can also get information on jobs with the "bacct -l" command.

$> bacct -l jobID↓

［Example］

$> bacct -l 111↓

Host Status

You can display load information for hosts using the lsload command.

$> lsload↓

［Example］

$> lsload↓

HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
cb042               ok   0.0   0.0   0.0   0%   0.0   0  2714   40G 23.9G  822G
cb013               ok   0.0   0.0   0.0   0%   0.0   0  2273   40G 23.9G  820G
cb019               ok   0.0   0.0   0.0   0%   0.0   0  2283   40G 23.9G  820G
.....

You can check the status of each host using the bhosts command.

$> bhosts↓

［Example］

$> bhosts↓

HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
cb001              ok              -    192      0      0      0      0      0
cb002              ok              -    192      0      0      0      0      0
cb003              ok              -    192      0      0      0      0      0
cb004              ok              -    192    100    100      0      0      0
cb005              ok              -    192      0      0      0      0      0
.....

Cancellation

You can cancel jobs using the bkill command.

$> bkill jobID↓

When you set "0" instead of "jobID", all jobs you submitted will be canceled.

［Example］

$> bkill 111↓

If bkill fails to kill a job, the "-r" option can be used to force the termination. This option applies to situations such as:

The bkill command returns the message "Job is being terminated"
The bjobs command keeps showing the job that should have been terminated

$> bkill -r jobID↓

［Example］

$> bkill -r 111↓

Termination of jobs by system

The system terminates jobs that exceed the following thresholds.

Reason of termination	Messages in bacct ⁴
Reached maximum normalized execution time ¹	Completed ; TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Reached maximum normalized CPU time ¹	Completed ; TERM_CPULIMIT: job killed after reaching LSF CPU usage limit.
Reached maximum resident set size (RSS) of a job ¹	Completed ; TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Reached maximum virtual memory of a job ¹	Completed ; TERM_SWAPLIMIT: job killed after reaching LSF memory usage limit.
Reached maximum processes of a job ¹	Completed ; TERM_PROCESSLIMIT: job killed after reaching LSF process limit.
Reached maximum STDOUT output size ²	Completed ; TERM_FORCE_ADMIN: job killed by root or LSF administrator without time for cleanup.
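To find out which of these limits ended a job, the TERM_ keyword can be picked out of the bacct output. The following is a small sketch; the function name term_reason is hypothetical, and it assumes the message format shown in the table above.

```shell
# term_reason: read "bacct -l" output on stdin and print the first
# TERM_... keyword found (prints nothing if there is none).
term_reason() {
    grep -o 'TERM_[A-Z_]*' | head -n 1
}

# Usage:
#   bacct -l 111 | term_reason
```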

Scheduling Policies

LSF supports the "Fairshare" function in order to provide fair access to resources. Fairshare defines the priorities of the jobs that are dispatched. The following command shows the current Fairshare values.

$> bqueues -l queue_name↓

［Example］

$> bqueues -l s↓

SHARE_INFO_FOR: s/
 USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME   ADJUST  GPU_RUN_TIME
 user-a          1       0.333      0        0         0.0        0      0.000             0
 user-b          1       0.333      0        0         0.0        0      0.000             0
 user-c          1       0.333      0        0         0.0        0      0.000             0
 user-d          1       0.012      1        0       872.8   399993      0.000             0
 user-e          1       0.008      1        0         0.0   574239      0.000             0
 user-f          1       0.005      2        0      2588.0  1003056      0.000             0
 user-g          1       0.004     31        0     36821.6   667907      0.000             0
 user-h          1       0.003     98        0     94953.1   432167      0.000             0
 user-i          1       0.001     17        0     17307.1  3447528      0.000             0
 user-j          1       0.000     27        0     23340.3 12779340      0.000             0
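When the list is long, a single user's row can be filtered out of it. This is a sketch only: the function name my_share is hypothetical, and it assumes the column order of the sample above (USER/GROUP, SHARES, PRIORITY, STARTED, ...).

```shell
# my_share [USER]: read "bqueues -l" output on stdin and print the
# Fairshare priority and the number of started jobs for one user.
my_share() {
    local user=${1:-$USER}
    awk -v u="$user" '$1 == u { print "priority=" $3, "started=" $4 }'
}

# Usage:
#   bqueues -l s | my_share user-d
```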

Guaranteed Resource Allocation (GRA)

The GRA (Guaranteed Resource Allocation) policy is applied to the queue "a".
Jobs in this queue are dispatched in preference to those in other queues, but the number of running jobs per user is limited to 4 (JL/U).
This queue is useful for jobs that need quick results while jobs in the normal queues are waiting for dispatch.
Note that the queue parameters are the same as those of the queue "l".

How to use a large amount of memory

In the queues s, l, h, p, z, cmb_p, th, p400 and g, the memory available to a job depends on the number of slots it requests. According to the amount of memory you need, specify the bsub option -n "parallel number X" with X within the TASKLIMIT setting value, and submit the job.

［Example/s queue］

TASKLIMIT is set to 6.

The available memory is as follows.

Specify 1 -> 8GB * 1 = MEMLIMIT =  8GB
Specify 2 -> 8GB * 2 = MEMLIMIT = 16GB
Specify 3 -> 8GB * 3 = MEMLIMIT = 24GB
Specify 4 -> 8GB * 4 = MEMLIMIT = 32GB
Specify 5 -> 8GB * 5 = MEMLIMIT = 40GB
Specify 6 -> 8GB * 6 = MEMLIMIT = 48GB
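The table above amounts to rounding the required memory up to a multiple of 8GB. As a sketch, the hypothetical function below computes the value for -n from the memory you need; the defaults (8GB per slot, TASKLIMIT 6) are the s queue values from this example and would differ for other queues.

```shell
# slots_for_memory NEED_GB [PER_SLOT_GB] [TASKLIMIT]
# Print the smallest -n value whose MEMLIMIT covers NEED_GB.
# Defaults follow the s queue example: 8GB per slot, TASKLIMIT 6.
slots_for_memory() {
    local need_gb=$1 per_slot_gb=${2:-8} task_limit=${3:-6}
    local n=$(( ($need_gb + $per_slot_gb - 1) / $per_slot_gb ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    if [ "$n" -gt "$task_limit" ]; then
        echo "needs $n slots, above TASKLIMIT $task_limit" >&2
        return 1
    fi
    echo "$n"
}

# Usage: a job that needs about 20GB in the s queue
#   bsub -q s -n "$(slots_for_memory 20)" testjobs
```

For 20GB the function prints 3, which corresponds to MEMLIMIT = 24GB in the table.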

MPI Job Execution

Job Submission

In the current computation system, it is possible to submit jobs that use MPI to LSF.
Regarding how to compile MPI applications, please refer to MPI Compilation.

Steps to run MPI jobs.

A job script file is needed to run MPI jobs.
The arguments for the bsub command can be described in the job script.
The following shows how to write them in a script.

Creating a job script

Set the environment with the module command.
OpenMPI differs depending on the compiler, so choose the compiler that was used to build your program.
Then run the program with the mpirun command.

［Example］

#!/bin/bash
##---- for bsub options
#BSUB -n 24 #slot number
#BSUB -q p  #execution queue
##---- environment setting
module load intel/2024
##---- execution
mpirun -np 24 ./a.out

Job submission

$> bsub < job script↓

［Example］

$> cat mpijob_intel.sh  ↓ 
#!/bin/bash
#BSUB -n 24
#BSUB -q p 
module load intel/2024
mpirun -np 24 ./a.out 
$>
$> bsub < mpijob_intel.sh↓

Limits of queues dedicated to parallel (MPI) jobs

MPI jobs must be submitted to the dedicated queue p. When executing bsub, always specify "-q p".

For the queue p, TASKLIMIT (the limit on the number of tasks used per job) is set to 64. Thus, parallel jobs using more than 64 processes cannot be submitted.
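A job script can be checked against this limit before submission. The sketch below is a hypothetical helper (check_mpi_script is not an LSF command). It assumes, as in the example script above, that the value of "#BSUB -n" is meant to equal the value of "mpirun -np", and it uses 64 as the default limit.

```shell
# check_mpi_script FILE [TASKLIMIT]
# Compare "#BSUB -n" with "mpirun -np" and with TASKLIMIT (default 64).
check_mpi_script() {
    local script=$1 limit=${2:-64} slots np
    slots=$(sed -n 's/^#BSUB[[:space:]]*-n[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    np=$(sed -n 's/.*mpirun.*-np[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    if [ -z "$slots" ] || [ -z "$np" ]; then
        echo "could not find -n or -np in $script" >&2
        return 2
    fi
    if [ "$slots" -gt "$limit" ]; then
        echo "-n $slots exceeds TASKLIMIT $limit" >&2
        return 1
    fi
    if [ "$slots" -ne "$np" ]; then
        echo "-n $slots and -np $np differ" >&2
        return 1
    fi
    echo "ok: $slots slots"
}

# Usage:
#   check_mpi_script mpijob_intel.sh && bsub < mpijob_intel.sh
```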

Your MPI job may fail with the following errors and exit immediately after submission. They indicate a temporary conflict with a system restriction. Please resubmit the MPI job when you get them.

ipath_userinit: assign_context command failed: Network is down
can't open /dev/ipath, network down (err=26)

ipath_userinit: assign_context command failed: Invalid argument
Driver initialization failure on /dev/ipath (err=23)
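Whether a failed job hit one of these temporary errors can be checked by searching its output file. This is a sketch under the assumption that the messages appear in the job's output file exactly as shown above; the function name is hypothetical.

```shell
# is_transient_mpi_error FILE: succeed if FILE contains one of the
# temporary ipath errors listed above.
is_transient_mpi_error() {
    grep -q -e 'assign_context command failed' \
            -e "can't open /dev/ipath" \
            -e 'Driver initialization failure on /dev/ipath' "$1"
}

# Usage:
#   if is_transient_mpi_error lsflog/result.log; then
#       bsub < mpijob_intel.sh
#   fi
```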

Selecting the Execution Nodes for a Parallel (MPI) Job

The KEK environment has been configured to assign job slots to MPI jobs as explained below. First, slots from the first node are assigned to the job. If the job requires more than the maximum number of slots per node, which is 192, slots from other nodes are also used.

Example) Case where a 12-process parallel job is submitted:

node01 ... 12 processes are executed
node02 ... No process 
node03 ... No process
node04 ... No process
       ...
node12 ... No process

To have each process of the 12 process MPI job dispatched on 1 slot per node, add the option -R span[ptile=1] to the bsub command. It can be specified by the argument or in the job script.

[by argument]

$> bsub -R "span[ptile=1]" < job script↓

[specified in the job script]
Add the following line to the job script.

#BSUB -R span[ptile=1]

With this option, it is possible to have the MPI job dispatched as below.

node01 ... 1 process is executed
node02 ... 1 process is executed
node03 ... 1 process is executed
node04 ... 1 process is executed
       ...
node12 ... 1 process is executed
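More generally, with span[ptile=N] at most N slots are taken on each node, so the number of nodes is the process count divided by N, rounded up. A small sketch of this arithmetic (the function name nodes_needed is an assumption for illustration):

```shell
# nodes_needed PROCESSES PTILE: number of nodes used when at most
# PTILE slots are taken on each node (rounded up).
nodes_needed() {
    local procs=$1 ptile=$2
    echo $(( ($procs + $ptile - 1) / $ptile ))
}

# nodes_needed 12 1    -> 12 nodes, one process each (the case above)
# nodes_needed 12 4    -> 3 nodes, four processes each
```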

Using LSF Session Scheduler

About LSF Session Scheduler

You can use LSF Session Scheduler (LSF SS) if you submit many jobs which satisfy the following conditions:

short RUNTIME (shorter than 30 seconds)
do not use much CPU resources

LSF SS runs such jobs efficiently.

Preparation

To use LSF SS, you have to create a task file. This file lists the commands you want to execute in your job.

Example:my.task
# cat my.task
hostname
date
/home/xxx/user/script.sh

If you want to get the output of each command, use the -o (standard output) or -e (standard error) options. For LSF SS, the output file in $HOME/.lsf does not contain the output of the executed commands.

Example:my.task2
# cat my.task2
-o sample.out -e sample.err hostname

In the above sample, sample.out contains the standard output and sample.err contains the standard error.

In addition, you can use following parameters.

%J : Job ID
%T : task ID

Job ID is the JOBID of the executed job, and task ID is the line number in the task file. For example, you get four files, 10000.1.out, 10000.1.err, 10000.2.out and 10000.2.err, if you submit the following task file.

Example:my.task3
# cat my.task3
-o %J.%T.out -e %J.%T.err hostname
-o %J.%T.out -e %J.%T.err date

Job Submission

The procedure to submit an LSF SS job is shown below. You have to use the s queue to use LSF SS.

$ bsub -app ssched <LSF option> ssched -tasks <task file>

Example: $ bsub -app ssched -q s ssched -tasks my.task3

You can check the job status or cancel jobs in the same way as for a normal LSF job.

Others

If you submit an LSF SS job to a queue other than s, the following errors are shown.

LSF is Rejecting your job submission...
ssched is available for only s queue. Job not submitted.

Additional Warnings

/tmp directory

/tmp is the location where basic system applications create temporary files. Please refrain from creating large files or a large number of files in /tmp yourself. If the usage rate of the /tmp directory becomes high, large files are deleted.

File staging during LSF jobs

Submitting many jobs that read files which are not staged (which do not exist on the GHI disk) makes the KEKCC system inefficient.
If you can determine which files your jobs use, please stage them in advance by referring to "Staging files".

¹ For the values of these limits, see: Batch Queue List
² The current STDOUT output size limit is 200MB.
³ The current SUSPEND time limit is 3 hours.
⁴ To check the status of a job with the bacct command:
$> bacct -l JOBID↓