Job Execution

Overview

When a batch job is submitted from a work server, scheduling is handled by LSF (Platform Load Sharing Facility, developed by IBM), which dispatches and executes the job on a calculation server. When submitting a job, choose an appropriate queue.

※See here for information on batch queues for workgroups.

Job Submission

Notes

If you submit more than 100 jobs whose run times are very short (less than 30 seconds), merge them into a single shell script before submitting to LSF. (Alternatively, you can use the LSF Session Scheduler if you submit the jobs to the s queue.) If many submitted jobs have run times shorter than the processing interval of the LSF manager service, they have a significant impact on system performance.
When memory usage on a server exceeds 90%, processes using a large amount of memory may be forcibly terminated to protect the system. We appreciate your understanding.
If the free space of your home directory falls below 100MB, job submission is temporarily disabled to prevent capacity depletion and job failures.
You can submit jobs again once the free space exceeds 100MB.
Please check the status of your home directory with the disk quota command.
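As a sketch of the first note above: instead of submitting 100 short jobs individually, the tasks can be wrapped in one shell script and submitted as a single job. The echo below is a placeholder for your actual short-running command.

```shell
#!/bin/bash
# merged_jobs.sh -- run many short tasks inside a single LSF job
# instead of submitting each one with its own bsub call.
for i in $(seq 1 100); do
    # Replace this echo with your actual short-running command.
    echo "running task $i"
done
```

A script like this can then be submitted once, e.g. `bsub -q s ./merged_jobs.sh`, instead of issuing 100 separate bsub calls.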

Submit a Job

To submit a job into LSF, run the following command.

 $> bsub [-q queue_name] [-m "hostname"] jobname

[Example]

 $> bsub -q s testjobs

※ Normally, it is not necessary to specify a host name (i.e. the host where the job is executed) when submitting a job. LSF manages the selection of appropriate calculation servers.

※ You can't submit a job from calculation servers.

This system also supports jobs using OpenMPI/OpenMP.
See Executing MPI Jobs or Executing OpenMP Jobs.

Standard Output/Standard Error

The HSM area is not suitable for job output.
If you do either of the following:

・Specify output to the HSM area
・Execute bsub from a directory in the HSM area

bsub will fail with the following messages:

--------------------------------------------------------- 
HSM area is not suitable to output logfiles from jobs.
It may cause a system crash.
Would you check your jobs and job-submission environment?
See: http://kekcc.kek.jp/service/kekcc/html/Eng/JobExecution.html#sa89ac76
Thank you for your understanding and cooperation.
--------------------------------------------------------- 

Default

By default, job output is written to a file and no email is sent.
The output file is "Job-ID.out" under the .lsf directory in your home directory.

Example: Group: CE, User: testuser1, Job ID: 111
    /home/ce/testuser1/.lsf/111.out

File Specification

You can specify a file name or a directory for the output file with the "-o" option. When only a directory is specified, the output file name is "Job-ID.out."

 $> bsub -o "filename or directoryname" jobname

[Example]

 $> bsub -o lsflog/ tetestjobs

※ The output file is created under lsflog, which is a sub-directory of the current directory. The file name is "Job-ID.out".

[Example2]

 $> bsub -o lsflog/result.log tetestjobs2

※ The output file is created under lsflog, which is a sub-directory of the current directory. The file name is "result.log".

In case the specified output file already exists

If the same output file name is specified for several different jobs, the results are appended to the end of the existing file.

Sending by email

To send the standard output and standard error of jobs by email, indicate an email address as shown below.

 $> bsub -u email_address jobname

[Example]

 $> bsub -u testuser01@post.kek.jp testjobs

Job Status

You can check the job status with the bjobs command.

 $> bjobs jobID

[Example]

 $> bjobs 111

Once your jobs have started, you can check their status with the "/usr/local/bin/chk_runjob" command.
※ This information is updated once per minute.

$> chk_runjob
 JOBS  SLOT    CPUTIME    RUNTIME  CPU/RUN
 1500  1500  125649261  126740225    0.991

You can list your jobs in order of STARTTIME with the "-l NUM" option, where NUM is the number of jobs to list (default: show all).

$> chk_runjob -l 5
 JOBS  SLOT    CPUTIME    RUNTIME  CPU/RUN
 1500  1500  125649261  126740225    0.991

RUN JOB SUMMARY:
JOBID     QUEUE      EXEC_HOST    STARTTIME       CPUTIME    RUNTIME     UTIL    EXEC_JOB
30351511  l          cb241        03/31-09:22:21  023:22:00  023:31:09   99.35%  java
30351564  l          cb241        03/31-09:22:42  023:21:36  023:30:48   99.35%  java
30351774  l          cb241        03/31-09:23:50  023:18:59  023:29:40   99.24%  java
30351978  l          cb241        03/31-09:25:03  023:19:30  023:28:27   99.36%  java
30352009  l          cb241        03/31-09:25:23  023:16:06  023:28:07   99.15%  java

  You can get details of your jobs with the "bjobs -l JOBID" command.

In the following cases, execution efficiency becomes very low and CPU/RUN also falls:

o Mass or continuous submission of extremely short jobs (10-30 seconds or less).
o Submission of jobs that cannot run successfully (e.g. a command mistake).
o Too many jobs writing to one directory.
o Too many jobs reading one file concurrently.
o Too many files in $HOME/.lsf/.
o Home directory is full. --> check with the "rquota", "bquota" (Belle users only), or "hquota" command.

If "CPU/RUN < 0.6", please investigate the cause and improve the utilization efficiency.

Also, you can monitor the standard output during the job execution with the bpeek command.

 $> bpeek jobID

[Example]

 $> bpeek 111

You can get information on jobs with the "bhist -l" command after the job has finished.
If you get the message "No matching job found", you can add the option "-n 0" to look in all the previous LSF logs.

 $> bhist -l jobID

[Example]

 $> bhist -l 111

You can get information on jobs also with the "bacct -l" command.

 $> bacct -l jobID

[Example]

 $> bacct -l 111

Host Status

You can display load information for hosts using the lsload command.

 $> lsload

[Example]

 $> lsload
HOST_NAME  status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp  mem
cb001          ok   0.1   0.2   0.0   0%   6.4   0     2 9016M 2047M  43G
cb002          ok   0.2   0.0   0.0   0%  10.2   0     2 9016M 2047M  43G
cb003          ok   0.3   0.0   0.0   0%   8.6   0     2 9016M 2047M  43G
.....

You can check the status of each host using the bhosts command.

 $> bhosts

[Example]

 $> bhosts
HOST_NAME   STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
cb001       ok              -     12      0      0      0      0      0
cb002       ok              -     12      0      0      0      0      0
cb003       ok              -     12      0      0      0      0      0
cb004       ok              -     12      0      0      0      0      0
cb005       ok              -     12      0      0      0      0      0
.....

Cancellation

You can cancel jobs using the bkill command.

 $> bkill jobID

If you specify "0" instead of a job ID, all of your submitted jobs will be canceled.

[Example]

 $> bkill 111

If bkill fails to kill a job, the "-r" option can be used to force termination. This option applies to situations such as:

  • When the bkill command is executed, the message "Job <JOBID> is being terminated" appears
  • The bjobs command keeps showing a job that should have been terminated

 $> bkill -r jobID

[Example]

 $> bkill -r 111

Termination of jobs by the system

  • Jobs will be terminated by the system when they exceed the following thresholds.

 Reason of termination                                | Message in bacct *4
 Reached maximum normalized execution time *1         | Completed <exit>; TERM_RUNLIMIT: job killed after reaching LSF run time limit.
 Reached maximum normalized CPU time *1               | Completed <exit>; TERM_CPULIMIT: job killed after reaching LSF CPU usage limit.
 Reached maximum resident set size (RSS) of a job *1  | Completed <exit>; TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
 Reached maximum virtual memory of a job *1           | Completed <exit>; TERM_SWAPLIMIT: job killed after reaching LSF memory usage limit.
 Reached maximum processes of a job *1                | Completed <exit>; TERM_PROCESSLIMIT: job killed after reaching LSF process limit.
 Reached maximum STDOUT output size *2                | Completed <exit>; TERM_FORCE_ADMIN: job killed by root or LSF administrator without time for cleanup.
 Reached maximum SUSPEND time *3                      | Completed <exit>; TERM_FORCE_ADMIN: job killed by root or LSF administrator without time for cleanup.

 *1 For the values of these limits, see: Batch Queue List
 *2 The current STDOUT output size limit is 200MB.
 *3 The current SUSPEND time limit is 3 hours.
 *4 How to run the bacct command to check the status of a job:

 $> bacct -l JOBID

Scheduling Policies

LSF supports the "Fairshare" function in order to provide fair access to resources. Fairshare determines the priority of dispatched jobs. The dynamic priority is calculated as shown below.

Dynamic priority = number_shares /
         ( cpu_time * CPU_TIME_FACTOR +
         run_time * RUN_TIME_FACTOR +
         (1 + job_slots) * RUN_JOB_FACTOR +
         (run_time - cpu_time) * CPU_JOB_FACTOR )

  • number_shares : the number of shares assigned to the user.
  • cpu_time : the cumulative CPU time used by the user (measured in hours).
  • run_time : the total run time of running jobs (measured in hours).
  • job_slots : the number of job slots reserved and in use.
  • CPU_TIME_FACTOR : the CPU time weighting factor. (default: 0.0)
  • RUN_TIME_FACTOR : the run time weighting factor. (default: 0.35)
  • RUN_JOB_FACTOR : the job slots weighting factor. (default: 3)
  • CPU_JOB_FACTOR : the CPU utilization efficiency factor. (default: 1)
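As a numeric sketch of the formula, assuming an idle user with one share, no accumulated CPU or run time, and no job slots in use (and the default factors listed above):

```shell
#!/bin/bash
# Evaluate the dynamic-priority formula for a hypothetical idle user.
# This only illustrates the arithmetic; it is not the scheduler itself.
awk 'BEGIN {
    number_shares = 1
    cpu_time  = 0     # hours
    run_time  = 0     # hours
    job_slots = 0
    CPU_TIME_FACTOR = 0.0
    RUN_TIME_FACTOR = 0.35
    RUN_JOB_FACTOR  = 3
    CPU_JOB_FACTOR  = 1
    priority = number_shares / (cpu_time * CPU_TIME_FACTOR \
             + run_time * RUN_TIME_FACTOR \
             + (1 + job_slots) * RUN_JOB_FACTOR \
             + (run_time - cpu_time) * CPU_JOB_FACTOR)
    printf "%.3f\n", priority    # prints 0.333
}'
```

This 0.333 matches the PRIORITY shown for users with no running jobs in the bqueues output below.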

The following command shows the current Fairshare values.

 $> bqueues -l queue_name

[Example]

 $> bqueues -l s
SHARE_INFO_FOR: s/
 USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME
 user-a          1       0.333      0        0         0.0        0
 user-b          1       0.333      0        0         0.0        0
 user-c          1       0.333      0        0         0.0        0
 user-d          1       0.012      1        0       872.8   399993
 user-e          1       0.008      1        0         0.0   574239
 user-f          1       0.005      2        0      2588.0  1003056
 user-g          1       0.004     31        0     36821.6   667907
 user-h          1       0.003     98        0     94953.1   432167
 user-i          1       0.001     17        0     17307.1  3447528
 user-j          1       0.000     27        0     23340.3 12779340

Guaranteed Resource Allocation (GRA)

The GRA (Guaranteed Resource Allocation) policy is applied to the a queue.
Jobs in this queue are dispatched in preference to other queues, but
the number of running jobs per user is limited to 2 (JL/U).
This queue is useful for jobs that need quick results when job
dispatch in the normal queues involves a long wait.
Note that the queue parameters are otherwise the same as those of the l queue.

MPI Job Execution

Job Submission

In the current computation system, jobs using OpenMPI/IntelMPI can be submitted to LSF.
Regarding how to compile MPI applications, please refer to MPI Compilation.

Steps to run MPI jobs

A job script file is needed to run MPI jobs.
The arguments for the bsub command can be described in the job script.
The way to describe them in a script is shown here.

Creating a job script

Set the environment with the module command.
OpenMPI varies depending on the compiler, so choose the module matching the compiler used for the build.
Then run the program with the mpirun command.

[Example/IntelMPI]

#!/bin/bash 
##---- for bsub options
#BSUB -n 24  #slot number  
#BSUB -q p   #execution queue 
##---- environment setting
module load intel 
##---- execution
mpirun -np 24 ./a.out  

[Example/OpenMPI(gnu)]

#!/bin/bash 
##---- for bsub options 
#BSUB -n 24 #slot number 
#BSUB -q p  #execution queue 
##---- environment setting
module load openmpi/1.10.2-gcc 
##---- execution 
mpirun -np 24 ./a.out  

Job submission

 $> bsub < job script

[Example]

 $> cat mpijob_intel.sh
 #!/bin/bash 
 #BSUB -n 24  
 #BSUB -q p 
 module load intel 
 mpirun -np 24 ./a.out 
 $>
 $> bsub < mpijob_intel.sh

Limits of queues dedicated to parallel (MPI) jobs

MPI jobs must be submitted to the dedicated queues, p or px. When executing bsub, always specify either "-q p" or "-q px".

For the queues p and px, PROCLIMIT (Limit on the number of slots used per job) is set to 24. Thus, parallel jobs using more than 24 processes cannot be submitted.

Your MPI job may produce the following errors and exit immediately after submission. They indicate a temporary conflict with a system restriction. Please re-submit the MPI job when you encounter them.

ipath_userinit: assign_context command failed: Network is down
can't open /dev/ipath, network down (err=26)
ipath_userinit: assign_context command failed: Invalid argument
Driver initialization failure on /dev/ipath (err=23)

Selecting the Execution Nodes for a Parallel (MPI) Job

The KEK environment is configured to assign job slots to MPI jobs as explained below. First, slots from the first node are assigned to the job; if the job requires more than the maximum number of slots per node (12), slots from other nodes are also used.

Example) Case where a 12 process parallel job is submitted:

node01 ... 12 processes are executed
node02 ... No process 
node03 ... No process
node04 ... No process
       ...
node12 ... No process

To have each process of an MPI job dispatched with one slot per node, add the option -R span[ptile=1] to the bsub command.
It can be specified as a bsub argument or in the job script.

[by argument]

 $> bsub -R span[ptile=1] < job script

[specified in the job script]
Describe the following line in the job script.

 #BSUB -R span[ptile=1] 

With this option, it is possible to have the MPI job dispatched as below.

node01 ... 1 process is executed
node02 ... 1 process is executed
node03 ... 1 process is executed
node04 ... 1 process is executed
       ...
node12 ... 1 process is executed

OpenMP Job Execution

Regarding how to compile OpenMP applications, please refer to OpenMP Program Compilation.

OpenMP job submission

OpenMP jobs can use the number of CPU cores specified by bsub -n (default: 1).
The maximum number of cores for OpenMP jobs is 24 (per job).

[Example ( Submitting OpenMP job to use 12 cores )]

 $> bsub -R "span[hosts=1]" -q p -n 12 ./openmp_64
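As a sketch, the same submission can also be written as a job script. Setting OMP_NUM_THREADS from LSB_DJOB_NUMPROC (an environment variable LSF sets to the number of allocated slots) is an assumption here; check your site's recommendation. ./openmp_64 is the program from the example above.

```shell
#!/bin/bash
##---- bsub options
#BSUB -q p                  # execution queue
#BSUB -n 12                 # number of cores
#BSUB -R "span[hosts=1]"    # keep all cores on one host
##---- execution
# LSB_DJOB_NUMPROC is set by LSF to the number of allocated slots;
# fall back to 1 when running the script outside LSF.
export OMP_NUM_THREADS=${LSB_DJOB_NUMPROC:-1}
./openmp_64
```

Submit it with `bsub < jobscript.sh`, as in the MPI examples above.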

Limits of queues dedicated to OpenMP jobs

OpenMP jobs must be submitted to the dedicated queues, p or px. When executing bsub, always specify either "-q p" or "-q px", and also specify "-R "span[hosts=1]"".

Using LSF Session Scheduler

About LSF Session Scheduler

You can use the LSF Session Scheduler (LSF SS) if you submit many jobs that satisfy the following conditions:

  • short RUNTIME (shorter than 30 seconds)
  • do not use much CPU resource

LSF SS makes such jobs run efficiently.

Preparation

To use LSF SS, you have to create a task file. This file is a list of the commands you want to execute in your job.

Example:my.task
# cat my.task
hostname
date
/home/xxx/user/script.sh

If you want to capture the output of each command, please use the -o (standard output) or -e (standard error) options. With LSF SS, the output file under $HOME/.lsf contains no output from the command execution.

Example:my.task2
# cat my.task2
-o sample.out -e sample.err hostname

For above sample, sample.out contains standard output, and sample.err contains standard error.

In addition, you can use following parameters.

%J : Job ID
%T : task ID

Job ID is the JOBID of the executed job, and task ID is the row number in the task file. For example, you get four files (10000.1.out, 10000.1.err, 10000.2.out, 10000.2.err) if you submit the following task file.

Example:my.task3
# cat my.task3
-o %J.%T.out -e %J.%T.err hostname
-o %J.%T.out -e %J.%T.err date

Job Submission

The procedure to submit an LSF SS job is shown below. You have to use the s queue to use LSF SS.

$ bsub -app ssched <LSF options> ssched -tasks <task file>
Example: $ bsub -app ssched -q s ssched -tasks my.task3

You can check job status or cancel jobs in the same way as for normal LSF jobs.

Others

If you submit an LSF SS job to a queue other than the s queue, you get the following errors:

LSF is Rejecting your job submission...
ssched is available for only s queue. Job not submitted.

Additional Warnings

/tmp directory

/tmp is intended for basic system applications to create temporary files.
Please refrain from creating large files or a large number of files in /tmp yourself.
If the usage rate of the /tmp directory becomes high, large files will be deleted.

File staging during LSF jobs

Submitting many jobs that read unstaged files (files not present on the GHI disk) makes the KEKCC system inefficient.
If you can determine which files your jobs will use, please stage the files as described in "Staging files", and then submit jobs as described in Submit a Job.


Last-modified: 2019-03-07 (Thu) 16:00:52