LSF Queues and Policy

Minerva uses LSF (Load Sharing Facility) to schedule jobs. This section gives details of the queues and scheduling policies on Minerva.

Queues


To check the available queues, type "bqueues".
To get more details about a specific queue, type "bqueues -l premium" (replace "premium" with another queue name).
Minerva Queue:

Queue Description                                                   Max Walltime  Defaults (Mem/core | Walltime)
------------------------------------------------------------------  ------------  ------------------------------
Jobs whose resources are to be charged against an allocation        144 hrs       2000MB | 5hrs
Jobs requesting high priority. Charged at 150% of alloc rate        144 hrs       2000MB | 5hrs
Jobs using allocation that require less than 2 hours walltime       6 hrs         2000MB | 1hr
Jobs running in interactive mode                                    12 hrs        2000MB | 2hrs
Jobs using dedicated resources                                      144 hrs       2000MB | 5hrs
Any other queues are for testing by the Scientific Computing Group  N/A           N/A

Chimera Queue:

The default memory per core is set to 3000MB for all queues.
Queue Description                                                                     Max Walltime  Available Resources
------------------------------------------------------------------------------------  ------------  -------------------------
Jobs requesting high priority with APS doubled as 200. Charged at 150% of alloc rate  144 hrs       200 nodes + 2 himem nodes
Jobs requiring less than 12 hours walltime                                            12 hrs        280 nodes
Jobs running in interactive mode                                                      12 hrs        4 nodes + 1 GPU node
Jobs requiring more than 144 hours walltime                                           2 weeks       2 himem nodes
Jobs running on GPU nodes                                                             144 hrs       44 V100
Jobs using dedicated resources                                                        unlimited     private nodes

----
* Jobs that run against an allocation should be submitted to the "alloc" queues. These jobs must specify an appropriate account name with the "-P" flag.
Please see the JOB EXECUTION section for example scripts.

* Jobs submitted to the low queue DO require an allocated account name. The account name "scavenger" can no longer be used.

* If a job is submitted to an alloc queue without a valid account, the job will be terminated. A message similar to:


Cannot open your job file: /tmp/98298349374.9849837
TERM_ADMIN: job killed by root or an administrator.
Exited with signal termination: Interrupt.

will appear in the output. You can verify that the termination was caused by an invalid account name by running the command:
bhist -l
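
The two checks described above can be sketched as small shell helpers. This is only an illustration: the function names has_account_flag and was_admin_killed are hypothetical, the sample job script is made up for the example, and "-q" is the standard bsub queue option rather than something taken from this page. The account name acc_xxx is the placeholder used elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the job script contains a
# "#BSUB -P <account>" line.
has_account_flag() {
    grep -Eq '^#BSUB[[:space:]]+-P[[:space:]]+[^[:space:]]+' "$1"
}

# Hypothetical helper: succeed if an LSF output file reports that the
# job was killed administratively (the TERM_ADMIN message shown above).
was_admin_killed() {
    grep -q 'TERM_ADMIN' "$1"
}

# Write a sample job script to a temporary file (illustrative content only).
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/bash
#BSUB -P acc_xxx
#BSUB -q premium
#BSUB -n 1
#BSUB -o %J.stdout
sleep 60
EOF

# Check the script before submitting it.
if has_account_flag "$job"; then
    echo "account flag found"
else
    echo "missing #BSUB -P line" >&2
fi
```

Running a check like has_account_flag before submission is one way to catch a missing account early, instead of finding the TERM_ADMIN message in the output afterwards.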

Policies

LSF is configured to do “first fit” backfill. Backfilling allows smaller, shorter jobs to use otherwise idle resources.
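
To illustrate the idea of "first fit", here is a toy model, not LSF's actual scheduler logic. It assumes an idle slot described by a number of free cores and free minutes, and a queue listed in priority order as "jobid cores minutes" lines. The function name first_fit_backfill and the input format are hypothetical.

```shell
#!/bin/sh
# Toy first-fit selection: read "jobid cores minutes" lines in queue order
# from stdin and print the first job that fits into the idle slot.
# Usage: first_fit_backfill FREE_CORES FREE_MINUTES
first_fit_backfill() {
    awk -v c="$1" -v t="$2" '
        $2 <= c && $3 <= t { print $1; found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Slot with 8 free cores for 60 minutes: job A is too large, so B is chosen,
# even though the later job C would also fit.
printf 'A 16 600\nB 4 30\nC 2 10\n' | first_fit_backfill 8 60
```

The point of the sketch is that first fit takes the first queued job that fits the gap, rather than searching for the job that fills it best.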

In certain special cases, the priority of a job may be manually increased upon request. To request a priority change, contact MSSM Scientific Computing Support at hpchelp@hpc.mssm.edu. Please include the job ID and the reason for the request.

Allocation

To check available allocations in seconds:


[minerva4 ~]$ mybalance
Balance     Name      
----------- --------- 

7921453013 acc_xxx  


To check available allocations in hours:


[minerva4 ~]$ mybalance -h
Balance     Name      
----------- --------- 

2200403.61 acc_xxx 
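
The two sample balances above are consistent with a plain seconds-to-hours conversion (division by 3600). A minimal sketch of that arithmetic, assuming this is how the "-h" value is derived; the helper name seconds_to_hours is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: convert a balance in seconds to hours (two decimals).
seconds_to_hours() {
    awk -v s="$1" 'BEGIN { printf "%.2f\n", s / 3600 }'
}

seconds_to_hours 7921453013   # prints 2200403.61
```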


To see account description and account users:


[minerva4 ~]$ glsaccount -p acc_xxx
Name     Active Users          Organization       Description
-------- ------ -------------- ------------------ -------------------
acc_xxx  True   user01,user07  Organization_name  Account description


To check allocated funds and available funds per account and percent utilization:


[minerva4 ~]$ gbalance -p acc_xxx
Id Name     Available  Allocated  PercentUsed 
-- -------- ---------- ---------- ----------- 
* acc_xxx  7921453013 8100000000 2.20   
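
The PercentUsed value in the sample output matches (Allocated - Available) / Allocated x 100. A minimal sketch of that calculation, assuming this is how the column is computed; the helper name percent_used is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: percent of an allocation already used.
# Usage: percent_used AVAILABLE ALLOCATED
percent_used() {
    awk -v avail="$1" -v alloc="$2" \
        'BEGIN { printf "%.2f\n", (alloc - avail) / alloc * 100 }'
}

percent_used 7921453013 8100000000   # prints 2.20
```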


To check the usage statement (run "gstatement --man" for the available options):


[minerva4 ~]$ gstatement -p acc_xxx


How do I check the CPU utilization for my LSF jobs?

If your LSF jobs run out of wall time, you can calculate the CPU utilization to help determine whether your jobs are really short on time, or if the CPU has been idle for most of the wall time waiting for I/O on the file system.

In your LSF output file*, there will be reports on “CPU time” and “Run time”.
*The output file is the one you specified with #BSUB -o %J.stdout

To calculate the CPU utilization, you can use:

CPU utilization = CPU time / (n × Run time) × 100%

n is the number of CPU cores used by your LSF job, and is usually the value you specified with “#BSUB -n” (for example, “#BSUB -n 1”).

For example, job ID 1045205 wrote its LSF output to 1045205.stdout.
1. If your output file is large, you can use grep to find the CPU time and Run time as follows:

$ grep "CPU time :" 1045205.stdout

CPU time : 0.28 sec.

$ grep "Run time :" 1045205.stdout

Run time : 121 sec.

CPU utilization = 0.28 / (1 × 121) × 100% = 0.23% (this job does nothing but sleep)

2. If you want to see the resource usage summary, you can open your output file and find the following:

Resource usage summary:

CPU time : 0.28 sec.

Max Memory : 5 MB
Average Memory : 4.76 MB
Total Requested Memory : 100.00 MB
Delta Memory : 95.00 MB
Max Swap : -
Max Processes : 4
Max Threads : 5

Run time : 121 sec.

Turnaround time : 119 sec.

General guideline: for alignment with BWA or STAR, the CPU utilization should be around 90%. If the CPU utilization is unexpectedly low, please report it to us at hpchelp@hpc.mssm.edu.
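
The calculation above can be scripted. The following is a minimal sketch, assuming the output file contains "CPU time :" and "Run time :" lines in the form shown on this page; the function name cpu_util is hypothetical, and the core count must be supplied by you (the value given to #BSUB -n).

```shell
#!/bin/sh
# Hypothetical helper: print CPU utilization (percent, two decimals)
# from an LSF output file.
# Usage: cpu_util OUTPUT_FILE NUM_CORES
cpu_util() {
    awk -F: -v n="$2" '
        /CPU time :/ { cpu = $2 + 0 }   # numeric part of " 0.28 sec."
        /Run time :/ { run = $2 + 0 }   # numeric part of " 121 sec."
        END {
            if (run > 0 && n > 0)
                printf "%.2f\n", cpu / (n * run) * 100
            else
                exit 1                  # values missing or invalid
        }
    ' "$1"
}

# Example call for the job discussed above (one core):
#   cpu_util 1045205.stdout 1
```

For the sample job above (CPU time 0.28 sec., Run time 121 sec., one core) this gives 0.23, in line with a job that spends nearly all of its wall time idle.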
