
Checkpoint Restart (BLCR)

First, some restrictions:

  • You can checkpoint programs only. You cannot checkpoint a complete job, only a program within the job.
  • The program you wish to checkpoint must have been enabled for checkpointing at startup. Programs are not enabled by default and cannot be enabled retroactively.
  • All files open at the time of a checkpoint must exist at their original locations for the restart. For a job to be restarted on a node other than the one it was checkpointed on, the files should be in /sc/orga or your $HOME directory, i.e., a file system mounted on all compute nodes. Do not put files in /tmp.
  • Be sure you restart a program on the same type of node as the one it was checkpointed on. E.g., if it was checkpointed on a Manda node, restart it on a Manda node.
  • Checkpointing of MPI programs is no longer supported.
  • Applications that have open sockets or use System V IPC mechanisms, including shared memory, semaphores, and message queues, also cannot be checkpointed (see the sketch after this list for a quick way to check).
  • Multi-process jobs communicating using pipes can be checkpointed.
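
If you are unsure whether a running process holds resources that block checkpointing, standard Linux tools can help. A quick sketch, not part of BLCR itself; <pid> is a placeholder for your process ID:

 # Any output here means the process has open network sockets
 # and likely cannot be checkpointed.
 lsof -a -p <pid> -i

 # List System V IPC objects (shared memory, semaphores, message
 # queues) currently in use on the node.
 ipcs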

Checkpointing

Checkpointing is the process by which the operating system takes a snapshot of a running program, along with its environment, and saves it to a disk file. After a checkpoint, the program can either be left to continue running or be terminated. You can take multiple checkpoints of a running program, typically at a set time interval. The checkpoint file can then be used by the operating system to recreate the environment of the program and allow it to resume execution from the time the checkpoint was taken. However, the environment must be exactly as it was at the time of the checkpoint or else the program will not restart. The most common problems are:

  • a missing or moved file. A file that has been written to after a checkpoint is OK; just don't move it.
  • restarting the program on a different kind of node than the one on which it was originally running

The underlying software is the Berkeley Lab Checkpoint/Restart implementation, often referred to as BLCR checkpointing. The underlying commands are cr_run, cr_checkpoint, and cr_restart. Use man to learn more about these commands and their various options. For all the gory details, visit the Berkeley Lab web site.

You can checkpoint manually, which requires issuing the cr_checkpoint command yourself, or you can let LSF manage the checkpoints. Only the latter method is discussed here.
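
For reference only, a manual BLCR session might look like the following sketch; myprog is a placeholder and only the basic options are shown (see the man pages for the rest):

 # Start the program under checkpoint control and remember its PID.
 cr_run ./myprog &
 PID=$!

 # Write a checkpoint of the running process to a context file.
 # Add --term to terminate the process after the checkpoint.
 cr_checkpoint -f context.$PID $PID

 # Later, recreate the process from the saved context file.
 cr_restart context.$PID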

Checkpointing on Minerva

This can be done with little or no change to a dynamically linked executable intended to run on a single node. The executable can be either a single-threaded application or a multi-threaded SMP application built with OpenMP or pthreads. Your program is dynamically linked if you, or the original author of a downloaded binary, did not use the -static option of the gcc or Intel compilers.

You can check that your program is dynamically linked by using the file command. E.g.:

$ file ckpttest
ckpttest: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped

Now that you know your program can be checkpointed, you can submit an LSF job and specify when to take checkpoints and where to store them. Setting up for checkpointing using LSF involves using bsub's -k option:

-k "checkpoint_dir [init=initial_checkpoint_period] [checkpoint_period] [method=method_name]"
  • checkpoint_dir (required) - folder where checkpoints are to be stored. It will be created if needed.
  • initial_checkpoint_period (optional) - wait this many minutes before taking the first checkpoint.
  • checkpoint_period (optional) - checkpoint every this many minutes after the initial checkpoint is taken.
  • method_name (optional) - if specified, it must be blcr.

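For example, the following directive (with illustrative values) stores checkpoints under my_ckpt, takes the first checkpoint after 60 minutes, and then checkpoints every 240 minutes:

 #BSUB -k "my_ckpt init=60 240 method=blcr"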

Other bsub options are not affected.

If the optional period arguments are not specified, the program will not be checkpointed automatically and the bchkpnt command must be used. bchkpnt can be used to checkpoint the program on demand or to set or change the checkpoint period. See the man page for a complete discussion of the options.
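
For example, assuming a job with ID 123456 (illustrative), you could checkpoint on demand or change the period like this:

 # Take a checkpoint of the job right now.
 bchkpnt 123456

 # Change the checkpoint period to every 60 minutes.
 bchkpnt -p 60 123456

 # Take a final checkpoint and then kill the job.
 bchkpnt -k 123456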

Example

Suppose you have a program that is expected to take two weeks to complete, or perhaps you are not entirely sure how much memory it will need. The duration is longer than the maximum allowed for a job in our queues, and requesting too much memory will delay the start of the job. A solution is to submit the job with checkpointing requested, so that you can restart it after it times out or crashes from insufficient memory.

Submitting an LSF job that uses checkpointing is essentially the same as submitting any other job except that:

  1. The -k option must be specified.
  2. You must start the program execution by using the cr_run command.

So, in this example:

  • The checkpoint files will be written to a sub-folder named chkpt_dir in the submission folder. A checkpoint file can be quite big; its size depends on the size of your running program.
  • The first checkpoint will be made after 1 minute. This quickly checks that the parameters are set up properly.
  • Subsequent checkpoints will be made every 20 hours (1200 minutes) afterwards.
  • The blcr method of checkpointing will be used.
  • There are 4 cores running this program on one node.
  • The job will end after 144 hours. If the job times out, you will lose about 4 hours of compute, the approximate time the program will have continued running after the last checkpoint.
 #!/bin/bash
 #        myJob.lsf
 #BSUB -P acc_hpcstaff
 #BSUB -k "chkpt_dir init=1 1200 method=blcr"
 #BSUB -n 1
 #BSUB -R "affinity[core(4)]"
 #BSUB -oo %J.out
 #BSUB -eo %J.err
 #BSUB -q premium
 #BSUB -W 144:00

 module load gcc

 cr_run ./myLoooooongProgram > test.out 2> test.err

Submit the job:

 [user01@login1 lsf]$ bsub < myJob.lsf
 Job <96610603> is submitted to queue <premium>.

You can now go out and teach a croquet course or compete in a whist tournament.

When you come back, something has happened! The machine crashed, the wall clock time was exceeded, or whatever. Not to worry.

If you look in the submission folder, you will see a subfolder named chkpt_dir and, under that, another subfolder named for the job number; under that are the files needed to restart and keep track of your job. Here, 96610603 is the job number of the checkpointed job. The context.31117 file is the actual checkpoint; 31117 is the PID of the process that was checkpointed.

$ ls chkpt_dir
96610603

$ ls chkpt_dir/96610603/
1516635926.96610603 1516635926.96610603.out chklog
1516635926.96610603.err 1516635926.96610603.shell context.31117

Restart

To restart the job, use the brestart command. (Check the man page for the various options.) The most common options will be discussed below.

brestart [bsub_options] [-f] checkpoint_dir [job_ID | "job_ID[index]"]

The checkpoint_dir and the job_ID are required. checkpoint_dir is the path to the folder specified on the original -k option. It can be relative to the folder from which the brestart command is issued or a fully qualified path name. job_ID is the job number that you want to restart.

You can change or specify certain bsub options on restart (see the man page for the full list). By default, jobs are restarted with the same output file and file transfer specifications, job name, window signal value, checkpoint directory and period, and rerun options as the original job.

Wall clock time, -W, is always required by our system. This would be the wall clock time for the restarted job. For example, specifying

 brestart -W 30:00 .... 

would mean that the job will be restarted and run for 30 more hours.

It is highly recommended that you use the -m option to request the same type of node as was used during the original execution. If the job started on a Manda node, you want to be sure it restarts on a Manda node.

If your job failed because of too little memory, you can use the -M option to increase the amount of memory for the program. Remember that when the program was checkpointed there was enough memory, so the program will restart from that point; when additional memory is requested, the new value will be used.
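
Putting these together for the example job above, a restart command might look like this sketch. The host group name manda and the memory value are illustrative; check with your system administrators for the correct names, and note that the units of -M depend on your cluster's LSF configuration:

 # Restart job 96610603 from ./chkpt_dir with 30 more hours of
 # wall clock time, on the same node type, and with more memory.
 brestart -W 30:00 -m manda -M 40000 chkpt_dir 96610603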
