BLCR support in GASNet
======================

=====================================================================
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
=====================================================================

GASNet's BLCR integration support is now DEPRECATED.
No further development of this feature is anticipated.
It may be removed in a future release.

=====================================================================
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE
=====================================================================

BLCR is "Berkeley Lab Checkpoint/Restart", a system-level checkpointer for
Linux.  For more information on BLCR, visit http://ftg.lbl.gov/checkpoint

Currently there is support for BLCR-based process checkpoints with the
following GASNet conduits under the proper conditions:
  udp-conduit
  ibv-conduit

Only self-initiated collective checkpoints and collective restarts are
fully supported at this time.

Installation:

  Common to all supported conduit is the need to disable PSHM (intranode
  shared memory), and enable BLCR support:
        --disable-pshm --enable-blcr
  You may also need
        --with-blcr-home=/path/to/install
  if BLCR is not installed in /usr/include and /usr/lib
  For ibv-conduit (InfiniBand) one must also ensure ssh is used for job
  spawning:
        --with-ibv-spawner=ssh
  Once configured as described above, there are no special build or
  installation steps required.

Use:

  One can initiate a checkpoint in GASNet code with the collective call
         int gasnet_all_checkpoint(const char *dir);
  where 'dir' is the name of a directory to contain the checkpoint context
  files, or NULL for the default location (described below).

  If the application processes are multi-threaded then the checkpoint
  initiation call must be made by exactly one thread.

Running:

  Since only self-initiated checkpoints are supported, nothing extra
  is needed when running the application.  However, you may want to
  set one of the GASNET_CHECKPOINT_{DIR,BASEDIR} environment variables
  (see below) if you don't want the default location for checkpoint files.

Restart:

  The instructions differ only slightly by conduit:
    + udp-conduit:
      $ amudprun -restart [checkpoint-directory]
    + ibv-conduit
      $ gasnetrun_ibv -restart [checkpoint-directory]
  For both udp- and ibv-conduit, the GASNET_SSH_SERVERS environment
  variable, which determines the nodes on which to run, will be honored at
  restart and may be different than in the original run.  It is not
  necessary to pass the node count, since that is automatically the same
  as the orginal run.

Rollback:

  Currently only ibv-conduit supports self-initiated rollback.  To initiate
  a rollback of zero or more processes, all process must make the same
  collective call:
         int gasnet_all_rollback(const char *dir);
  where 'dir' gives the directory containing the checkpoint to which the
  caller will be rolled-back.  Any process passing NULL will continue to
  run forward after the call, but must still participate in the call to
  allow GASNet to establish communications with the rolled-back
  processes.  Each caller may specify a different directory (or NULL).
  However, GASNet does not verify that the arguments together define a
  valid recovery line.

Location of checkpoint files:

  In order of precedence (highest to lowest):
  1. If non-NULL, use DIR argument pased to gasnet_all_checkpoint(DIR)
  2. ${GASNET_CHECKPOINT_JOBDIR}/[SEQ]
  3. ${GASNET_CHECKPOINT_BASEDIR}/[GUID]/[SEQ]
  4. ${HOME}/gasnet-checkpoint/[GUID]/[SEQ]
     The GUID value is generated automatically for each run, and SEQ is a
     sequence number (0 for the first checkpoint, 1 for the next, etc.).

Limitations of ibv-conduit:

  For ibv-conduit, only ssh-based spawning has restart support.
  While jobs spawned using mpi or pmi will appear to take valid
  checkpoints, they cannot be restarted from those checkpoints.

Limitations of udp-conduit:

  None currently known.

Future work:

  + AMUDP to use libcr instead of the cr_restart utility.
  + Self-initiated, NON-collective checkpoints
    (to be called by coordination code provided by others).
  + A mechanism like "Job-pause" to provide a trivial coordination
    algorithm for use in the single-process migration scenario
  + A mechanism for internally-initiated restart of a dead process.
  + External initiation of collective checkpoints with a built-in
    coordination algorithm (most likely just sync-and-stop).
