Backfill and Checkpoints

Backfill

Backfill is a new partition on Discovery. It can use all of the available nodes, discovery-c[1-33, 36-38] and discovery-g[1-13, 16]. The partition has a maximum wall time of 14 days and 2 hours. It also has the lowest priority, so a job may be paused multiple times, or indefinitely, depending on the demand from higher-priority jobs.

When to use Backfill?

Use the backfill partition when your computation requires more resources than are currently available on the normal partition.

How to use Backfill?

In your batch script you need to set the --partition parameter to backfill to use the backfill partition.

#SBATCH --partition backfill

Backfill Drawbacks and Fixes

Drawbacks

If your job is running on a node owned by a research group and a member of that group submits a job to the same node, your job will be paused and requeued. Because you are not a member of the group that owns the node, its members' jobs are given higher priority than yours.

How to secure the progress of your job when it’s paused

When your job is paused in the backfill partition, it is restarted as soon as the higher-priority jobs have finished. However, the restarted job starts from the beginning.

To avoid starting over from the beginning, checkpoint your code or script after every significant computation. This lets your job continue from the last checkpoint (the computation it had completed before being paused) whenever it is restarted.

The next section explains what checkpointing is and how to implement it.

Introduction to Checkpoint

Checkpoint

In lay terms, checkpointing means saving your work after a large amount of it has been done, so that it can be reused later.

From the HPC standpoint, checkpointing means saving the state of your job after every heavy computation, so that the job can resume from that point rather than start all over again.

Why Checkpoint?

It is good practice to add a checkpoint at the end of each section of your code that performs a heavy computation. If your job is paused or restarted for any reason (for example, an interruption or preemption on the backfill partition), you will not have to spend time redoing the computation that was completed before the interruption.

Goals of Checkpoint

People often use text editors to write text and scripts. Suppose you have been typing in a text editor for over an hour without saving and your computer suddenly freezes or restarts. You are likely to lose the whole hour of writing and have to start all over again.

Now assume that you were saving your work every 5 minutes. In the same scenario, you would lose at most 5 minutes of work.

A checkpoint, then, is simply a save of your work at certain intervals.

Benefits of using Checkpoint

Coping with node failure
Debugging
Monitoring
To fit in time constraints
Preemption in case of shared resources

How it works

The idea of checkpointing is to save the state of a program every time a checkpoint is reached, and to restart from the most recent saved state after an unplanned interruption rather than from the beginning.

Checkpointing can slow down the execution of your program, because each checkpoint writes to disk. It is still good practice to add checkpoints after every heavy computation.
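
One way to limit this overhead is to checkpoint on a time interval instead of after every iteration. The sketch below is a minimal illustration under stated assumptions, not a recommended implementation: the function name run, the file name state.json, the default interval, and the summing loop that stands in for the heavy computation are all hypothetical choices made for this example.

```python
import json
import os
import time


def run(n_steps, state_path="state.json", interval=60.0):
    """Sum the integers 0..n_steps-1, saving state at most every `interval` seconds.

    Returns (total, number_of_checkpoints_written_during_this_call).
    All names here are hypothetical and chosen for illustration only.
    """
    # Restore the loop position and running total if a checkpoint exists.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"next": 0, "total": 0}

    writes = 0
    last_save = time.monotonic()
    for i in range(state["next"], n_steps):
        state["total"] += i      # stand-in for the heavy computation
        state["next"] = i + 1    # the next iteration still to be done
        # Only pay the I/O cost once enough wall time has passed.
        if time.monotonic() - last_save >= interval:
            with open(state_path, "w") as f:
                json.dump(state, f)
            writes += 1
            last_save = time.monotonic()

    # Final save, so that a finished run is not repeated after a requeue.
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state["total"], writes + 1
```

With this approach, an interruption costs at most roughly one interval of work plus the iteration in progress. A longer interval means less time spent writing and more work to redo after a restart.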

Examples

You can make your program checkpoint-able by saving its state in every iteration as well as looking for a state file on startup.

Workflow

Look for a state file.
If found, restore the state, else create the initial state.
Periodically save the state (the required values).

The following examples show the same program with and without checkpointing in several programming languages.

Python
R
C
Octave
Fortran
MATLAB

Without Checkpoint

#!/usr/bin/env python
from time import sleep

start = 1
end = 10

for i in range(start, end):
  #Heavy Computations
  print(i)
  sleep(1)
# End of program

With Checkpoint

#!/usr/bin/env python
from time import sleep

try:
  # Try to recover current state
  with open('state', 'r') as file:
    start = int(file.read())
except (OSError, ValueError):
  # Otherwise bootstrap at 1,
  # i.e. starting from the beginning
  start = 1

end = 10

for i in range(start, end):
  # Heavy computations
  print(i)
  sleep(1)
  # Save current state
  with open('state', 'w') as file:
    file.write(str(i))
# End of program

Explanation

In the Python code with the checkpoint implementation, the try block begins at line 4.

Lines 6–7 open the state file, read the value stored in it, and assign it to the integer variable start. If the file cannot be opened or read, execution moves to the except block, and line 11 initializes start to 1.

At lines 20–21, the current state is saved after the heavy computation. In this case, the state is the current value of i. If the program is restarted at this point, the try block reads that value back into start, and the loop resumes from the iteration that was last saved.

This example saves only the loop counter. In a real job, you also need to save whatever data your computation has produced up to this step.
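
As a sketch of what saving more than the loop counter can look like, the example below stores both the next index and the partial results. It also writes the checkpoint to a temporary file and then renames it, so that an interruption during the write does not leave a truncated state file. This is an illustration under stated assumptions, not a definitive implementation: the names run and save_atomic, the file name state.json, the stop_after parameter, and the squaring step that stands in for the heavy computation are all hypothetical.

```python
import json
import os
import tempfile


def save_atomic(path, state):
    """Write `state` to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # the previous checkpoint stays intact until this rename


def run(end, path="state.json", stop_after=None):
    """Compute the squares of 1..end-1, resuming from `path` if it exists.

    `stop_after` is hypothetical and only simulates an interruption for the demo.
    """
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {"next": 1, "results": []}

    for i in range(state["next"], end):
        state["results"].append(i * i)  # stand-in for the heavy computation
        state["next"] = i + 1           # resume after i, not at i
        save_atomic(path, state)
        if stop_after is not None and i == stop_after:
            return None                 # pretend the job was preempted here

    # Finished: remove the checkpoint so that a new run starts fresh.
    if os.path.exists(path):
        os.remove(path)
    return state["results"]
```

Calling run(6, stop_after=3) and then run(6) resumes at iteration 4 and returns all five squares without recomputing the first three. Saving i + 1 rather than i is a design choice that avoids repeating the last completed iteration after a restart.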

Without Checkpoint

start <- 1
end <- 10

for (i in seq(start, end)) {
  print(i)
  Sys.sleep(1)
}

With Checkpoint

# Try to recover current state (load() returns the object's name, get() its value)
start <- try(get(load(file="state.RData")), silent=TRUE)

# Otherwise bootstrap at 1
# i.e. starting from the beginning
if (inherits(start, "try-error"))
  start <- 1

end <- 10

for (i in seq(start, end)) {
  #Heavy computations
  print(i)
  Sys.sleep(1)

  # Save current values
  save(i, file="state.RData")
}

Explanation

In the R script with the checkpoint implementation, at line 2, the try function attempts to load the saved data from the file state.RData. If the file is not found, R executes lines 6–7 and sets the value of the start variable to 1.

At line 17, the script saves the state after the heavy computation. In this example, only the current value of i is saved. Hence, if the program is restarted at this point, it finds that value and assigns it to start when it executes line 2 (try). Note: to save all objects in the workspace, replace i with list = ls() in the save() function.

Without Checkpoint

#include <stdio.h>
#include <unistd.h>   // sleep()

int main(void){
  int i, start, end;
  start = 1;
  end = 10;
  for (i = start; i <= end; i++)
  {
    printf("%d\n", i);
    sleep(1);
  }
  return 0;
}

With Checkpoint

#include <stdio.h>
#include <unistd.h>   // sleep()
int main(void){
  int i, start, end;
  FILE * file;
  // Try to recover current state
  file = fopen("state", "r");
  if (file)
  {
    if (fscanf(file, "%d", &start) != 1) start = 1;
    fclose(file);

  } else {

    // Otherwise bootstrap at 1
    start = 1;
  }
  end = 10;
  for (i = start; i <= end; i++)
  {
    // Heavy computation
    printf("%d\n", i);
    sleep(1);
    // Save current state
    file = fopen("state", "w");
    if (!file) return 1;  // cannot save the desired data
    fprintf(file, "%d\n", i);
    fclose(file);
  }
}

Explanation

In the C code with the checkpoint implementation, at line 7, the fopen function tries to open the state file for reading. If the file exists, line 10 reads its content into the start variable.

At line 8, if the state file was not found, the statements in the else block are executed instead. In this case, the start variable is set to 1.

At lines 25–28, the state is saved after the heavy computation. (Here it saves only the current value of i, which becomes the value of start after a restart.) Hence, if the program is restarted at this point, it finds the latest value of start when it executes line 10.

Without Checkpoint

the_start = 1;
the_end = 10;
for i = the_start:the_end
  disp(i)
  pause(1)
end

With Checkpoint

% Try to recover current state
try
  the_start = dlmread('state');
catch
  % Otherwise bootstrap at 1
  the_start = 1;
end_try_catch
the_end = 10;
for i = the_start:the_end
  % Heavy computations
  disp(i)
  pause(1)
  % Save current state
  dlmwrite('state', i)
end

Explanation

In the Octave code with the checkpoint implementation, the try block starting at line 2 tries to open the state file and load the data stored in it into the variable the_start. If this succeeds, the catch block is skipped. If the saved file is not found, execution moves to the catch block, which sets the_start to 1.

At line 14, the state of the program is saved after the heavy computation. Here it saves only the current value of i, which becomes the value of the_start after a restart. Hence, if the program is restarted at this point, it finds the latest value of the_start when it executes the try block.

Without Checkpoint

program count
  integer :: i, start, end
  start = 1
  end = 10
  do i = start, end
    ! Heavy Computations
    write(*, '(i2)') i
    call sleep(1)
  end do
end program count

With Checkpoint

program count
  integer :: i, start, end
  integer :: stat
  ! Try to recover current state
  open (1, file='state', status='old', &
    action='READ', iostat=stat)
  if (stat .eq. 0) then
    read(1,*) start
  else
    ! Otherwise bootstrap at 1
    start = 1
  end if
  close(1)
  end = 10
  do i = start, end
    ! Heavy Computations
    write(*, '(i2)') i
    call sleep(1)
    ! Save current state
    open(1, file='state', status='REPLACE', &
      action='WRITE', iostat=stat)
    write(1, '(i2)') i
    close(1)
  end do
end program count

Explanation

In the Fortran code with the checkpoint implementation, at line 5, the open statement tries to open the state file. Line 7 checks whether that succeeded: if the file was found, line 8 reads the value from the file into the start variable. If the file was not found, execution moves to the else block, and line 11 sets the value of start to 1.

At lines 20–23, the program saves the state after the heavy computation. If the program is restarted at this point, it executes line 8, where it reads the latest value from the file and assigns it to the start variable.

Without Checkpoint

s = show(10);
disp(s);

function L=show(i)
    L=zeros;

    for j=1:i
        % Heavy Computation
        L(j) = j;
    end
end

With Checkpoint

s = show(10);
disp(s);

function L=show(i)
    L=zeros(1, i);
    filename = 'state.mat';
    if exist(filename, 'file')
      pp = load(filename);
      L = pp.L;

    else
      for j=1:i
          % Heavy Computation
          L(j) = j;
      end
      save(filename, 'L');
    end
end

Explanation

In the MATLAB code with the checkpoint implementation, line 7 checks whether the saved file exists. If it does, lines 8–9 load the file and assign the stored value to the variable L.

If the saved file is not found, execution moves to the else branch at line 11 and runs the loop at lines 12–15, where line 14 sets each element L(j) to j.

At line 16, the script saves the state after the heavy computation (here it saves the value of L to the state file). If the program is restarted after this point, it finds the saved value of L when it executes lines 7–9.

References

For more explanation and examples, see the PowerPoint presentation Checkpointing from UC Louvain.