Backfill and Checkpoints
Backfill
Backfill is a new partition on Discovery. It can use all the available nodes, discovery-c[1-33, 36-38] and discovery-g[1-13, 16]. This partition has a maximum wall time of 14 days and 2 hours. It also has the lowest priority, so jobs may be paused multiple times, or indefinitely, depending on the demand from higher-priority jobs.
When to use Backfill?
Use the backfill partition when your computations require more resources than are currently available on the normal partition.
How to use Backfill?
In your batch script, set the --partition parameter to backfill to use the backfill partition.
#SBATCH --partition backfill
Backfill Drawbacks and Fixes
Drawbacks
If your job is running on a node owned by a research group and a member of that group submits a job to the same node, your job will be paused and requeued. Because you aren't a member of the research group, their jobs are given higher priority than yours.
How to secure the progress of your job when it’s paused
When your job is paused in the backfill partition, it is restarted as soon as the higher-priority jobs have finished executing. However, the restarted job starts from the beginning.
To avoid starting all over again, checkpoint your code or script after every significant computation. This lets your job continue from the last checkpoint (the computation it had completed before being paused) whenever it is restarted.
The next section explains what checkpointing is and how to implement it.
Introduction to Checkpoint
Checkpoint
In lay terms, checkpointing means saving your work after a large amount of effort so that it can be reused later.
From the HPC standpoint, checkpointing means saving your work after every heavy computation, so that the job can resume later rather than start all over again.
Why Checkpoint?
It's good practice to add a checkpoint at the end of each section of your code that performs a heavy computation. If your job is paused or restarted for some reason (for example, it is interrupted or preempted on the backfill partition), you won't have to spend time redoing the computation that was completed before the interruption occurred.
Goals of Checkpoint
People often use text editors to write text and scripts. Suppose you've been typing in a text editor for over an hour without saving and your computer suddenly freezes or restarts. You are likely to lose the whole hour of writing and have to start all over again.
Now assume that you had been saving your work every 5 minutes. In the same scenario, you would lose at most 5 minutes of work.
A checkpoint, therefore, is simply a save of your work at certain intervals.
Benefits of using Checkpoint?
- Coping with node failure
- Debugging
- Monitoring
- To fit in time constraints
- Preemption in case of shared resources
How it works
The whole idea of checkpointing is to save the state of a program every time a checkpoint is reached and, after any unplanned interruption, to restart from that saved state rather than from the beginning.
Checkpointing can slow down the execution of your program. However, it's still good practice to add checkpoints after every heavy computation.
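One way to limit the slowdown mentioned above is to save every few iterations instead of every iteration. The following is a minimal sketch, not a definitive implementation: run, checkpoint_every, and save_state are hypothetical names chosen for illustration, and a suitable interval depends on how expensive each step and each save are in your own program.

```python
def run(total_steps, checkpoint_every, save_state):
    """Run the loop, saving only every `checkpoint_every` steps."""
    for step in range(1, total_steps + 1):
        # Heavy computation for this step would go here.
        is_last = step == total_steps
        if step % checkpoint_every == 0 or is_last:
            save_state(step)  # e.g. write `step` and results to a file


# Record which steps trigger a save instead of writing to disk.
saved_steps = []
run(total_steps=10, checkpoint_every=4, save_state=saved_steps.append)
print(saved_steps)  # [4, 8, 10]
```

A larger interval means fewer writes but more repeated work after an interruption, since everything done since the last save is lost.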
Examples
You can make your program checkpoint-able by saving its state in every iteration as well as looking for a state file on startup.
Workflow
- Look for a state file.
- If found, restore the state; otherwise, create the initial state.
- Periodically save the state or the required values.
The following examples show the same program in several programming languages, with and without checkpointing.
Without Checkpoint (Python)
#!/usr/bin/env python
from time import sleep

start = 1
end = 10
for i in range(start, end):
    # Heavy computations
    print(i)
    sleep(1)
# End of program
With Checkpoint (Python)
#!/usr/bin/env python
from time import sleep

try:
    # Try to recover current state
    with open('state', 'r') as file:
        start = int(file.read())
except (OSError, ValueError):
    # Otherwise bootstrap at 1,
    # i.e. start from the beginning
    start = 1
end = 10
for i in range(start, end):
    # Heavy computations
    print(i)
    sleep(1)
    # Save current state
    with open('state', 'w') as file:
        file.write(str(i))
# End of program
Explanation
In the Python code with the checkpoint implementation, the try block attempts to recover the saved state: it opens the state file, reads its contents, and stores the value in the integer variable start. If the state file can't be found, execution moves to the except block, which initializes start to 1.
At the end of each loop iteration, after the heavy computation, the current state is saved. In this case, the program writes the current value of i to the state file. If the program is restarted, the try block reads that value back into start and the loop resumes from it.
This example saves only the loop counter. In a real program, you also need to save the state of your computation at this step.
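To illustrate saving the computation state together with the loop counter, here is a hedged Python sketch. The file name state.json, the keys next_i and total, and the helper names are all assumptions made for this example. It also writes to a temporary file and renames it, so that an interruption in the middle of a save doesn't leave a half-written state file behind.

```python
import json
import os

STATE_FILE = "state.json"  # hypothetical file name


def load_state(path=STATE_FILE):
    """Return the saved state, or a fresh initial state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {"next_i": 1, "total": 0}


def save_state(state, path=STATE_FILE):
    """Write to a temporary file first, then rename it over the old state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX file systems


def run(end=10, path=STATE_FILE):
    state = load_state(path)
    for i in range(state["next_i"], end + 1):
        state["total"] += i      # stand-in for the heavy computation
        state["next_i"] = i + 1  # the next iteration still to run
        save_state(state, path)
    return state["total"]
```

Unlike the example above, this sketch stores the index of the next iteration rather than the current one, so a restarted job doesn't repeat the iteration it had already finished.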
Without Checkpoint (R)
start <- 1
end <- 10
for (i in seq(start, end)) {
    print(i)
    Sys.sleep(1)
}
With Checkpoint (R)
# Try to recover current state
# (load() restores the saved variable i and returns only its name)
loaded <- try(load(file = "state.RData"), silent = TRUE)
if (inherits(loaded, "try-error")) {
    # Otherwise bootstrap at 1,
    # i.e. start from the beginning
    start <- 1
} else {
    start <- i
}
end <- 10
for (i in seq(start, end)) {
    # Heavy computations
    print(i)
    Sys.sleep(1)
    # Save current values
    save(i, file = "state.RData")
}
Explanation
In the R script with the checkpoint implementation, the try function attempts to load the saved data from the file state.RData. If the file can't be loaded, the start variable is set to 1.
Inside the loop, after the heavy computation, the save function writes the state to state.RData. In this example, only the current value of i is saved. If the program is restarted, it recovers that value when it runs the try call again, and the loop resumes from it. Note: to save every object in the workspace, pass list = ls() to the save() function instead of i.
Without Checkpoint (C)
#include <stdio.h>
#include <unistd.h>  // sleep()

int main(void)
{
    int i, start, end;
    start = 1;
    end = 10;
    for (i = start; i <= end; i++)
    {
        printf("%d\n", i);
        sleep(1);
    }
    return 0;
}
With Checkpoint (C)
#include <stdio.h>
#include <unistd.h>  // sleep()

int main(void)
{
    int i, start, end;
    FILE *file;
    // Try to recover current state
    file = fopen("state", "r");
    if (file)
    {
        if (fscanf(file, "%d", &start) != 1)
            start = 1;  // unreadable state: start over
        fclose(file);
    } else {
        // Otherwise bootstrap at 1
        start = 1;
    }
    end = 10;
    for (i = start; i <= end; i++)
    {
        // Heavy computation
        printf("%d\n", i);
        sleep(1);
        // Save current state
        file = fopen("state", "w");
        if (file)
        {
            // Save the desired data
            fprintf(file, "%d\n", i);
            fclose(file);
        }
    }
    return 0;
}
Explanation
In the C code with the checkpoint implementation, the fopen function tries to open the state file for reading. If it succeeds, fscanf reads the stored value into the start variable.
If the state file isn't found, the statements in the else block are executed. In this case, the start variable is set to 1.
At the end of each loop iteration, after the heavy computation, the state is saved: the program reopens the state file for writing and stores the current value of i. If the program is restarted, it reads that value back into start and the loop resumes from it.
Without Checkpoint (Octave)
the_start = 1;
the_end = 10;
for i = the_start:the_end
    disp(i)
    pause(1)
end
With Checkpoint (Octave)
% Try to recover current state
try
    the_start = dlmread('state');
catch
    % Otherwise bootstrap at 1
    the_start = 1;
end_try_catch
the_end = 10;
for i = the_start:the_end
    % Heavy computations
    disp(i)
    pause(1)
    % Save current state
    dlmwrite('state', i)
end
Explanation
In the Octave code with the checkpoint implementation, the try block attempts to read the data stored in the state file and assign it to the_start. If this succeeds, the catch block is skipped. If the saved file can't be read, execution moves to the catch block, which sets the_start to 1.
Inside the loop, after the heavy computation, dlmwrite saves the state of the program (ideally with all necessary data). Here it saves only the current value of i, which becomes the value of the_start on restart. If the program is restarted, it finds the latest saved value when it executes the try block, and the loop resumes from it.
Without Checkpoint (Fortran)
program count
    implicit none
    integer :: i, start, end
    start = 1
    end = 10
    do i = start, end
        ! Heavy computations
        write(*, '(i2)') i
        call sleep(1)
    end do
end program count
With Checkpoint (Fortran)
program count
    implicit none
    integer :: i, start, end
    integer :: stat
    ! Try to recover current state
    open(1, file='state', status='old', &
         action='READ', iostat=stat)
    if (stat .eq. 0) then
        read(1, *) start
        close(1)
    else
        ! Otherwise bootstrap at 1
        start = 1
    end if
    end = 10
    do i = start, end
        ! Heavy computations
        write(*, '(i2)') i
        call sleep(1)
        ! Save current state
        open(1, file='state', status='REPLACE', &
             action='WRITE', iostat=stat)
        write(1, '(i2)') i
        close(1)
    end do
end program count
Explanation
In the Fortran code with the checkpoint implementation, the open statement tries to open the existing state file. If the file is found (stat equals 0), the read statement stores the value from the file in the start variable. If the file isn't found, the else block sets the start variable to 1.
Inside the loop, after the heavy computation, the program saves the state by replacing the state file with the current value of i. In a real program, all necessary data should be saved here. If the program is restarted, it reads the latest value from the file and assigns it to the start variable.
Without Checkpoint (MATLAB)
s = show(10);
disp(s);

function L = show(i)
    L = zeros(1, i);
    for j = 1:i
        % Heavy computation
        L(j) = j;
    end
end
With Checkpoint (MATLAB)
s = show(10);
disp(s);

function L = show(i)
    L = zeros(1, i);
    filename = 'state.mat';
    if exist(filename, 'file')
        % Recover the saved result
        pp = load(filename);
        L = pp.L;
    else
        for j = 1:i
            % Heavy computation
            L(j) = j;
        end
        % Save the result
        save(filename, 'L');
    end
end
Explanation
In the MATLAB code with the checkpoint implementation, the exist call checks whether the saved file state.mat is present. If it is, the script loads the file and assigns the stored value to the variable L.
If the saved file isn't found, execution moves to the else block, which runs the loop to compute L and then saves it to state.mat.
If the program is restarted after that point, it finds the saved value of L and skips the computation. Note that this example saves the state only once, after the whole loop has finished. For a long-running loop, save inside the loop as in the other examples.
References
For more explanation and examples, please see the PowerPoint presentation "Checkpointing" from UC Louvain.