
Re: [HTCondor-users] using idle computers in computer labs for CFD jobs



From: Peter Ellevseth <Peter.Ellevseth@xxxxxxxxxx>
Date: 03/01/2016 04:49 AM

> I have not done that much research on checkpointing yet, so forgive my
> ignorance. I just have a question on the concept of checkpointing. Is the
> point to just give some sort of initial value to the job, or does
> checkpointing involve some sort of memory-dump from where the simulation
> can continue?
>
> For example, if running a set of CFD jobs with Ansys, it is possible to
> tell the solver to 'start from this file'. Will that be considered
> checkpointing? From what I read here:
> http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/
> this seems to be the case.
>
> What is then the process of checkpointing? When the job gets a vacate
> signal, will it then run some checkpointing-routine? Or will it always
> check for checkpoint information when a job starts?

Hi Peter,

Since you said "CFD" and "Ansys," I suppose it's safe to assume that you
mean "Fluent." ;-)

I put some thought into this question just recently as it happens (although
I don't yet have any pools running Fluent), and here's the gist of what I came
up with.

The Fluent docs indicate that a checkpoint is triggered by the presence of a
flag file in /tmp - namely /tmp/check-fluent or /tmp/exit-fluent. When Fluent
checkpoints, it runs to the end of the current iteration and then saves a
"case" and a "data" file containing its forward progress. It then either
continues running or exits, depending on which flag file it found. When it
restarts, if it finds a valid case and data file, it picks up where it left
off.
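
In other words, the trigger is nothing more than the file's existence - from
inside the job's environment, a plain "touch /tmp/check-fluent" requests a
checkpoint-and-continue, and "touch /tmp/exit-fluent" a checkpoint-and-exit,
at least per my reading of the docs.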

Of course, the default use of /tmp presumes that there are no other instances
of Fluent running on the machine in question, which would not necessarily be
the case for an HTCondor exec node - they may not even be your own Fluent runs.

This is where the MOUNT_UNDER_SCRATCH knob on Linux comes into play. By
listing "/tmp" and "/var/tmp" in that setting, each job gets its own
private /tmp, looped back into $_CONDOR_SCRATCH_DIR/tmp. Then when the
flag file is created as /tmp/check-fluent, it's actually stored in the
job's scratch directory and visible only to that job.
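
In the pool configuration, that would be something along the lines of:

MOUNT_UNDER_SCRATCH = /tmp,/var/tmp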

Now, this type of checkpoint is distinct from the standard universe's checkpoint,
as it's managed internally by the application rather than the standard universe
wrapper applied by condor_compile. For Fluent and similar applications which
can't be relinked in this way, we need to figure out how to signal Fluent itself
to checkpoint periodically.

The <Keyword>_HOOK_UPDATE_JOB_INFO hook looked to be a good way to do this.
There may be more clever ideas proffered by the talented folks on the list,
but we can both look forward to those. This hook runs eight seconds after
the job starts and once every five minutes after that, while the job is
running on a machine.

In our submit description, we'd have:

+HookKeyword = "FLUENT"

In our pool configuration, we'd have:

FLUENT_HOOK_UPDATE_JOB_INFO = $(LIBEXEC)/fluent_periodic_checkpoint

Our script will then run on that schedule - eight seconds after startup and
every five minutes thereafter. Needless to say, we don't want it to trigger
a Fluent checkpoint that often, so the script can use the job ClassAd
provided to it to check the JobCurrentStartDate attribute and see whether
enough time has elapsed for a first checkpoint, and/or look at an existing
checkpoint to see whether it's old enough yet.

We could also have it look for a "FluentCheckpointInterval" attribute in the
job ClassAd, so we could say something like this in the submit description:

+FluentCheckpointInterval = 45 * $(MINUTE)

... to tell the hook script that it should checkpoint every 45 minutes. Maybe
it would default to once an hour.

At the appointed time based on the start time or age of the prior case and
data files, the script would simply create the check-fluent file in the job's
loop-mounted /tmp directory and exit, thus triggering the Fluent internal
checkpoint.
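
To make that concrete, here's a rough, untested sketch of what such a
fluent_periodic_checkpoint script might look like. It assumes a few things
I'd want to verify against the manual: that the UPDATE_JOB_INFO hook
receives the job ClassAd as plain text on its standard input, that
FluentCheckpointInterval shows up there as a plain number of seconds, and
that a little stamp file of my own invention in the loop-mounted /tmp is an
acceptable way to remember when we last asked for a checkpoint:

#!/usr/bin/env python3
# Rough, untested sketch of a fluent_periodic_checkpoint hook script.
# Assumptions worth double-checking against the HTCondor manual:
#   - the UPDATE_JOB_INFO hook gets a copy of the job ClassAd on stdin;
#   - FluentCheckpointInterval, if present, is a plain integer of seconds;
#   - /tmp here is the job's private, scratch-looped /tmp courtesy of
#     MOUNT_UNDER_SCRATCH, so these files are per-job.

import os
import re
import sys
import time

FLAG_FILE = "/tmp/check-fluent"         # Fluent: checkpoint, keep running
STAMP_FILE = "/tmp/.last-ckpt-request"  # our own bookkeeping (made-up name)
DEFAULT_INTERVAL = 3600                 # default to one checkpoint per hour


def classad_int(ad_text, name, default):
    """Crudely pull an integer attribute out of the textual job ClassAd."""
    match = re.search(r'^%s\s*=\s*(\d+)\s*$' % re.escape(name),
                      ad_text, re.MULTILINE)
    return int(match.group(1)) if match else default


def main():
    ad_text = sys.stdin.read()
    now = int(time.time())
    start = classad_int(ad_text, "JobCurrentStartDate", now)
    interval = classad_int(ad_text, "FluentCheckpointInterval",
                           DEFAULT_INTERVAL)

    # When did we last ask for a checkpoint? If we never have during this
    # execution, fall back to the job's current start date.
    try:
        last = int(os.path.getmtime(STAMP_FILE))
    except OSError:
        last = start

    if now - last >= interval:
        open(FLAG_FILE, "a").close()    # Fluent saves case/data, continues
        with open(STAMP_FILE, "w") as stamp:
            stamp.write(str(now))       # remember when we asked


if __name__ == "__main__":
    main()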

To preserve the checkpoint across runs, you'd of course need to set this
parameter in your submit description:

when_to_transfer_output = ON_EXIT_OR_EVICT

This will cause HTCondor to save your scratch directory when eviction occurs,
allowing Fluent to find the previously-created case and data files and pick
up where that checkpoint left off.
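
For reference, pulling the pieces together, a bare-bones submit description
might look roughly like this - the executable and input file names are
placeholders, and 2700 is just the 45-minute example above expressed in
seconds:

universe                  = vanilla
executable                = run_fluent.sh
should_transfer_files     = YES
when_to_transfer_output   = ON_EXIT_OR_EVICT
transfer_input_files      = model.cas
+HookKeyword              = "FLUENT"
+FluentCheckpointInterval = 2700
queue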

This covers periodic checkpointing, but ideally we'd also like on-demand
checkpointing, so that the job can be told to checkpoint during the
eviction process rather than losing up to a full checkpoint interval's
worth of work.

At first glance you'd think this could be handled by defining a
"FLUENT_HOOK_EVICT_CLAIM" hook, which instead of creating a "check-fluent"
file, would create the "exit-fluent" file. However, unlike the job status hook,
the evict claim hook runs as the ID of the condor_startd, which would usually
be the "condor" user. This means that the hook script wouldn't have access
to the job's scratch directory, and thus couldn't create the flag file in
the scratch-looped /tmp directory.

It's possible to configure Fluent to look for its flag files in some other
location, so that might offer a path forward.

I think the alternative would be a wrapper script around the Fluent
executable which recognizes the eviction signal from HTCondor and creates
the exit-fluent flag file when that signal arrives.
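
As a very rough illustration of that idea - again untested, and assuming
HTCondor's soft-kill signal (SIGTERM unless the job's kill_sig says
otherwise) actually reaches the wrapper and doesn't take Fluent down with
it - something like this could sit between HTCondor and the solver:

#!/usr/bin/env python3
# Untested sketch of a wrapper that turns HTCondor's eviction signal into
# Fluent's "checkpoint and exit" request. The Fluent command line is
# whatever gets passed as arguments to this wrapper.

import signal
import subprocess
import sys

EXIT_FLAG = "/tmp/exit-fluent"   # the scratch-looped per-job /tmp, as above


def request_checkpoint_and_exit(signum, frame):
    # Ask Fluent to save its case and data files and shut down at the end
    # of the current iteration; we then just keep waiting for it below.
    open(EXIT_FLAG, "a").close()


def main():
    signal.signal(signal.SIGTERM, request_checkpoint_and_exit)
    fluent = subprocess.Popen(sys.argv[1:])  # e.g. fluent 3ddp -g -i run.jou
    # On Python 3.5+, wait() resumes automatically after the handler runs,
    # so this simply waits out Fluent's checkpoint-and-exit.
    sys.exit(fluent.wait())


if __name__ == "__main__":
    main()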

Also, if there's a newer version of the Fluent documentation that I missed
which says Fluent can checkpoint in response to a signal rather than the
flag files, that could be another option.

Good luck! Let us know how it works out!

        -Michael V. Pelletier.