
Re: [HTCondor-users] using idle computers in computer labs for CFD jobs

I've found that quite a few compute-intensive tools with long-running scenarios have self-checkpointing capabilities built in, even if only to pick up where they left off in a batch of independent runs. That is, naturally, of limited use when you split the batch into one-run jobs and submit them to HTCondor to run all at the same time.

I'm not sure if it's what you're using, but here's some information on self-checkpointing for ANSYS Fluent jobs, on page 39:


They mention native LSF and SGE integration, but also indicate that you can checkpoint a running Fluent job by creating a /tmp/check-fluent file. You can checkpoint and exit ("vacate" in HTCondorese) by creating /tmp/exit-fluent.

With HTCondor on Linux and the MOUNT_UNDER_SCRATCH option, you can bind-mount a tmp and var/tmp directory in the job's scratch directory so that each job has its own /tmp and /var/tmp. This means that only a single slot would be affected by creation of a /tmp/check-fluent file in the job's context, since it would be in ${_CONDOR_SCRATCH_DIR}/tmp/check-fluent.
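To make the path mapping concrete, here is a minimal sketch of where a trigger file that the job sees as /tmp/check-fluent would land on the execute host. The helper name host_trigger_path is hypothetical, and the "tmp" subdirectory is assumed from the layout described above rather than taken from the manual.

```python
import os


def host_trigger_path(name, scratch_dir=None):
    """Return the host-side path of a file the job sees as /tmp/<name>.

    Assumes /tmp is bind-mounted from a "tmp" directory inside the job's
    scratch directory, as described in the paragraph above.
    """
    if scratch_dir is None:
        # HTCondor sets _CONDOR_SCRATCH_DIR in the job's environment.
        scratch_dir = os.environ["_CONDOR_SCRATCH_DIR"]
    return os.path.join(scratch_dir, "tmp", name)
```

Inside the job itself you would still just write to /tmp/check-fluent; the helper only shows why two jobs on the same machine would not see each other's trigger files.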

It would be easy enough to write a wrapper that traps the HTCondor checkpointing or soft-kill signal and creates the appropriate file for Fluent. SIGTSTP would map to tmp/exit-fluent and SIGUSR2 to tmp/check-fluent (see p. 475 in the 8.2.9 manual); the soft-kill signal defaults to SIGTERM in the vanilla universe.
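A rough sketch of such a wrapper in Python follows. It is untested against Fluent itself, and several things are assumptions: the FLUENT_TRIGGER_DIR environment variable and the function names are made up for illustration, the trigger directory defaults to /tmp as described above, and it assumes HTCondor delivers the signal to the wrapper process rather than directly to Fluent.

```python
import os
import signal
import subprocess
import sys

# Directory where the trigger files are created. /tmp follows the description
# above; FLUENT_TRIGGER_DIR is a hypothetical override, handy for testing.
TRIGGER_DIR = os.environ.get("FLUENT_TRIGGER_DIR", "/tmp")

# Signal-to-file mapping: checkpoint-and-continue vs. checkpoint-and-exit.
SIGNAL_TO_FILE = {
    signal.SIGUSR2: "check-fluent",  # periodic checkpoint
    signal.SIGTSTP: "exit-fluent",   # checkpoint and vacate
    signal.SIGTERM: "exit-fluent",   # default soft-kill signal in vanilla
}


def make_handler(trigger_dir):
    """Build a signal handler that creates the matching trigger file."""
    def handler(signum, frame):
        path = os.path.join(trigger_dir, SIGNAL_TO_FILE[signum])
        # An empty file is created; its existence is the trigger.
        open(path, "w").close()
    return handler


def run(cmd, trigger_dir=TRIGGER_DIR):
    """Install the handlers, start the command, and return its exit code."""
    handler = make_handler(trigger_dir)
    for sig in SIGNAL_TO_FILE:
        signal.signal(sig, handler)
    child = subprocess.Popen(cmd)
    # wait() resumes after a handler runs, so the wrapper stays alive
    # until the child has finished its checkpoint and exited.
    return child.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run(sys.argv[1:]))
```

The job's executable would then be this wrapper, with the real Fluent command line passed as its arguments. The wrapper only drops the file and keeps waiting, so Fluent decides when it is actually done.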

Fluent defaults to finishing the current iteration, but it can also be directed to complete all iterations in the current time-step before checkpointing, which could take longer. You'd therefore want to set your timeouts in HTCondor (i.e., max vacate time) to ensure it has enough time to finish a checkpoint.


Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax