Re: [HTCondor-users] using idle computers in computer labs for CFD jobs

Now, this type of checkpoint is distinct from the standard universe's checkpoint, as it's managed internally by the application rather than by the standard universe wrapper applied by condor_compile. For Fluent and similar applications that can't be relinked in this way, we need to figure out how to signal Fluent itself to checkpoint periodically.

We expect to be releasing a new developer version (8.5.3) of HTCondor soon, which will contain some experimental features to help simplify situations like this. It sounds like you'd still need to write a wrapper script, but that may be easier than changing the configuration of your execute nodes. At any rate, if you'd like to help test the new features (or are just curious about what they'll probably be), please contact me off-list.

I think the alternative would have to be a wrapper script around the Fluent executable that recognizes the eviction signals from HTCondor and creates the exit-fluent flag file when such a signal is received.

IIRC, the 'KillSig' job attribute determines which signal is sent on an eviction, so if you'd rather not trap SIGTERM, you can choose something else.
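To make the wrapper idea a bit more concrete, here is a rough, untested-against-Fluent sketch in Python. The function names, the command-line layout, and the flag path are all placeholders I made up for illustration; you'd need to point the flag path at wherever your Fluent run actually looks for its exit-fluent file. The signal defaults to SIGTERM and is just a parameter, so you can pass something else if you set KillSig to a different signal.

```python
import signal
import subprocess
import sys


def make_flag_handler(flag_path):
    """Return a signal handler that creates flag_path when invoked."""
    def handler(signum, frame):
        # An empty file is enough to serve as the flag.
        with open(flag_path, "a"):
            pass
    return handler


def run_with_eviction_flag(command, flag_path, evict_signal=signal.SIGTERM):
    """Run command; on evict_signal, create flag_path and keep waiting.

    Assumption: the wrapped application notices the flag file on its own,
    saves its state, and exits. Nothing here forces it to do so.
    """
    previous = signal.signal(evict_signal, make_flag_handler(flag_path))
    try:
        child = subprocess.Popen(command)
        # wait() resumes after the handler runs, so the child keeps
        # running until it exits by itself.
        return child.wait()
    finally:
        # Restore the earlier handler if Python knows what it was.
        if previous is not None:
            signal.signal(evict_signal, previous)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: wrapper.py FLAG_PATH COMMAND [ARGS...]
    sys.exit(run_with_eviction_flag(sys.argv[2:], sys.argv[1]))
```

The wrapper deliberately does not forward the signal to the child: the point is to let the application finish writing its own checkpoint and exit once it sees the flag file.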

- ToddM