Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] using idle computers in computer labs for CFD jobs

Date: Mon, 19 Oct 2015 12:25:22 -0400
From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] using idle computers in computer labs for CFD jobs

I've found that quite a few compute-intensive tools with long-running scenarios have self-checkpointing capabilities built in, even if it's only to pick up where they left off in a batch of independent runs - which is, naturally, of limited use when you split up the batch into one-run jobs and submit them to HTCondor to run all at the same time.

I'm not sure if it's what you're using, but here's some information on self-checkpointing for ANSYS Fluent jobs, on page 39:

https://uiuc-cse.github.io/me498cm-fa15/lessons/fluent/refs/ANSYS%20Fluent%20Getting%20Started%20Guide.pdf

They mention native LSF and SGE integration, but also indicate that you can checkpoint a running Fluent job by creating a /tmp/check-fluent file. You can checkpoint and exit ("vacate" in HTCondorese) by creating /tmp/exit-fluent.

With HTCondor on Linux and the MOUNT_UNDER_SCRATCH option, you can bind-mount a tmp and var/tmp directory in the job's scratch directory so that each job has its own /tmp and /var/tmp. This means that only a single slot would be affected by creation of a /tmp/check-fluent file in the job's context, since it would be in ${_CONDOR_SCRATCH_DIR}/tmp/check-fluent.

It would be easy enough to write a wrapper which traps the HTCondor checkpointing or soft-kill signal and creates the appropriate file for Fluent. SIGTSTP would create tmp/exit-fluent and SIGUSR2 would create tmp/check-fluent (see p.475 in the 8.2.9 manual); the soft-kill signal defaults to SIGTERM in the vanilla universe.
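A rough sketch of such a wrapper in Python follows, untested against a real Fluent run. The function names, the choice to treat SIGTERM like "checkpoint and exit", and the default flag directory of /tmp (which assumes the job has its own /tmp via MOUNT_UNDER_SCRATCH) are my assumptions rather than anything from the Fluent or HTCondor documentation.

```python
import signal
import subprocess
from pathlib import Path

# Signal -> name of the flag file Fluent watches for.
# SIGUSR2 and SIGTSTP follow the roles described above; mapping SIGTERM
# (the vanilla soft-kill default) to exit-fluent is an assumption.
FLAG_FOR_SIGNAL = {
    signal.SIGUSR2: "check-fluent",   # checkpoint and keep running
    signal.SIGTSTP: "exit-fluent",    # checkpoint and exit (vacate)
    signal.SIGTERM: "exit-fluent",    # soft kill: checkpoint and exit
}


def install_flag_handlers(flag_dir="/tmp"):
    """Make each mapped signal create its flag file inside flag_dir."""
    flag_dir = Path(flag_dir)

    def handler(signum, _frame):
        (flag_dir / FLAG_FOR_SIGNAL[signum]).touch()

    for signum in FLAG_FOR_SIGNAL:
        signal.signal(signum, handler)


def run_wrapped(command, flag_dir="/tmp"):
    """Start the solver command, relay signals as flag files, return its exit code."""
    install_flag_handlers(flag_dir)
    child = subprocess.Popen(command)
    # wait() resumes after a handler runs, so the wrapper stays alive
    # until the solver itself has finished writing and exits.
    return child.wait()
```

The job's executable would then be a small script that calls sys.exit(run_wrapped(sys.argv[1:])) with the real Fluent command line passed as arguments. The wrapper never forwards the signal to Fluent; it only drops the file and waits for Fluent to notice it.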

Fluent defaults to finishing the current iteration before checkpointing, but it can also be directed to complete all iterations in the current time-step first, which could take longer. You'd want to set your timeouts in HTCondor (i.e., max vacate time) to ensure it has enough time to finish a checkpoint.


	Michael V. Pelletier IT Program Execution Principal Engineer 978.858.9681 (5-9681) NOTE NEW NUMBER 339.293.9149 cell 339.645.8614 fax michael.v.pelletier@xxxxxxxxxxxx
