
[HTCondor-users] Long running OSG jobs



Hello,

I have some long-running (>24-hour) jobs I would like to deploy on the Open Science Grid.

The jobs self-checkpoint (by writing out simple text files to disk) every 60 minutes and when they receive an interrupt signal.
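
For concreteness, the checkpoint logic is roughly of this shape. This is a minimal sketch rather than the actual code; the file name, the work loop, and the choice of SIGTERM as the interrupt signal are all illustrative:

import signal
import sys
import time

CHECKPOINT_FILE = "checkpoint.txt"   # illustrative name
CHECKPOINT_INTERVAL = 3600           # checkpoint every 60 minutes

def write_checkpoint(state):
    # Simple text file that a restarted job can detect and parse.
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(state))

def read_checkpoint():
    # Resume from a checkpoint if one was transferred in with the job.
    try:
        with open(CHECKPOINT_FILE) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0

state = read_checkpoint()

def on_interrupt(signum, frame):
    # Checkpoint immediately on SIGTERM (e.g. a graceful eviction), then exit.
    write_checkpoint(state)
    sys.exit(0)

signal.signal(signal.SIGTERM, on_interrupt)

last = time.time()
while state < 10**6:                 # stands in for the real work loop
    state += 1
    if time.time() - last >= CHECKPOINT_INTERVAL:
        write_checkpoint(state)
        last = time.time()
write_checkpoint(state)              # final state on normal completion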

I have enabled file transfer in the submit file and added the following lines for a periodic hold/release, in the hope that the jobs will:
1) get evicted
2) transfer their working directory (which contains the checkpoint files) back to the submission host
3) resume under HTCondor and re-send that working directory to the new worker node
4) identify the presence of the checkpoint files and cleanly resume from where they left off

However, the jobs do not appear to transfer the data back in this scenario. I have also tried condor_rm, which I would expect to terminate the job and send the non-empty working directory back. This also fails to achieve the desired effect.

Some pertinent details from the submit file:

should_transfer_files = YES
transfer_output_files = $(macrooutputDir)
transfer_input_files = datafind,$(macrooutputDir)
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > 8*3600)
periodic_hold_subcode = 12345
periodic_release = (JobStatus == 5) && (time() - EnteredCurrentStatus > 5*60) && (PeriodicHoldSubCode =?= 12345)
want_graceful_removal = true

where $(macrooutputDir) is the name of each job's working directory, as specified in the DAGMan file.
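
For reference, each node in the DAG sets that variable with a VARS line, roughly like this (the node name, submit file, and directory name are illustrative, not the real ones):

JOB job001 analysis.sub
VARS job001 macrooutputDir="job001_out"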

Any advice would be greatly appreciated,
Many thanks,
James

--
===========================================
James Clark
Research Scientist

Center for Relativistic Astrophysics
School of Physics
Georgia Institute of Technology
Atlanta GA 30332
office: Boggs 1-110
Tel. (cell): 413-230-1412
Skype: jamesclark_
===========================================