
[HTCondor-users] Long running OSG jobs



Hello,

I have some long-running (>24-hour) jobs I would like to deploy on the Open Science Grid.

The jobs self-checkpoint (by writing out simple text files to disk) every 60 minutes and when they receive an interrupt signal.
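
For concreteness, the checkpoint logic is roughly of this shape. This is a minimal sketch rather than the actual code; the file name, the work loop, and the choice of SIGTERM as the interrupt signal are all illustrative:

import signal
import sys
import time

CHECKPOINT_FILE = "checkpoint.txt"   # illustrative name
CHECKPOINT_INTERVAL = 3600           # checkpoint every 60 minutes

def write_checkpoint(state):
    # Simple text file that a restarted job can detect and parse.
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(state))

def read_checkpoint():
    # Resume from a checkpoint if one was transferred in with the job.
    try:
        with open(CHECKPOINT_FILE) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0

state = read_checkpoint()

def on_interrupt(signum, frame):
    # Checkpoint immediately on SIGTERM (e.g. a graceful eviction), then exit.
    write_checkpoint(state)
    sys.exit(0)

signal.signal(signal.SIGTERM, on_interrupt)

last = time.time()
while state < 10**6:                 # stands in for the real work loop
    state += 1
    if time.time() - last >= CHECKPOINT_INTERVAL:
        write_checkpoint(state)
        last = time.time()
write_checkpoint(state)              # final state on normal completion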

I have enabled file transfer in the submit file and added the following lines for a periodic hold/release, in the hope that the jobs will:
1) get evicted
2) transfer their working directory (which contains the checkpoint files) back to the submission host
3) resume under HTCondor and re-send that working directory to the new worker node
4) identify the presence of the checkpoint files and cleanly resume from where they left off

However, the jobs do not appear to transfer the data back in this scenario. I have also tried condor_rm, which I would expect to terminate the job and send the non-empty working directory back. This also fails to achieve the desired effect.

Some pertinent details from the submit file:

should_transfer_files = YES
transfer_output_files = $(macrooutputDir)
transfer_input_files = datafind,$(macrooutputDir)
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > 8*3600)
periodic_hold_subcode = 12345
periodic_release = (JobStatus == 5) && (time() - EnteredCurrentStatus > 5*60) && (PeriodicHoldSubCode =?= 12345)
want_graceful_removal = true

where $(macrooutputDir) is the name of each job's working directory, as specified in the DAGMan file.
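
For reference, each node in the DAG sets that variable with a VARS line, roughly like this (the node name, submit file, and directory name are illustrative, not the real ones):

JOB job001 analysis.sub
VARS job001 macrooutputDir="job001_out"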

Any advice would be greatly appreciated,
Many thanks,
James

--
===========================================
James Clark
Research Scientist

Center for Relativistic Astrophysics
School of Physics
Georgia Institute of Technology
Atlanta GA 30332
office: Boggs 1-110
Tel. (cell): 413-230-1412
Skype: jamesclark_
===========================================