Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] dagman capabilities

Date: Mon, 05 Oct 2009 12:11:01 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] dagman capabilities

michael bane wrote:

I was looking at 'job recovery: the rescue DAG' in the online Condor manual (2.10.6) but couldn't decide if DAGMan was capable of handling the situation of submitting N jobs (embarrassingly parallel, say) to the Vanilla universe (since we cannot link to the checkpointing for Standard) and then resubmitting those which are killed (e.g. due to pre-emption by work on the given nodes)?

No need for DAGMan to do this... Condor itself will automatically restart a Vanilla universe job that fails to complete. By "complete", I mean the job executable exits of its own accord.

So if you just need to submit a group of N jobs and have them resilient to preemption, crashes, etc., just use condor_submit. No need for using condor_submit_dag.
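
As a rough illustration of the condor_submit route, here is a minimal Python sketch that builds a submit description file for N vanilla universe jobs. The executable name my_job.sh, the output/error/log file names, and the helper function itself are assumptions made for the example, not anything taken from the original question; adapt them to your own setup.

```python
def vanilla_submit_description(executable: str, n_jobs: int) -> str:
    """Return submit description text that queues n_jobs vanilla universe jobs."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        # $(Process) expands to 0..n_jobs-1, one value per queued job.
        "arguments  = $(Process)",
        "output     = job.$(Process).out",
        "error      = job.$(Process).err",
        "log        = jobs.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical executable name; write the output to a file and pass
    # that file to condor_submit.
    print(vanilla_submit_description("my_job.sh", 10), end="")
```

Saving the printed text as, say, jobs.sub and running condor_submit on it would queue the whole group in one cluster; no DAG file is involved.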

On the other hand, if you have inter-job dependencies (i.e. after your N vanilla jobs complete successfully, you want another job to be submitted), then you want condor_submit_dag.
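
For the dependency case, the following Python sketch generates a DAG input file in which N worker nodes must all finish before one follow-up node is submitted. The node names (work0, work1, ..., collect), the submit file names work.sub and collect.sub, and the index macro passed through VARS are all hypothetical choices for the example; it also assumes each referenced submit file queues a single job.

```python
def fan_in_dag(n_jobs: int,
               work_submit: str = "work.sub",
               collect_submit: str = "collect.sub") -> str:
    """Return DAG file text: n_jobs independent nodes, then one child node."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    names = [f"work{i}" for i in range(n_jobs)]
    lines = []
    for i, name in enumerate(names):
        # One DAG node per worker, all sharing the same submit file.
        lines.append(f"JOB {name} {work_submit}")
        # Pass a per-node value that the submit file can use as $(index).
        lines.append(f'VARS {name} index="{i}"')
    lines.append(f"JOB collect {collect_submit}")
    # The child is submitted only after every parent completes successfully.
    lines.append(f"PARENT {' '.join(names)} CHILD collect")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(fan_in_dag(3), end="")
```

The generated text would be saved to a file and handed to condor_submit_dag in place of running condor_submit directly.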


regards,
Todd

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685
