Re: [classad-users] Dependencies and repetitions in DAGMan


Date: Fri, 5 Jun 2009 11:33:19 -0400
From: Marc Tardif <marc@xxxxxxxxxxxxx>
Subject: Re: [classad-users] Dependencies and repetitions in DAGMan
* Alain Roy <roy@xxxxxxxxxxx> [2009-06-05 09:52 -0500]:
> This mailing list is for discussion about the Condor ClassAd library,  
> not general Condor issues. You're better off with condor-users. That  
> said, I'll try to answer your questions.

Thanks, I have just subscribed to the condor-users mailing list.

>> 1. In a DAG input file, it seems that the names of the submit
>>   description files given to jobs serve as unique names when
>>   expressing dependencies. That was a mouthful, so here's an example:
>>
>>   # Filename: B.dag
>>   JOB A A.condor DONE
>>   JOB B B.condor
>>   PARENT A CHILD B
>>
>>   So, my understanding is that job B will only run once all jobs
>>   described by A.condor are completed. For example, let's say the
>>   following submit files were enqueued:
>>
>>   1. A.condor
>>   2. B.dag
>>   3. A.condor
>>
>>   Then, would B.dag only run once #1 is completed or once all
>>   submits matching A.condor are completed or is there something
>>   I don't understand?
>
> I'm not sure I follow, but...
>
> a) DAGMan expects that A.condor will submit exactly one job--no more, no 
> less. So your question about "all submits matching A.condor" doesn't 
> quite make sense.
>
> b) DAGMan doesn't look at other jobs that have been submitted, it only  
> tracks the jobs it's submitted. So DAGMan will only submit A.condor,  
> then B.condor. If A or B was separately submitted, DAGMan won't notice  
> or pay attention. It submits the jobs, then tracks them by their job id.

Thanks for the clarification; my original question indeed didn't make
sense because I was making false assumptions. The way DAGMan handles
dependencies makes more sense to me now.

>> 2. Is there a way to express either a submit description file or a DAG
>>   input file so that an executable is run on each node in a cluster
>>   only once? If not, must I enqueue a submit description file for
>>   each node with something like:
>>
>>   requirements = other.hostname == 'foo'
>>
>>   And so forth for each host. (Note that "hostname" probably isn't
>>   part of ClassAd, but I mean anything that uniquely identifies each
>>   node in a cluster)
>
> There's no easy way to say "once for each host". Even ignoring the fact 
> that the list of hosts is dynamic, there isn't a way. I would probably 
> script this: make a script that takes a list of hostnames (or discovers 
> it), then submits one job for each host with requirements set to run 
> there. You could cluster them in a DAG, if you like, though it's not 
> necessary.

I just read a PPT presentation which gave me the impression it might
be possible to use a PRE script in a nested DAG which would call
condor_submit_dag in order to generate nodes dynamically. For example,
let's say I specified the following DAGs:

    # Filename: A.dag
    JOB A A.condor
    JOB B B.dag
    PARENT A CHILD B

    # Filename: B.dag
    SCRIPT PRE B loop-script
    JOB B B.condor

Then, my understanding is that it might be possible to build loop-script in
such a way that it could discover the available hosts and automatically
submit B.condor for each host. Is my understanding correct?
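For what it's worth, here is a minimal sketch (in Python, untested against a real pool) of what such a loop-script might do: take or discover a list of hostnames, then write one submit description per host with a requirements expression pinning the job to that host, as Alain suggested. The hostnames below are placeholders, and discovery (e.g. via condor_status) is stubbed out; note also that, per the answer above, any job a script submits directly with condor_submit would not be tracked by DAGMan itself.

```python
#!/usr/bin/env python
# Hypothetical "loop-script" sketch: generate one submit description
# per host. Host discovery (e.g. `condor_status -format '%s\n' Machine`)
# is stubbed with a fixed, made-up host list.

def make_submit(executable, host):
    """Return the text of a submit description pinning a job to `host`.

    `Machine` is a standard machine ClassAd attribute holding the
    host's fully qualified name.
    """
    return "\n".join([
        "universe     = vanilla",
        "executable   = %s" % executable,
        'requirements = (Machine == "%s")' % host,
        "queue",
        "",
    ])

def write_submits(executable, hosts):
    """Write one B-<host>.condor file per host; return the filenames."""
    filenames = []
    for host in hosts:
        fname = "B-%s.condor" % host
        with open(fname, "w") as f:
            f.write(make_submit(executable, host))
        # A real script would now run:  condor_submit <fname>
        filenames.append(fname)
    return filenames

if __name__ == "__main__":
    # Placeholder host list standing in for real discovery.
    print(write_submits("B.exe", ["node1.example.com", "node2.example.com"]))
```

The per-host generation is the part worth keeping even if the submission step changes; the generated files could just as easily be referenced as JOB entries in a rewritten nested DAG instead of being handed to condor_submit directly.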

>> 3. Would it be possible to remove a resource provider (a machine) from
>>   a cluster but only once the current jobs have completed as well as
>>   all the other dependent jobs as defined by the pending DAG input
>>   files? Here's an example:
>>
>>   # Filename: A.dag
>>   JOB A A.condor
>>   JOB B B.condor
>>   PARENT A CHILD B
>>
>>   So, if a node is in the middle of running job A, I would like to be
>>   notified somehow when job B has completed. However, I don't necessarily
>>   want to hard code that I'm waiting for job B to complete; I would rather
>>   express it abstractly: tell me when the current jobs and their
>>   dependents have completed.
>
> I'm a bit confused what you're asking for, but you can tell Condor to  
> turn off on a machine by using "condor_off -peaceful". It will wait for 
> the running jobs to finish. Make sense?
>
> This is outside of DAGMan--I'm a bit confused by how you're combining  
> DAGMan with removing machines from a cluster.

Sorry for the confusion, I was mostly concerned with a mechanism to
determine when a machine had finished running a job, rather than having
the job interrupted and migrated to another machine. Turning off a
machine or removing it from the resource pool should be trivial and,
indeed, not related to DAGMan :)

Thanks,
Marc
