* Alain Roy <roy@xxxxxxxxxxx> [2009-06-05 09:52 -0500]:
> This mailing list is for discussion about the Condor ClassAd library,
> not general Condor issues. You're better off with condor-users. That
> said, I'll try to answer your questions.
Thanks, I have just subscribed to the condor-users mailing list.
>> 1. In a DAG input file, it seems that the submit description
>> filenames given to jobs serve as unique names when expressing
>> dependencies. That was a mouthful, so here's an example:
>>
>> # Filename: B.dag
>> JOB A A.condor DONE
>> JOB B B.condor
>> PARENT A CHILD B
>>
>> So, my understanding is that job B will only run once all jobs
>> described by A.condor are completed. For example, let's say the
>> following submit files were enqueued:
>>
>> 1. A.condor
>> 2. B.dag
>> 3. A.condor
>>
>> Then, would B.dag only run once #1 is completed or once all
>> submits matching A.condor are completed or is there something
>> I don't understand?
>
> I'm not sure I follow, but...
>
> a) DAGMan expects that A.condor will submit exactly one job--no more, no
> less. So your question about "all submits matching A.condor" doesn't
> quite make sense.
>
> b) DAGMan doesn't look at other jobs that have been submitted, it only
> tracks the jobs it's submitted. So DAGMan will only submit A.condor,
> then B.condor. If A or B was separately submitted, DAGMan won't notice
> or pay attention. It submits the jobs, then tracks them by their job id.
Thanks for the clarification; my original question indeed didn't make
sense because it rested on false assumptions. The way DAGMan handles
dependencies makes more sense to me now.
>> 2. Is there a way to express either a submit description file or a DAG
>> input file so that an executable is run on each node in a cluster
>> only once? If not, must I enqueue a submit description file for
>> each node with something like:
>>
>> requirements = other.hostname == 'foo'
>>
>> And so forth for each host. (Note that "hostname" probably isn't
>> part of ClassAd, but I mean anything that uniquely identifies each
>> node in a cluster)
>
> There's no easy way to say "once for each host". Even ignoring the fact
> that the list of hosts is dynamic, there isn't a way. I would probably
> script this: make a script that takes a list of hostnames (or discovers
> it), then submits one job for each host with requirements set to run
> there. You could cluster them in a DAG, if you like, though it's not
> necessary.
I just read a PPT presentation which gave me the impression it might
be possible to use a PRE script in a nested DAG which would call
condor_submit_dag in order to generate nodes dynamically. For example,
let's say I specified the following DAGs:
# Filename: A.dag
JOB A A.condor
JOB B B.dag
PARENT A CHILD B
# Filename: B.dag
SCRIPT PRE B loop-script
JOB B B.condor
Then, my understanding is that it might be possible to build loop-script in
such a way that it could discover the available hosts and automatically
submit B.condor for each host. Is my understanding correct?
>> 3. Would it be possible to remove a resource provider (a machine) from
>> a cluster but only once the current jobs have completed as well as
>> all the other dependent jobs as defined by the pending DAG input
>> files? Here's an example:
>>
>> # Filename: A.dag
>> JOB A A.condor
>> JOB B B.condor
>> PARENT A CHILD B
>>
>> So, if a node is in the middle of running job A, I would like to be
>> notified somehow when job B has completed. However, I don't
>> necessarily want to hard-code that I'm waiting for job B to
>> complete; I would rather express it abstractly: tell me when the
>> current jobs and their dependents have completed.
>
> I'm a bit confused about what you're asking for, but you can tell Condor to
> turn off on a machine by using "condor_off -peaceful". It will wait for
> the running jobs to finish. Make sense?
>
> This is outside of DAGMan--I'm a bit confused by how you're combining
> DAGMan with removing machines from a cluster.
Sorry for the confusion; I was mostly concerned with a mechanism to
determine when a machine was finished running a job instead of having
the job interrupted and migrated to another machine. Turning off a
machine or removing it from the resource pool should be trivial and,
indeed, not related to DAGMan :)
Thanks,
Marc