Hi,
This mailing list is for discussion about the Condor ClassAd library,
not general Condor issues. You're better off with condor-users. That
said, I'll try to answer your questions.
1. In a DAG input file, it seems that the names of the submit
description files given to jobs constitute unique names when
expressing dependencies. That was a mouthful, so here's an example:
# Filename: B.dag
JOB A A.condor DONE
JOB B B.condor
PARENT A CHILD B
So, my understanding is that job B will only run once all jobs
described by A.condor are completed. For example, let's say the
following submit files were enqueued:
1. A.condor
2. B.dag
3. A.condor
Then, would B.dag only run once #1 is completed, or once all
submits matching A.condor are completed, or is there something
I don't understand?
I'm not sure I follow, but...
a) DAGMan expects that A.condor will submit exactly one job--no more,
no less. So your question about "all submits matching A.condor"
doesn't quite make sense.
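To make that concrete, here is a minimal sketch of what a DAG node's submit description looks like (the executable and file names are made up for illustration). A bare "queue" statement, with no count, submits exactly one job, which is what DAGMan expects:

```
# A.condor -- submits exactly one job, as DAGMan expects
executable = a_task
universe   = vanilla
output     = a_task.out
error      = a_task.err
log        = a_task.log
queue
```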
b) DAGMan doesn't look at other jobs that have been submitted; it only
tracks the jobs it has submitted itself. So DAGMan will submit
A.condor, then B.condor. If A or B was separately submitted, DAGMan
won't notice or pay attention. It submits the jobs, then tracks them
by their job IDs.
2. Is there a way to express either a submit description file or a DAG
input file so that an executable is run on each node in a cluster
only once? If not, must I enqueue a submit description file for
each node with something like:
requirements = other.hostname == 'foo'
And so forth for each host. (Note that "hostname" probably isn't
part of ClassAd, but I mean anything that uniquely identifies each
node in a cluster.)
There's no easy way to say "once for each host". Even ignoring the
fact that the list of hosts is dynamic, there isn't a way. I would
probably script this: make a script that takes a list of hostnames (or
discovers it), then submits one job for each host with requirements
set to run there. You could cluster them in a DAG, if you like, though
it's not necessary.
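As a rough illustration of that scripting approach, here is a sketch in Python. It is an assumption of mine, not something from the original post, that the executable is called "mytask", that the hosts are listed by hand, and that the machine ClassAd attribute "Machine" holds the hostname; it also assumes condor_submit is on your PATH.

```python
#!/usr/bin/env python3
"""Sketch: submit one Condor job per host in a fixed list."""
import subprocess

# Hypothetical host list; you could instead discover it dynamically.
HOSTS = ["node01.example.com", "node02.example.com"]

def make_submit(host):
    """Build a submit description pinning the job to one machine."""
    return "\n".join([
        "executable = mytask",
        "universe = vanilla",
        # Restrict the job to a single machine by name.
        'requirements = (Machine == "%s")' % host,
        "output = mytask.%s.out" % host,
        "error  = mytask.%s.err" % host,
        "log    = mytask.log",
        "queue",
        "",
    ])

def submit_all(hosts):
    """Write one submit file per host and hand each to condor_submit."""
    for host in hosts:
        path = "submit.%s" % host
        with open(path, "w") as f:
            f.write(make_submit(host))
        subprocess.run(["condor_submit", path], check=True)

if __name__ == "__main__":
    submit_all(HOSTS)
```

You could wrap the resulting jobs in a generated DAG file if you want them tracked as one unit, but as noted above it isn't required.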
3. Would it be possible to remove a resource provider (a machine) from
a cluster, but only once its current jobs have completed, as well as
all the other dependent jobs defined by the pending DAG input
files? For example:
# Filename: A.dag
JOB A A.condor
JOB B B.condor
PARENT A CHILD B
So, if a node is in the middle of running job A, I would like to be
notified somehow when job B has completed. However, I don't
necessarily want to hard-code that I'm waiting for job B to
complete; I would rather express it abstractly: tell me when the
current jobs and their dependents have completed.
I'm a bit confused what you're asking for, but you can tell Condor to
turn off on a machine by using "condor_off -peaceful". It will wait
for the running jobs to finish. Make sense?
This is outside of DAGMan--I'm a bit confused by how you're combining
DAGMan with removing machines from a cluster.
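For example, to peacefully shut down the Condor daemons on one machine (the hostname here is a placeholder; -name selects which machine to contact):

```
condor_off -peaceful -name foo.example.com
```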
-alain