
Date: Fri, 5 Jun 2009 09:52:42 -0500
From: Alain Roy <roy@xxxxxxxxxxx>
Subject: Re: [classad-users] Dependencies and repetitions in DAGMan
Hi,

This mailing list is for discussion about the Condor ClassAd library, not general Condor issues. You're better off with condor-users. That said, I'll try to answer your questions.

1. In a DAG input file, it seems that the submit description filenames
  given to jobs serve as unique names when expressing dependencies.
  That was a mouthful, so here's an example:

  # Filename: B.dag
  JOB A A.condor DONE
  JOB B B.condor
  PARENT A CHILD B

  So, my understanding is that job B will only run once all jobs
  described by A.condor have completed. For example, let's say the
  following submit files were enqueued:

  1. A.condor
  2. B.dag
  3. A.condor

  Then, would B.dag only run once #1 is completed, or once all
  submits matching A.condor are completed? Or is there something
  I don't understand?

I'm not sure I follow, but...

a) DAGMan expects that A.condor will submit exactly one job--no more, no less. So your question about "all submits matching A.condor" doesn't quite make sense.

b) DAGMan doesn't look at other jobs that have been submitted; it only tracks the jobs it has submitted itself. So DAGMan will submit A.condor, then B.condor. If A or B was submitted separately, DAGMan won't notice or pay attention. It submits the jobs, then tracks them by their job IDs.
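As an aside, the "one job per node" expectation means each node's submit description should end with a bare "queue" (or "queue 1"). A minimal sketch of what A.condor might look like (all names here are illustrative, not from the original thread):

```
# A.condor -- submit description for DAG node A (illustrative names)
universe   = vanilla
executable = a.sh
output     = a.out
error      = a.err
log        = a.log
queue
```

If the queue statement created more than one job, DAGMan would not be able to track the node correctly.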

2. Is there a way to express either a submit description file or a DAG
  input file so that an executable is run on each node in a cluster
  only once? If not, must I enqueue a submit description file for
  each node with something like:

  requirements = other.hostname == 'foo'

  And so forth for each host. (Note that "hostname" probably isn't
  part of ClassAd, but I mean anything that uniquely identifies each
  node in a cluster)

There's no easy way to say "once for each host". Even ignoring the fact that the list of hosts is dynamic, there isn't a way. I would probably script this: make a script that takes a list of hostnames (or discovers it), then submits one job for each host with requirements set to run there. You could cluster them in a DAG, if you like, though it's not necessary.
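The script Alain describes might look something like the sketch below. It assumes a fixed host list and a hypothetical probe.sh executable; `Machine` is the standard machine ClassAd attribute holding the hostname. In a real pool you would uncomment the condor_submit line:

```shell
#!/bin/sh
# Sketch: generate (and optionally submit) one job per execute host.
# Host names and probe.sh are assumptions for illustration.
for host in node01 node02 node03; do
  # Write a submit description pinned to this host via its requirements.
  cat > "job_$host.sub" <<EOF
universe     = vanilla
executable   = probe.sh
requirements = (Machine == "$host")
output       = probe_$host.out
error        = probe_$host.err
log          = probe.log
queue
EOF
  # condor_submit "job_$host.sub"   # uncomment on a machine with Condor
done
```

The host list could equally be discovered dynamically, e.g. by parsing condor_status output, and the generated submit files could then be clustered into a DAG if ordering matters.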

3. Would it be possible to remove a resource provider (a machine) from
  a cluster, but only once its current jobs have completed, as well as
  all the other dependent jobs defined by the pending DAG input
  files? Here's an example:

  # Filename: A.dag
  JOB A A.condor
  JOB B B.condor
  PARENT A CHILD B

  So, if a node is in the middle of running job A, I would like to be
  notified somehow when job B has completed. However, I don't necessarily
  want to hard-code that I'm waiting for job B to complete; I would rather
  express it abstractly: tell me when the current jobs and their dependents
  have completed.

I'm a bit confused about what you're asking for, but you can tell Condor to shut down on a machine by using "condor_off -peaceful". It will wait for the running jobs to finish. Make sense?

This is outside of DAGMan--I'm a bit confused by how you're combining DAGMan with removing machines from a cluster.
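For concreteness, draining one machine might look like the sketch below. The hostname is a made-up placeholder, and the command is echoed rather than run so the sketch works without a Condor installation; on a real pool you would run the condor_off line directly:

```shell
# -peaceful tells the machine's daemons to let running jobs finish
# before shutting down, rather than evicting them.
# "exec-node01.example.com" is a hypothetical hostname.
HOST="exec-node01.example.com"
echo "condor_off -peaceful $HOST"
```

Note that this only waits for jobs currently running on that machine; it knows nothing about downstream DAG nodes that haven't started yet.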

-alain
