
[Condor-users] recursive DAGs - how to implement ?



Dear All,

I was at Condor Week in Edinburgh a while back and I
remember that a talk on the eMinerals project mentioned
something about using recursive DAGs to resubmit
long-running jobs automatically. I think this would be really
useful here but I can't seem to find any info on how to
do it.

To give you some background: our Condor pool is only available
for around 16 hours each day, so jobs that need to run for
several days must have some way of getting their intermediate
results back and then using them as input on the next run.
At present we can look at the system clock to see if time is
about to run out and then exit gracefully so that Condor stages
the results back. The submit host then needs a cron-style script
to re-submit the job, which is messy to say the least.
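
For illustration, the clock check described above might look roughly
like the following. This is only a sketch: the closing hour, the
safety margin, and the function names are all assumptions made up for
the example, not anything Condor provides.

```python
from datetime import datetime, timedelta


def seconds_until(closing_hour, now):
    """Seconds from `now` until the next occurrence of closing_hour:00.

    `closing_hour` is a hypothetical hour of day (0-23) at which the
    pool stops being available.
    """
    closing = now.replace(hour=closing_hour, minute=0, second=0, microsecond=0)
    if closing <= now:
        # Today's closing time has already passed, so use tomorrow's.
        closing += timedelta(days=1)
    return (closing - now).total_seconds()


def should_stop(closing_hour, now, margin_s=600):
    """True when less than `margin_s` seconds remain before closing.

    The margin (assumed 10 minutes here) leaves time to write results
    and exit normally so they can be transferred back.
    """
    return seconds_until(closing_hour, now) <= margin_s
```

The job would call should_stop(closing_hour, datetime.now()) between
units of work and, when it returns True, write its intermediate
results and exit.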

One idea is to have the job run for only a few hours at a time.
If it's a time-stepping code we can easily specify the number of
steps to be taken, and the solution can be picked up again later
where we left off. If it gets killed then a couple of hours of
wallclock have been wasted - no big deal. I imagine that this could
easily be achieved with a DAG, but of course we don't know a priori
how many time steps are needed in total. This is where recursion
would be useful, if only I could work out how to do it.
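
To make the "few hours at a time" idea concrete, here is a minimal
sketch of one run of a time-stepping code that resumes from a
checkpoint file. Everything in it is an assumption for the example:
the JSON checkpoint layout, the initial state, and the names
run_chunk, step_fn and is_done are hypothetical, and the stopping
test stands in for whatever criterion the real code uses.

```python
import json
import os


def run_chunk(checkpoint_path, steps_per_run, step_fn, is_done):
    """Advance the solution by at most `steps_per_run` steps.

    Returns (state, finished). `finished` tells whoever resubmits the
    job whether another run is needed.
    """
    # Pick up where the previous run left off, or start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "value": 1.0}  # assumed initial condition

    for _ in range(steps_per_run):
        if is_done(state):
            break
        state["value"] = step_fn(state["value"])
        state["step"] += 1

    # Write to a temporary file first so that a job killed mid-write
    # cannot leave a truncated checkpoint behind.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)

    return state, is_done(state)
```

Each run only needs the checkpoint file as input and produces the
updated file as output, so the total number of runs does not have to
be known in advance; the second return value says whether to go
round again.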

Does anyone have any ideas?

cheers,

-ian.

PS pedantic point: doesn't recursion introduce a cyclic dependency?
  Why not just allow cycles in the DAG dependency graph?