
[Condor-users] Marking child as DONE



Dear all,

After a DAG has run partway through, I've decided that the bottom-most post-processing jobs (several thousand of them) should not, or cannot, be run. When my rescue DAG comes, as it inevitably does, I would like not to execute them. So far, no problem; a one-line bash/sed invocation takes care of that:

sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done"

The problem is that not all of the parents have completed successfully. I'd like to resubmit the parents, but not these children. When I naively mark them as DONE, as above, I get the following error while dagman parses the DAG.

3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 ) failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child may not be given a new STATUS_READY parent

Removing the JOB lines produces an error that the parent-child relationships refer to a non-existent job. (I don't have the exact message handy.)

I see a few solutions, none of which I like:
* resubmit without modification and let the children fail (wastes resources)
* change the submit files to point to /bin/true and run in the local universe (a lot of scheduling overhead, I'd think, but maybe this is negligible)
* identify all nodes of a class and remove all references to each of them (more code than I want to write at the moment)
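
For what it's worth, the third option might be sketched roughly as below. This is only a sketch under assumptions: prune_dag_nodes is a made-up function name, the DAG keywords are assumed to be upper case with one statement per line, and only JOB, PARENT/CHILD, SCRIPT and lines that name the node in their second field (RETRY, VARS and the like) are handled. It has not been tried against a real rescue DAG.

```shell
#!/usr/bin/env bash
# Sketch: drop every node whose JOB line names a matching submit file,
# together with references to those nodes, from a DAGMan input file.
# prune_dag_nodes and its two arguments are hypothetical, for illustration.
prune_dag_nodes() {
    local pattern=$1 dag=$2
    # awk reads the file twice: pass 1 collects the node names to drop,
    # pass 2 prints every statement that does not refer to them.
    awk -v pat="$pattern" '
        # Pass 1: remember nodes whose submit file matches the pattern.
        NR == FNR { if ($1 == "JOB" && $3 ~ pat) drop[$2] = 1; next }
        # Pass 2: rebuild PARENT ... CHILD ... lines without dropped nodes.
        $1 == "PARENT" {
            parents = ""; children = ""; side = "p"
            for (i = 2; i <= NF; i++) {
                if ($i == "CHILD") { side = "c"; continue }
                if ($i in drop) continue
                if (side == "p") parents = parents " " $i
                else children = children " " $i
            }
            # Keep the relationship only if both sides still have members.
            if (parents != "" && children != "")
                print "PARENT" parents " CHILD" children
            next
        }
        # SCRIPT PRE|POST <node> ... names the node in field 3.
        $1 == "SCRIPT" { if (!($3 in drop)) print; next }
        # JOB, RETRY, VARS and similar lines name the node in field 2.
        !($2 in drop) { print }
    ' "$dag" "$dag"
}
```

It would be called along the lines of: prune_dag_nodes mysubfile "$f" > "${f}.pruned"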

Can I get some gut reactions to these options or perhaps new, cleverer options?

Thanks,
Nick

===================================
Nickolas Fotopoulos
nvf@xxxxxxxxxxxxxxxxxxxx

Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================