
Re: [Condor-users] Marking child as DONE



Scott,

The noop route seems the most appealing. I tried it, and it appeared to work for a while, but I think I hit a Condor bug roughly 200 noop jobs in:

...
...
3/15 17:20:03 submitting: condor_submit -a dag_node_name' '=' '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:' '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a macrosummary' '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '=' 'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a macroifocut' '=' 'L1 -a +DAGParentNodeNames' '=' '"eb25ace1446075cf9f2d9f0eb93e0ae6" injections21.sire.GRB070714B_injections21.sub
3/15 17:20:04 From submit: Submitting job(s).
3/15 17:20:04 From submit: Logging submit event(s).
3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
3/15 17:20:04   assigned Condor ID (1653253.0)
...
...
3/15 17:20:04 Number of idle job procs: 0
3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node 97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
3/15 17:20:04 BAD EVENT is warning only
3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at line 3024 in file dag.C



On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:

Hi Nick,

Rather than setting 'executable = /bin/true' you could add to
the submit file 'hold = True'. The child jobs will then be submitted
and held and will not run unless you explicitly call
condor_release on them.

In a similar way you could set 'noop_job = True' for the child
jobs and the jobs will simply be marked as completed with a
return value of 0.
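
For concreteness, the two submit-file additions described above might look like the fragment below. This is only a sketch (check the condor_submit manual for your Condor version), and only one of the two lines would be used at a time:

```
# Option 1: submit the child jobs on hold; they sit idle until
# someone explicitly runs condor_release on them.
hold = True

# Option 2: mark the child jobs as no-ops; Condor logs a terminate
# event with exit status 0 without ever running the executable.
noop_job = True
```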

Scott

Dear all,

After a DAG has run partway through, I've decided that the bottom-most
post-processing jobs (several thousand of them) should not, and cannot, be run.
When my rescue DAG comes, as it inevitably does, I would like not to
execute these.  So far, no problem; a one-line bash/sed invocation
takes care of that:

sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done";

The problem is that not all of the parents have completed
successfully.  I'd like to resubmit the parents, but not these
children.  When I naively mark them as DONE, as above, I get the
following error while dagman parses the DAG.

3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 )
failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child
may not be given a new STATUS_READY parent

Removing the JOB lines produces an error that the parent-child
relationships refer to a non-existent job.  (I don't have the exact
message handy.)

I see a few solutions, none of which I like:
* resubmit without modification and let the children fail (wastes
resources)
* change the submit files to point to /bin/true and run in the local
universe (a lot of scheduling overhead, I'd think, but maybe this is
negligible)
* identify all nodes of a class and remove all references to each of
them (more code than I want to write at the moment)
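
(For the third option, a rough sketch of the idea, against a toy DAG file; mydag.dag, the node names, and mysubfile.sub are all made up for illustration. A first awk pass collects the node names whose JOB lines reference the unwanted submit file; a second pass deletes those JOB lines and scrubs the names out of PARENT ... CHILD lines, dropping any dependency line left with no parents or no children.)

```shell
# Hypothetical input DAG, for illustration only.
cat > mydag.dag <<'EOF'
JOB parent1 inspiral.sub
JOB child1 mysubfile.sub
JOB child2 mysubfile.sub
PARENT parent1 CHILD child1 child2
EOF

# Pass 1: names of nodes whose JOB line references mysubfile.
awk '$1 == "JOB" && /mysubfile/ { print $2 }' mydag.dag > doomed_nodes

# Pass 2: drop those JOB lines and scrub the names from PARENT lines.
# (NR == FNR is true only while reading the first file, doomed_nodes;
# this assumes doomed_nodes is non-empty.)
awk 'NR == FNR { doomed[$1]; next }
     $1 == "JOB" && ($2 in doomed) { next }
     $1 == "PARENT" {
         out = ""
         for (i = 1; i <= NF; i++)
             if (!($i in doomed))
                 out = out (out ? " " : "") $i
         # keep the line only if parents and children both survive
         if (out ~ /^PARENT .+ CHILD .+/) print out
         next
     }
     { print }' doomed_nodes mydag.dag > mydag.dag.pruned
```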

Can I get some gut reactions to these options or perhaps new, cleverer
options?

Thanks,
Nick

===================================
Nickolas Fotopoulos
nvf@xxxxxxxxxxxxxxxxxxxx

Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
