Re: [Condor-users] Marking child as DONE



Scott,

This is at UWM's Nemo cluster. Correct me if I'm wrong, but I think it is running mostly Condor 6.9.4, with pre-release 7.0.0 dagman binaries.

Thanks,
Nick

On Mar 15, 2008, at 3:57 PM, Scott Koranda wrote:

Hi Nick,

Which cluster were you running on? We can open a problem
report with the Condor team, but we need to know which
versions of Condor and DAGMan were being used.

Thanks,

Scott

Scott,

The noop route seems the most appealing. I tried it and it appeared to
work for a while, but I think I ran into a Condor bug about 200 jobs
into the noop jobs:

...
...
3/15 17:20:03 submitting: condor_submit -a dag_node_name' '='
'97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a
DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:'
'97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '='
'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a macrosummary'
'=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a
macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '='
'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a
macroifocut' '=' 'L1 -a +DAGParentNodeNames' '='
'"eb25ace1446075cf9f2d9f0eb93e0ae6"
injections21.sire.GRB070714B_injections21.sub
3/15 17:20:04 From submit: Submitting job(s).
3/15 17:20:04 From submit: Logging submit event(s).
3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
3/15 17:20:04   assigned Condor ID (1653253.0)
...
...
3/15 17:20:04 Number of idle job procs: 0
3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node
97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
3/15 17:20:04 BAD EVENT is warning only
3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at
line 3024 in file dag.C



On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:

Hi Nick,

Rather than setting 'executable = /bin/true' you could add to
the submit file 'hold = True'. The child jobs will then be submitted
and held and will not run unless you explicitly call
condor_release on them.

In a similar way you could set 'noop_job = True' for the child
jobs and the jobs will simply be marked as completed with a
return value of 0.
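
A minimal sketch of the noop approach, assuming each child submit file contains a single `queue` statement at the start of a line. The file name `child.sub` and its contents are hypothetical and only serve to make the example self-contained. It inserts `noop_job = True` just before `queue` and writes the result to a new file, leaving the original untouched.

```shell
#!/bin/sh
# Hypothetical child submit file, created here only for illustration.
workdir=$(mktemp -d)
sub="$workdir/child.sub"
cat > "$sub" <<'EOF'
universe = vanilla
executable = /path/to/postprocess
queue
EOF

# Print "noop_job = True" immediately before the queue statement,
# then print every original line unchanged.
awk '/^[Qq]ueue/ { print "noop_job = True" } { print }' "$sub" > "$sub.noop"

cat "$sub.noop"
```

Substituting `hold = True` for `noop_job = True` in the awk command should give the held-job variant instead.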

Scott

Dear all,

After a DAG has run partway through, I've decided that the bottom-most
post-processing jobs (several thousand of them) should not be run. When
my rescue DAG comes, as it inevitably does, I would like not to
execute these.  So far, no problem; a one-line bash/sed invocation
takes care of that:

sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done";

The problem is that not all of the parents have completed
successfully.  I'd like to resubmit the parents, but not these
children.  When I naively mark them as DONE, as above, I get the
following error while dagman parses the DAG.

3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 )
failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child
may not be given a new STATUS_READY parent

Removing the JOB lines produces an error that the parent-child
relationships refer to a non-existent job.  (I don't have the exact
message handy.)

I see a few solutions, none of which I like:
* resubmit without modification and let the children fail (wastes
  resources)
* change the submit files to point to /bin/true and run in the local
  universe (a lot of scheduling overhead, I'd think, but maybe this is
  negligible)
* identify all nodes of a class and remove all references to each of
  them (more code than I want to write at the moment)
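
The third option can be sketched in a few lines of awk. This is only an illustration under stated assumptions: the DAG uses upper-case `JOB`, `RETRY`, `VARS`, and `PARENT ... CHILD` lines, the nodes to drop are leaves (nothing depends on them), and the sample DAG with node names `A`, `B`, `C` is hypothetical. Other per-node keywords such as `SCRIPT` are not handled.

```shell
#!/bin/sh
# Hypothetical DAG file, created here only for illustration.
workdir=$(mktemp -d)
dag="$workdir/example.dag"
cat > "$dag" <<'EOF'
JOB A inspiral.sub
JOB B mysubfile.sub
RETRY B 1
VARS B macrousertag="x"
JOB C inspiral.sub
PARENT A CHILD B C
PARENT C CHILD B
EOF

# The DAG is read twice: once to find the nodes to drop, once to filter.
awk -v pat='mysubfile' '
  # First pass: remember nodes whose submit file matches the pattern.
  NR == FNR { if ($1 == "JOB" && $3 ~ pat) drop[$2] = 1; next }
  # Second pass: skip per-node lines that belong to dropped nodes.
  ($1 == "JOB" || $1 == "RETRY" || $1 == "VARS") && ($2 in drop) { next }
  # Rebuild each dependency line without the dropped nodes.
  $1 == "PARENT" {
    out = "PARENT"; side = "parents"; np = 0; nc = 0
    for (i = 2; i <= NF; i++) {
      if ($i == "CHILD") { side = "children"; out = out " CHILD"; continue }
      if ($i in drop) continue
      out = out " " $i
      if (side == "parents") np++; else nc++
    }
    # Keep the line only if it still has a parent and a child.
    if (np > 0 && nc > 0) print out
    next
  }
  { print }
' "$dag" "$dag" > "$dag.pruned"

cat "$dag.pruned"
```

Because the dropped nodes are assumed to be leaves, removing them does not break any dependency between the remaining nodes; pruning interior nodes would require reconnecting their parents to their children.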

Can I get some gut reactions to these options or perhaps new, cleverer
options?

Thanks,
Nick

===================================
Nickolas Fotopoulos
nvf@xxxxxxxxxxxxxxxxxxxx

Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================
