Re: [Condor-users] Marking child as DONE



Hi Nick,

Which cluster were you running on? We can open a problem
report with the Condor team, but we need to know which
versions of Condor and DAGMan were being used.

Thanks,

Scott

>  Scott,
> 
>  The noop route seems the most appealing.  I tried it and it appeared to work
>  for a while, but I think I hit a Condor bug about 200 jobs into the noop
>  run:
> 
>  ...
>  ...
>  3/15 17:20:03 submitting: condor_submit -a dag_node_name' '=' 
>  '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a 
>  DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:' 
>  '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '=' 
>  'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a macrosummary' 
>  '=' 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a 
>  macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '=' 
>  'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a 
>  macroifocut' '=' 'L1 -a +DAGParentNodeNames' '=' 
>  '"eb25ace1446075cf9f2d9f0eb93e0ae6" 
>  injections21.sire.GRB070714B_injections21.sub
>  3/15 17:20:04 From submit: Submitting job(s).
>  3/15 17:20:04 From submit: Logging submit event(s).
>  3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
>  3/15 17:20:04   assigned Condor ID (1653253.0)
>  ...
>  ...
>  3/15 17:20:04 Number of idle job procs: 0
>  3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node 
>  97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
>  3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
>  3/15 17:20:04 BAD EVENT is warning only
>  3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)" at 
>  line 3024 in file dag.C
> 
> 
> 
>  On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:
> 
> > Hi Nick,
> >
> > Rather than setting 'executable = /bin/true' you could add to
> > the submit file 'hold = True'. The child jobs will then be submitted
> > and held and will not run unless you explicitly call
> > condor_release on them.
> >
> > In a similar way you could set 'noop_job = True' for the child
> > jobs and the jobs will simply be marked as completed with a
> > return value of 0.
> >
> > Scott
> >
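For anyone wanting to try the 'noop_job = True' suggestion above on many submit files at once, here is a minimal Python sketch. The helper name add_noop is hypothetical, and the sketch assumes that each submit file has an ordinary 'queue' statement and that the setting belongs before it. Given the assertion failure Nick reports in his follow-up, it is probably worth testing on a small DAG first.

```python
def add_noop(submit_text):
    """Return submit_text with 'noop_job = True' placed before the first queue statement.

    Hypothetical helper, not part of Condor. The text is returned unchanged
    if a noop_job setting is already present.
    """
    lines = submit_text.splitlines()

    # Leave the file alone if noop_job is already set.
    for line in lines:
        if line.split("=")[0].strip().lower() == "noop_job":
            return submit_text

    out = []
    inserted = False
    for line in lines:
        tokens = line.split()
        first = tokens[0].lower() if tokens else ""
        # Submit commands only affect queue statements that follow them.
        if not inserted and first == "queue":
            out.append("noop_job = True")
            inserted = True
        out.append(line)

    # No queue statement found: append the setting at the end.
    if not inserted:
        out.append("noop_job = True")
    return "\n".join(out) + "\n"
```

The same approach should work for the 'hold = True' variant by changing the inserted line.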
> >> Dear all,
> >>
> >> After a DAG has run partway through, I've decided that the bottom-most
> >> post-processing jobs (several thousand of them) should not or cannot be run.
> >> When my rescue DAG comes, as it inevitably does, I would like not to
> >> execute these.  So far, no problem; a one-line bash/sed invocation
> >> takes care of that:
> >>
> >> sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done";
> >>
> >> The problem is that not all of the parents have completed
> >> successfully.  I'd like to resubmit the parents, but not these
> >> children.  When I naively mark them as DONE, as above, I get the
> >> following error while dagman parses the DAG.
> >>
> >> 3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 ) failed
> >> for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child may not be
> >> given a new STATUS_READY parent
> >>
> >> Removing the JOB lines produces an error that the parent-child
> >> relationships refer to a non-existent job.  (I don't have the exact
> >> message handy.)
> >>
> >> I see a few solutions, none of which I like:
> >> * resubmit without modification and let the children fail (wastes
> >> resources)
> >> * change the submit files to point to /bin/true and run in the local
> >> universe (a lot of scheduling overhead, I'd think, but maybe this is
> >> negligible)
> >> * identify all nodes of a class and remove all references to each of
> >> them (more code than I want to write at the moment)
> >>
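The third option in the list above (dropping a class of nodes and every reference to them) may be less code than it sounds. Below is a rough Python sketch. The function name prune_dag and the keyword list are assumptions, not anything DAGMan provides. It is only safe for leaf nodes like the post-processing jobs described here; removing an interior node would disconnect its parents from its children.

```python
import re

# Keywords whose second token is a node name (an assumed, non-exhaustive list).
NODE_KEYWORDS = {"JOB", "VARS", "RETRY", "PRIORITY", "ABORT-DAG-ON"}


def prune_dag(lines, subfile_pattern):
    """Drop every node whose JOB line names a submit file matching the pattern.

    `lines` are DAG file lines without trailing newlines.
    """
    pattern = re.compile(subfile_pattern)

    # Pass 1: collect the names of the nodes to drop.
    doomed = set()
    for line in lines:
        tokens = line.split()
        if len(tokens) >= 3 and tokens[0].upper() == "JOB" and pattern.search(tokens[2]):
            doomed.add(tokens[1])

    # Pass 2: rewrite the DAG without those nodes.
    kept = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            kept.append(line)
            continue
        keyword = tokens[0].upper()
        if keyword in NODE_KEYWORDS and len(tokens) > 1 and tokens[1] in doomed:
            continue
        # SCRIPT PRE|POST <node> ...: the node name is the third token.
        if keyword == "SCRIPT" and len(tokens) > 2 and tokens[2] in doomed:
            continue
        if keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" in upper:
                split = upper.index("CHILD")
                parents = [t for t in tokens[1:split] if t not in doomed]
                children = [t for t in tokens[split + 1:] if t not in doomed]
                # Keep the dependency only if both sides still have nodes.
                if parents and children:
                    kept.append("PARENT " + " ".join(parents)
                                + " CHILD " + " ".join(children))
                continue
        kept.append(line)
    return kept
```

Something like prune_dag(open(f).read().splitlines(), "mysubfile") would then return the lines of a rescue DAG with those nodes gone, so the parents can be resubmitted without their children.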
> >> Can I get some gut reactions to these options or perhaps new, cleverer
> >> options?
> >>
> >> Thanks,
> >> Nick
> >>
> >> ===================================
> >> Nickolas Fotopoulos
> >> nvf@xxxxxxxxxxxxxxxxxxxx
> >>
> >> Office: (414) 229-6438
> >> Fax: (414) 229-5589
> >> University of Wisconsin - Milwaukee
> >> Physics Bldg, Rm 471
> >> ===================================
> >>
> 
>  ===================================
>  Nickolas Fotopoulos
>  nvf@xxxxxxxxxxxxxxxxxxxx
> 
>  Office: (414) 229-6438
>  Fax: (414) 229-5589
>  University of Wisconsin - Milwaukee
>  Physics Bldg, Rm 471
>  ===================================