Re: [Condor-users] Lazy jobs that never really start running



On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> >I'm no DAGMan expert (or even well informed) but the DAGMan executable
> >runs locally as a scheduling universe job and submits jobs which look
> >in almost all respects like the job would have been if you had submitted it by hand.
> 
> Yes, it runs locally, but since it is submitted to the queue it also goes into the spool
> directory of condor, as a renamed executable.

I was interpreting your original message to say that the jobs that
DAGMan was submitting were themselves DAGMan executables (in some
horrific recursive multi-machine thrashing setup if the execute nodes
had schedds running on them as well). I take it this is not the case.
 
> >condor_status -direct <name>
> 
> In my case I did not see any significant difference.

right ho, didn't expect anything but worth checking.
 
> >If you take a look at your user logs then you should see what happens
> >according to the individual jobs (where they get matched to etc.)
> 
> They don't get matched, that is the problem. The matching process does not care about them at all,
> as if the job was deleted from the scheduling database.

For a shadow to have been fired up there must have been a match. If
you look in the NegotiatorLog you should see the negotiation assigning
the jobs to the machines they were matched with.
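
If the log is large, a small script can pull out the lines that mention
one job. This is only a rough sketch: the function name is made up for
illustration, the log path and job id are whatever applies on your
machine, and it assumes nothing about the NegotiatorLog line format
beyond the job id appearing in the text as cluster.proc.

```python
import re


def find_job_lines(log_path, job_id):
    """Return (line_number, text) pairs for log lines that mention job_id.

    job_id is a plain string such as "12.0" (cluster.proc). The pattern
    refuses a digit directly before or after the id, so that "12.0" does
    not also match "112.0" or "12.05".
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(job_id) + r"(?!\d)")
    matches = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, "r", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if pattern.search(line):
                matches.append((number, line.rstrip("\n")))
    return matches
```

Calling find_job_lines(path_to_your_negotiator_log, "12.0") and printing
the result should show whether the negotiator ever mentioned that job.
An empty list would suggest it was never considered in the part of the
log that is still on disk.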
 
> >This happens sometimes when shadows go awol. You can deal with it
> >without a reboot in most cases by using task manager. Finding the
> >condor_master and using kill process tree. That should nail it.
> 
> I'll give it a try. I was a bit afraid of shutting down the process so drastically.
> And since this problem happened pretty frequently it is not really a working solution,
> just a quick hack once in a while.

yes - just saves shutting the whole box down
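
If this has to be done often, the kill-process-tree step can be
scripted instead of clicked through in task manager. A minimal sketch,
assuming a Windows box where the stock taskkill utility is available.
The function names are hypothetical, and you still have to find the pid
of the condor_master yourself (for example from task manager).

```python
import subprocess


def build_kill_tree_command(pid):
    """Build the taskkill command line for a process and its children."""
    # /PID selects the process, /T also terminates its child processes,
    # /F forces termination.
    return ["taskkill", "/PID", str(pid), "/T", "/F"]


def kill_process_tree(pid):
    """Run taskkill for pid and return its exit code (0 means success)."""
    completed = subprocess.run(build_kill_tree_command(pid), check=False)
    return completed.returncode
```

This is as blunt as doing it by hand: everything under the master goes
down with it, so it is still a workaround rather than a fix.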


> Very good news! Disk and memory might not be an issue but I'll limit the job count and see what happens.

You may also wish to increase the delay between launching shadows
(though this may require lengthening the claim timeout to avoid
wasting claims).
  

> >Have you tried 6.6.10 instead (assuming you don't absolutely require
> >the features in the 6.7 series)
> 
> I just switched from 6.7.8 to 6.6.10. I'm very curious what will happen.
> Although I'll miss the STARTD_EXPRS per vm very much...

If it works, pray to $(DEITY_OF_CHOICE) for 6.8 :)

Matt