
Re: [Condor-users] Lazy jobs that never really start running



On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> >You're transferring dagman itself? Why?
> 
> I use the default dagman_submit command, and that creates a submit file that
> transfers the executable by default. At least that's how it looks to me...

I'm no DAGMan expert (or even well informed), but the DAGMan executable
runs locally as a scheduler universe job and submits jobs which look,
in almost all respects, like the job would have been if you had
submitted it by hand. The executable transferred is not DAGMan; it is
your original executable, albeit renamed...
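
If the binary already lives on every execute machine and you want to
stop that transfer, you can say so in the node's submit description.
A minimal sketch, with hypothetical file names:

    # node_a.submit -- submit description for one DAG node
    # (the binary is already installed on the execute machines,
    # so tell Condor not to ship a copy with the job)
    universe            = vanilla
    executable          = C:\apps\render.exe
    transfer_executable = false
    log                 = node_a.log
    queue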
 
> >condor_status reports what the *collector* says. This is always
> >delayed (or plain inaccurate if there are problems with a machine as
> >it tends to fail to report the right thing).
> 
> I see. And how can I get the *real* computer info?

condor_status -direct <name>

Though a machine in that state may no longer be responding terribly well...

If you take a look at your user logs, you should see what happens
from the individual jobs' point of view (where they get matched to, etc.).
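
If your submit files don't request one yet, a user log is just one
extra line in the submit description (file name hypothetical):

    log = C:\condor_jobs\myjob.log

Each job then appends timestamped events there (submitted, executing,
evicted, terminated) which you can read directly or watch with
condor_wait.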

 
> >The machine may be losing track of the shadows.
> 
> >How about the MasterLog (reports of processes dying and the like)?
> 
> I don't see anything like that. It looks OK.

shame...
 
> >Does a condor_reconfig do the same?
> 
> No, reconfig does not fix the problem.

right.
 
> >How about net stop condor/net start condor?
> 
> I tried that but the condor process could not be stopped (that's why I had to restart
> the machine). I was kinda surprised that the jobs went along nicely, except a DAG job
> that "forgot" to submit its child tasks after it completed.

This happens sometimes when shadows go AWOL. In most cases you can
deal with it without a reboot: open Task Manager, find the
condor_master process, and use "End Process Tree". That should nail it.
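
You can do the same from a command prompt (taskkill ships with XP Pro
and later; the image name assumes a standard install):

    taskkill /F /T /IM condor_master.exe

The /T flag takes the whole process tree down with the master,
shadows included.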

Every time I have had to do this for one of my users it was down to:
1) An issue with the machine (running out of disk on a drive, or out of memory)
2) Too many jobs running at once - I limited it to 100 (we have well
over 100 nodes; see the config sketch after this list)
3) Using some older 6.7.x dev version (which had some serious performance
bugs - see posts passim by myself and Ian Chesal)
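
For (2), one knob that caps this on the submit machine is
MAX_JOBS_RUNNING; a sketch, assuming you keep overrides in your local
config file:

    # condor_config.local on the submit machine:
    # limit how many jobs (and hence shadows) the schedd
    # will run at once
    MAX_JOBS_RUNNING = 100

followed by a condor_reconfig so the schedd picks it up.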

After sorting out the above three things I never had any issues again...

> ---
> To: szabolcs.horvatth@xxxxxxxxxxxxxxxxx
> From: SYSTEM@snoopy
> Subject: [Condor] Problem
> 
> This is an automated email from the Condor system
> on machine "snoopy.digicpictures.local".  Do not reply.

Is snoopy your local machine with the schedd on it?
 
> "C:\Condor/bin/condor_schedd.exe" on "snoopy.digicpictures.local" died due to exception ACCESS_VIOLATION.

<snip>

You may wish to submit that, along with a description of your setup,
to the condor-admin@xxxxxxxxxxx mailbox.
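
If you do, it helps to attach the schedd's own log as well; assuming a
default Windows install it will be somewhere like:

    C:\Condor\log\SchedLog

(the exact location is whatever SCHEDD_LOG is set to in your
condor_config).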

Though I think you may find this is a machine issue...
Have you tried 6.6.10 instead (assuming you don't absolutely require
the features in the 6.7 series)?
Have you tried placing the submitter on a separate machine?

Matt