
Re: [Condor-users] Lazy jobs that never really start running



>I'm no DAGMan expert (or even well informed) but the DAGMan executable
>runs locally as a scheduling universe job and submits jobs which look
>in almost all respects like the job would have been if you had submitted it by hand.

Yes, it runs locally, but since it is submitted to the queue, a renamed copy of
the executable also goes into Condor's spool directory.

>condor_status -direct <name>

In my case I did not see any significant difference.


>If you take a look at your user logs then you should see what happens
>according to the individual jobs (where they get matched to etc.)

They don't get matched; that is the problem. The matching process ignores them completely,
as if the jobs had been deleted from the scheduling database.
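
To spot such jobs from the user log without reading it by hand, something like the following could help. It is only a rough sketch: it assumes the usual user log layout, where each event starts with a three-digit code and a (cluster.proc.subproc) id, with 000 = submitted, 001 = executing and 009 = aborted. The function name and the sample log text are made up for illustration.

```python
import re

# Event header lines in a Condor user log are assumed to look like:
#   000 (101.000.000) 07/10 12:00:00 Job submitted from host: <...>
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")


def never_started(log_text):
    """Return sorted (cluster, proc) ids that were submitted but have
    neither an execute event nor an abort event in the log."""
    submitted, started, aborted = set(), set(), set()
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # detail or separator line, not an event header
        code = m.group(1)
        job = (int(m.group(2)), int(m.group(3)))
        if code == "000":      # job submitted
            submitted.add(job)
        elif code == "001":    # job started executing
            started.add(job)
        elif code == "009":    # job aborted (removed from the queue)
            aborted.add(job)
    return sorted(submitted - started - aborted)


# Hypothetical log excerpt: job 101.0 starts, job 102.0 never does.
SAMPLE = """\
000 (101.000.000) 07/10 12:00:00 Job submitted from host: <10.0.0.1:1234>
...
001 (101.000.000) 07/10 12:01:00 Job executing on host: <10.0.0.2:5678>
...
000 (102.000.000) 07/10 12:00:05 Job submitted from host: <10.0.0.1:1234>
...
"""

if __name__ == "__main__":
    print(never_started(SAMPLE))
```

Jobs that stay on this list for a long time while machines sit idle are the "lazy" ones described above.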

 
>This happens sometimes when shadows go awol. You can deal with it
>without a reboot in most cases by using task manager. Finding the
>condor_master and using kill process tree. That should nail it.

I'll give it a try. I was a bit afraid of shutting down the process so drastically.
And since this problem happens pretty frequently, it is not really a working solution,
just a quick hack once in a while.


>Every time I have had to do this for one of my users it was down to:
>1) An issue with the machine (running out of disk on a drive or out of memory)
>2) Too many jobs running at once - I limited it to 100 (we have well
>over 100 nodes)
>3) Using some older 6.7.x dev version (had some serious performance
>bugs - see posts passim by myself and Ian Chesal)
>After sorting the above 3 things I never had any issues again..

Very good news! Disk and memory might not be the issue, but I'll limit the job count and see what happens.
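
One way to apply such a limit, assuming the jobs go through DAGMan, is the -maxjobs option of condor_submit_dag, which caps how many jobs DAGMan keeps submitted at once. A minimal sketch that builds the command line; the helper name and the DAG file name are hypothetical, and the default of 100 is simply the figure quoted above:

```python
def dag_submit_command(dag_file, max_jobs=100):
    """Build a condor_submit_dag command line that throttles the number
    of jobs DAGMan keeps submitted at any one time."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    return ["condor_submit_dag", "-maxjobs", str(max_jobs), dag_file]


if __name__ == "__main__":
    # "render.dag" is a placeholder file name. To actually submit, the list
    # could be passed to subprocess.run(cmd, check=True).
    cmd = dag_submit_command("render.dag")
    print(" ".join(cmd))
```

If the limit Matt meant is a schedd-wide one instead, that would be a setting in the Condor configuration rather than a DAGMan option.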


>is snoopy your local machine with the schedd on it?

Yep, all workstations are named after animation film characters. :)


>You may wish to submit that with a description to the admin
>condor-admin@xxxxxxxxxxx mailbox.

I'll do so.


>Have you tried 6.6.10 instead (assuming you don't absolutely require
>the features in the 6.7 series)

I just switched from 6.7.8 to 6.6.10. I'm very curious to see what happens.
Although I'll miss the per-VM STARTD_EXPRS very much...

Thanks for your help Matt.

Cheers,
Szabolcs