
Re: [Condor-users] Lazy jobs that never really start running

On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> >Have you done a condor_store_cred ?
> >have you changed your password since you last did...
> No, nothing like that.
> The situation is quite simple:
> I submit a few DAGMan jobs to the queue that spawn about 2000 jobs.
> Hours later I take a look at the queue and find some tasks that never got matched,
> just sitting there idle or pretending to run (without any shadow).
> There were no configuration changes in the meantime; all jobs had the same "chance"
> and priority to run.
> If I restart the computer that submitted the tasks, things seem to catch up and
> continue computing, but this is a very annoying, brute-force solution.
> After enabling full debug logging for the scheduler I found this:
> 7/6 12:46:53 Reached MAX_JOBS_RUNNING: no more can run, 0 jobs matched, 41 jobs idle
> Which is funny because only four jobs were actually running and 8 were thought to be running.

How many compute nodes do you have? How much work is involved in
starting/finishing* the jobs?

Your schedd machine may be dying under the load... 200 concurrent jobs
is optimistic on Windows without some special tweaks in the registry.
I would suggest 100 as a more realistic maximum to try initially.
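If you want to try that cap, the knob is MAX_JOBS_RUNNING in the schedd's
configuration. A sketch (the file location varies per install; check your
condor_config / local config file on the submit machine):

```
# Local config on the submit machine: cap the number of jobs
# this schedd will run concurrently.
MAX_JOBS_RUNNING = 100
```

Then run condor_reconfig on the submit machine so the schedd picks it up.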

Hard to say without more logs. Was there anything in the ShadowLog? Is
your disk filling up?
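One way to watch for the symptom without restarting is to scan the SchedLog
for that MAX_JOBS_RUNNING warning and track the reported counts. A minimal
sketch (the log path and SCHEDD_LOG name are from a typical install and may
differ on yours):

```python
import re

# Matches the warning quoted above, e.g.:
# 7/6 12:46:53 Reached MAX_JOBS_RUNNING: no more can run, 0 jobs matched, 41 jobs idle
PATTERN = re.compile(
    r"Reached MAX_JOBS_RUNNING: no more can run, "
    r"(\d+) jobs matched, (\d+) jobs idle"
)

def scan_sched_log(lines):
    """Return a (matched, idle) tuple for every MAX_JOBS_RUNNING warning."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            hits.append((int(m.group(1)), int(m.group(2))))
    return hits

# Demo on the line from the original report; in practice, pass in
# open(<path from SCHEDD_LOG in your condor_config>) instead.
sample = ("7/6 12:46:53 Reached MAX_JOBS_RUNNING: "
          "no more can run, 0 jobs matched, 41 jobs idle")
print(scan_sched_log([sample]))  # [(0, 41)]
```

If the warning fires while far fewer jobs are really running, that points at
stale shadow bookkeeping rather than a genuinely full schedd.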


* size of files staged to the machine, size of files returned on completion