
Re: [Condor-users] Lazy jobs that never really start running



>Have you done a condor_store_cred ?
>have you changed your password since you last did...

No, nothing like that.
The situation is quite simple:
I submit a few dagman jobs to the queue, which spawn about 2000 jobs.
Hours later I take a look at the queue and find some tasks that never got matched:
they just sit there idle, or pretend to be running (without any shadow).

There were no configuration changes in the meantime; all jobs had the same "chance"
and priority to run.

If I restart the computer that submitted the tasks, things seem to catch up and
continue computing, but this is a very annoying, brute-force solution.

After enabling full debug logging for the scheduler I found this:
7/6 12:46:53 Reached MAX_JOBS_RUNNING: no more can run, 0 jobs matched, 41 jobs idle

Which is funny, because only four jobs were actually running, and only eight were even marked as running.
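In case it helps anyone hitting the same wall, here is a small sketch of how to compare the schedd's configured ceiling with what the queue actually reports. The function name is mine, and it assumes `condor_config_val` and `condor_q` are on the PATH; `-run` restricts the listing to jobs marked running, and `-format` prints one field per job so they can be counted.

```shell
# Sketch: compare the schedd's MAX_JOBS_RUNNING setting against the
# number of jobs condor_q reports as running. A large gap suggests
# phantom "running" jobs (no shadow behind them) are eating the limit.
check_schedd_saturation() {
    # configured ceiling on simultaneously running jobs for this schedd
    limit=$(condor_config_val MAX_JOBS_RUNNING)
    # jobs the queue currently believes are running (one line per job)
    running=$(condor_q -run -format "%d\n" ClusterId | wc -l | tr -d '[:space:]')
    echo "MAX_JOBS_RUNNING=$limit, jobs marked running=$running"
}
```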

Cheers,
Szabolcs



>It is an annoying flaw/bug/gripe with the Windows functionality that
>if your credential is wrongly stored, the jobs in the queue will
>continue to match and attempt to run on a machine; the shadow is started
>on the local machine as you but barfs, and the job gets kicked off the
>previous machine (after sitting there for a bit, wasting time). Rinse,
>repeat.
>
>I recommend that any Windows pool run the following command on a regular*
>basis
>
>condor_q -global -constraint "JobRunCount>=500"
>
>This will however have potential false positives if you have long
>running jobs which can check point. It does however tend to spot
>people who have failed to store_cred since a password change very
>nicely.
>
>Matt
>
>* Talking hourly here, since it does put a load on the schedds
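For anyone who wants to automate Matt's hourly check above, a plain cron entry is enough. This is only a sketch: the recipient address is a placeholder, and the JobRunCount threshold is the one from Matt's example, not a tuned value.

```shell
# Hypothetical crontab entry (adjust recipient and threshold for your site):
# at the top of every hour, list jobs that have restarted suspiciously
# often and mail the output to the pool admin.
0 * * * * condor_q -global -constraint "JobRunCount >= 500" | mail -s "possible stuck Condor jobs" condor-admin@example.com
```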