
Re: [Condor-users] Lazy jobs that never really start running



On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> I forgot to add that I'm using 6.7.8 on Windows machines.

Have you run condor_store_cred?

Have you changed your password since you last did?

It is an annoying flaw/bug/gripe with the Windows functionality: if
your credential is stored wrongly, the jobs in the queue will continue
to match and attempt to run on a machine. The shadow is started on the
local machine as you but barfs, and the job gets kicked off the machine
it matched (after sitting there for a bit wasting time). Rinse,
repeat.

I recommend that any Windows pool run the following command on a
regular* basis:

condor_q -global -constraint "JobRunCount>=500"

This can, however, produce false positives if you have long-running
jobs that can checkpoint, since those may legitimately be restarted
many times. It does tend to spot people who have not re-run
condor_store_cred since a password change very nicely.
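
If you want that check to tell you who to chase, something along these
lines might do as a starting point. It is only a sketch: it assumes
condor_q is on the PATH, that the Owner attribute is set on every job,
and that with -format the output contains nothing but the formatted
lines (worth checking on your version). The function names and the
report wording are made up for illustration.

```python
import subprocess
from collections import Counter

THRESHOLD = 500  # same cutoff as the condor_q command above


def build_command(threshold=THRESHOLD):
    """Build the condor_q call, printing one Owner per matching job."""
    return [
        "condor_q", "-global",
        "-constraint", f"JobRunCount>={threshold}",
        "-format", r"%s\n", "Owner",
    ]


def count_by_owner(output):
    """Count matching jobs per owner, ignoring blank lines."""
    return Counter(
        line.strip() for line in output.splitlines() if line.strip()
    )


def main():
    # Assumption: condor_q is installed and can reach the schedds.
    result = subprocess.run(build_command(), capture_output=True, text=True)
    for owner, n in count_by_owner(result.stdout).most_common():
        print(f"{owner}: {n} job(s) with JobRunCount >= {THRESHOLD}")


if __name__ == "__main__":
    main()
```

Run it from whatever scheduler you already use. An owner with many
jobs over the threshold is a likely candidate for a stale credential,
subject to the checkpointing caveat above.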

Matt

* Talking hourly here, since it does put a load on the schedds.