
Re: [Condor-users] Lazy jobs that never really start running



>How many computing nodes do you have?

28 computers, 56 processors at the moment, dedicated to computing.

> How much effort is it to start/finish* the jobs?

All files are read from a mapped network drive; the only data actually transferred is
the dagman application (by default) and a small batch file (~500 bytes) that launches
the computation for each job.
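
To be concrete, each job's submit description is roughly of this shape (file names here
are placeholders, not the real ones):

    universe                = vanilla
    # the ~500-byte launcher batch file; Condor transfers the executable itself
    executable              = run_job.bat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # input data is read directly from the mapped network drive,
    # so no transfer_input_files line is needed
    log     = job.log
    output  = job.out
    error   = job.err
    queue

so the per-job transfer overhead really is negligible.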


>Your schedd machine may be dying under the load... 200 concurrent jobs
>is optimistic on windows without some special tweaks in the registry.
>I would suggest 100 is a more realistic maximum to try initially.

The max jobs limit of 200 is indeed optimistic, but since I only have 56 processors
there is no way more than that number of jobs can run at the same time.
(Shadows are only started once a job is matched and launched on a machine, am I right?)
The strange thing was that Condor complained about this max jobs limit in the log file
while condor_status reported only 4 running jobs.
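
(A quick way to cross-check the two views next time it happens, if I understand the
tools correctly, would be something like

    condor_config_val MAX_JOBS_RUNNING
    condor_q
    condor_status -claimed

on the submit machine: the first shows the limit the schedd is actually using,
presumably the MAX_JOBS_RUNNING knob behind that log message, the second shows the jobs
and their states as the schedd sees them, and the third shows the slots the collector
believes are claimed.)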


>Hard to say without more logs. Was there anything in the ShadowLog?

No, there were no problems in the ShadowLog. Since no shadows were launched at all,
the log remained empty. The problem might be somewhere in the matchmaking department.
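
If it happens again before anyone restarts the machine, I suppose the thing to check
would be

    condor_q -analyze <cluster>.<proc>

on one of the stuck jobs, plus the NegotiatorLog on the central manager, to see whether
the negotiator considers the jobs matchable at all (assuming it really is a
matchmaking issue).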

The strangest thing is that restarting the scheduling machine did fix the problem, so
configuration issues, job/machine requirements, and hardware limitations are all out of the question.
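
(In case it helps narrow things down next time: if I remember the tools correctly,
restarting only the schedd daemon with

    condor_restart -schedd

on the submit host, instead of rebooting the whole machine, should show whether it is
the schedd itself that gets wedged.)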

Cheers,
Szabolcs