
[Condor-users] Strange scheduling behavior in 6.8.0




Hi all. I've been having an intermittent problem since upgrading from 6.6.10 to 6.8.0 a few weeks ago. Here's the scenario:

We have a pool of about 40 Linux machines (some FC4, some still RH8), all running 6.8.0.

I submit a DAG with about 25 jobs. There are no inter-job dependencies, all the machines match the job criteria, and there are no other jobs running in the pool.
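
For illustration, a DAG of this shape contains only JOB lines and no PARENT/CHILD lines. The node names and submit-file names below are placeholders, not our actual files:

```
# Sketch of a DAGMan input file with independent nodes (all names are hypothetical)
JOB node01 node01.submit
JOB node02 node02.submit
JOB node03 node03.submit
# ... one JOB line per node, about 25 in total ...
# No PARENT ... CHILD ... lines, so no node waits on any other.
```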

Most of the time, all the jobs are scheduled appropriately and run simultaneously. Sometimes, however, only about 10 of the jobs get started (the exact number varies). DAGMan has submitted them all into the queue, but for some reason they aren't matched. As the first batch of jobs finishes, more are submitted, but never more than the initial count run at once.

When this behavior is occurring, "condor_status" properly lists all the machines in our pool, including the idle ones that should have been matched to jobs. If I run "condor_reschedule -all", it sends the "Reschedule" command to only those 10 or so machines that are actually running jobs. If I run "condor_restart -all", it sends the "Restart" command to all machines in the pool, at which point everything returns to normal: all the 'stuck' jobs get properly matched to machines.

Anyone else see something like this?

-Mike