
Re: [condor-users] Jobs Stop Migrating





Joel Hernandez wrote:

We have two clusters, louie and duey. Users submit their jobs on the louie cluster. When all the nodes on louie are busy, the jobs flock to the duey cluster. This works fine for three or four hours and then stops altogether for several hours, even though many runnable jobs are still in the queue.
The jobs start flocking again after several hours or immediately after a condor_restart is performed on louie. However, after several hours all the jobs stop migrating again. Has anyone had this problem?


Very odd. When you say that you do a condor_restart on louie, what daemons are running on the machine in question? Are you restarting the schedd, or is it just the collector and negotiator?

In the schedd logs, you should see statements about the "flock level". Can you please check what this is doing during the time when flocking is not working?
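
If it helps to pull those statements out of a large log, here is a minimal sketch in Python. It assumes only that the relevant lines contain the words "flock level"; the exact message wording and the location of the schedd log depend on your Condor version and configuration, so the file is passed on the command line rather than hard-coded. The script name and function name are made up for illustration.

```python
import re
import sys

# Assumption: the schedd's messages contain the words "flock level".
# The exact wording may differ between Condor versions, so match loosely
# (any case, optional whitespace or underscore between the two words).
FLOCK_PATTERN = re.compile(r"flock[\s_]*level", re.IGNORECASE)


def flock_level_lines(lines):
    """Return the lines that mention the flock level, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if FLOCK_PATTERN.search(line)]


if __name__ == "__main__":
    # Usage (hypothetical script name): python flock_lines.py /path/to/schedd/log
    if len(sys.argv) != 2:
        sys.exit("usage: flock_lines.py SCHEDD_LOG")
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(sys.argv[1], errors="replace") as log:
        for hit in flock_level_lines(log):
            print(hit)
```

Comparing the lines from a period when jobs are flocking with the lines from a period when they are not should show whether the flock level changes between the two.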

Dan Bradley
University of Wisconsin, Condor Project

