
Re: [condor-users] Jobs Stop Migrating



Actually, since upgrading duey from Condor 6.4.7 to 6.6 (to match what is on louie), this problem has not recurred.

However, since then I haven't been able to run the simple MPI job listed in section 2.10.2, "MPI Job Submission", of the Condor manual.

Thanks,
Joel

Dan Bradley wrote:



Joel Hernandez wrote:

We have two clusters, louie and duey. Users submit their jobs on the louie cluster. When all the nodes on louie are busy, the jobs flock to the duey cluster. This works fine for three or four hours and then stops altogether for several hours, even though many runnable jobs are still in the queue.
The jobs start flocking again after several hours or immediately after a condor_restart is performed on louie. However, after several hours all the jobs stop migrating again. Has anyone had this problem?



Very odd. When you say that you do a condor_restart on louie, what daemons are running on the machine in question? Are you restarting the schedd, or is it just the collector and negotiator?


In the schedd logs, you should see statements about the "flock level". Can you please check what this is doing during the time when flocking is not working?
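
One rough way to pull those statements out of the schedd log is sketched below. It is only a sketch under assumptions: the log path is hypothetical (the real location can be found with `condor_config_val SCHEDD_LOG`), the helper name `flock_lines` is made up for illustration, and the exact wording of the log messages is not assumed, so the script simply matches any line containing "flock level", ignoring case.

```python
import re
from pathlib import Path


def flock_lines(log_text):
    """Return the log lines that mention the flock level (case-insensitive)."""
    pattern = re.compile(r"flock\s*level", re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    # Hypothetical path; query the real one with: condor_config_val SCHEDD_LOG
    log_path = Path("/var/log/condor/SchedLog")
    if log_path.exists():
        # errors="replace" keeps the scan going if the log has odd bytes in it.
        for line in flock_lines(log_path.read_text(errors="replace")):
            print(line)
    else:
        print(f"No schedd log found at {log_path}")
```

Comparing the timestamps on the matching lines with the periods when jobs stop migrating should show whether the flock level changes around those times.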

Dan Bradley
University of Wisconsin, Condor Project


Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>


