
[condor-users] Submission and Worker Machines Out of Sync


I'm experiencing a strange problem with Condor. When a machine that is running a simulation is rebooted, switched off, or simply disconnected from the network for a while and then reconnected, the shadow process on the submission machine still thinks the simulation is running on that machine. After the "outage" the worker machine reappears in the pool in the Unclaimed or Owner state. So, simply by comparing the output of condor_status and condor_q -run, one can see the discrepancy: according to condor_status the machine is not running anything, whereas according to condor_q it is. Moreover, Condor may even start another job on this worker machine, in which case condor_q shows the machine running several jobs at the same time.
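The comparison described above could be automated roughly as follows. This is only a sketch: it assumes the output of condor_q -run and condor_status has already been parsed by hand into plain Python structures, and the function names, job IDs, and host names are hypothetical, not part of Condor. It also assumes one job per machine, which would not hold for multi-CPU workers.

```python
from collections import Counter


def find_stale_claims(running_jobs, machine_states):
    """Return jobs the submit side believes are running on a host
    that the pool does not report as Claimed.

    running_jobs: list of (job_id, host) pairs, as read off condor_q -run.
    machine_states: dict mapping host -> state, as read off condor_status.
    """
    stale = []
    for job_id, host in running_jobs:
        # A host missing from the pool listing is labelled "Absent" here
        # (a label made up for this sketch, not a Condor state).
        state = machine_states.get(host, "Absent")
        if state != "Claimed":
            stale.append((job_id, host, state))
    return stale


def find_overbooked_hosts(running_jobs):
    """Return hosts that the submit side lists for more than one job
    (assumes single-job machines)."""
    counts = Counter(host for _, host in running_jobs)
    return {host: n for host, n in counts.items() if n > 1}
```

For example, with jobs ("12.0", "node1") and ("14.0", "node1") listed as running while node1 shows as Unclaimed, both jobs would be flagged as stale and node1 would be reported as overbooked.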

Condor does not see this discrepancy even hours after the "outage". Basically, it never notices anything.

Worker machines are running 6.5.5 and 6.5.4, whereas the central manager/submission machine (Linux) is running 6.6.0 (the same problem also appeared with 6.5.5 on the central manager/submission machine). I'm wondering whether anybody else has experienced this problem, or whether it is a mistake in how I configured the pool. As far as I can see, I'm using the default settings.

Kind Regards,
Alexander Klyubin