
[Condor-users] Condor doesn't recognize that tasks have finished



I am finding that if I have a large number of jobs that take a while to compute, Condor will eventually stop recognizing that tasks have finished. Once this happens, all my CPUs become unavailable because each CPU thinks it is still processing a job. If I go to the actual machine, the CPU is idle (0% usage) and the result of the program has been saved to the network. This happens on both dual-CPU and single-CPU machines.

If I place the tasks on hold, Condor starts to work again, but eventually the CPUs sometimes become unavailable again for the same reason.

Here is the error I see in the StarterLog of a machine that still thought the task was running:

5/15 23:00:50 About to exec C:\WINDOWS\System32\cmd.exe /Q /C task01.bat
5/15 23:00:50 Create_Process succeeded, pid=2588
5/15 23:01:16 Process exited, pid=2588, status=0
5/15 23:01:39 getpeername failed so connect must have failed
5/15 23:02:04 Connect failed for 30 seconds; returning FALSE
5/15 23:02:04 FileTransfer: Unable to connect to server <192.168.0.3:9611> <<<-----Windows XP STARTD Machine
5/15 23:02:04 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt <<<----- Lease never expires
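
To check how often this pattern occurs across a long StarterLog, something like the following might help. It is only a minimal sketch: the function name find_stuck_jobs is made up for illustration, and the two message formats it matches are taken from the excerpt above, so it assumes other entries in the log look the same. It reports each process that exited and was then followed by a "JIC::allJobsDone() failed" line.

```python
import re

# Message formats copied from the StarterLog excerpt; other Condor
# versions may word these lines differently (assumption).
EXITED = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")
FAILED = "JIC::allJobsDone() failed"


def find_stuck_jobs(lines):
    """Return (pid, status) pairs for processes that exited but whose
    final cleanup was later reported as failed (hypothetical helper)."""
    stuck = []
    last_exit = None  # most recent exit not yet paired with a failure
    for line in lines:
        match = EXITED.search(line)
        if match:
            last_exit = (int(match.group(1)), int(match.group(2)))
        elif FAILED in line and last_exit is not None:
            stuck.append(last_exit)
            last_exit = None
    return stuck


sample = [
    "5/15 23:00:50 Create_Process succeeded, pid=2588",
    "5/15 23:01:16 Process exited, pid=2588, status=0",
    "5/15 23:01:39 getpeername failed so connect must have failed",
    "5/15 23:02:04 Connect failed for 30 seconds; returning FALSE",
    "5/15 23:02:04 JIC::allJobsDone() failed, waiting for job lease "
    "to expire or for a reconnect attempt",
]

print(find_stuck_jobs(sample))
```

For a real log, the lines could be read with open(path) and passed in directly, since the function only iterates over them.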


Condor Collector/Negotiator: Condor 6.7.6 on Linux 7.2 / Dec Alpha

Condor StartD Machines: Condor 6.7.6 on Windows XP