[Condor-users] some held/released jobs never execute



Hi,

With Condor 6.7.19 on Windows XP I have a strange problem. Sometimes, when I try to vacate a job from a machine, the condor_vacate_job command does not work, and I have to hold and release the job to get it re-negotiated.
But after going back to the idle state, the job never gets executed again.

This is what I found in the shadow log for such a job:

6/16 12:35:56 (68770.0) (472): Got SIGTERM. Performing graceful shutdown.
6/16 12:35:57 (68770.0) (472): getpeername failed so connect must have failed
6/16 12:36:16 (68770.0) (472): Connect failed for 20 seconds; returning FALSE
6/16 12:36:16 (68770.0) (472): RemoteResource::killStarter(): Could not send command to startd


And these are the last entries for it in the scheduler log:
6/16 12:11:36 Starting add_shadow_birthdate(68770.0)
6/16 12:11:36 Started shadow for job 68770.0 on "<192.168.0.101:1040>", (shadow pid = 472)


What is going on behind the scenes, and what can I do to force this job to execute?

Cheers,
Szabolcs