
[Condor-users] "Job disconnected" error



Hi,

I am submitting a job from the central manager of a pool of 5 machines (Windows
2000, Condor 6.8.4, universe = vanilla). I keep getting the "Job disconnected"
error even when the job runs on the same machine it was submitted from. How can
that happen? How can the job have received 108 bytes and yet be disconnected?
Could someone please help me understand what is going on? I asked a similar
question earlier, but the answer I got did not solve the problem, and neither
did the answers I found in the archives.

(NB: For some reason, the condor_config.local files on all machines have been
empty since installation. Is that expected?)

The log file contains:

000 (005.000.000) 05/11 11:39:14 Job submitted from host: <10.2.28.50:1055>
...
001 (005.000.000) 05/11 12:29:54 Job executing on host: <10.2.28.50:1056>
...
007 (005.000.000) 05/11 12:29:59 Shadow exception!
	Error from starter on lab121machine7.icsdomain.uonbi.ac.ke:
Create_Process(C:\condor\execute\dir_3308\condor_exec.exe,, ...) failed
	0  -  Run Bytes Sent By Job
	108  -  Run Bytes Received By Job
...
001 (005.000.000) 05/11 12:30:26 Job executing on host: <10.2.28.50:1056>
...
022 (005.000.000) 05/11 12:30:27 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to lab121machine7.icsdomain.uonbi.ac.ke <10.2.28.50:1056>
...
024 (005.000.000) 05/11 12:30:36 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to lab121machine7.icsdomain.uonbi.ac.ke, rescheduling job
...
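
To make the order of events in the log above easier to follow, here is a minimal Python sketch that splits the excerpt into events and prints one line per event. It assumes only the layout visible in the excerpt (a three-digit event code, the job ID in parentheses, date, time, a summary, then optional detail lines, with "..." closing each event). The names parse_events, EVENT_RE and SAMPLE_LOG are made up for this illustration and are not part of Condor.

```python
import re

# Header line of an event, as it appears in the excerpt above:
# code, (cluster.proc.subproc), MM/DD, HH:MM:SS, summary text.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$"
)

def parse_events(text):
    """Split user-log text into a list of event dicts (hypothetical helper)."""
    events = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped == "...":
            # "..." terminates the event being collected
            if current is not None:
                events.append(current)
                current = None
            continue
        match = EVENT_RE.match(line)
        if match:
            # A new header also closes any event left open
            if current is not None:
                events.append(current)
            current = {
                "code": match.group(1),
                "job": match.group(2),
                "time": match.group(3) + " " + match.group(4),
                "summary": match.group(5),
                "details": [],
            }
        elif current is not None:
            current["details"].append(stripped)
    if current is not None:
        events.append(current)
    return events

# The log excerpt from this message, copied as-is (raw string keeps backslashes).
SAMPLE_LOG = r"""
000 (005.000.000) 05/11 11:39:14 Job submitted from host: <10.2.28.50:1055>
...
001 (005.000.000) 05/11 12:29:54 Job executing on host: <10.2.28.50:1056>
...
007 (005.000.000) 05/11 12:29:59 Shadow exception!
    Error from starter on lab121machine7.icsdomain.uonbi.ac.ke:
Create_Process(C:\condor\execute\dir_3308\condor_exec.exe,, ...) failed
    0  -  Run Bytes Sent By Job
    108  -  Run Bytes Received By Job
...
001 (005.000.000) 05/11 12:30:26 Job executing on host: <10.2.28.50:1056>
...
022 (005.000.000) 05/11 12:30:27 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to lab121machine7.icsdomain.uonbi.ac.ke <10.2.28.50:1056>
...
024 (005.000.000) 05/11 12:30:36 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to lab121machine7.icsdomain.uonbi.ac.ke, rescheduling job
...
"""

events = parse_events(SAMPLE_LOG)
for event in events:
    print(event["code"], event["time"], event["summary"])
```

Printed this way, the excerpt shows the Shadow exception (event 007) occurring before the "Job disconnected" event (022), which may be a useful ordering to keep in mind when reading the log.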