
Re: [Condor-users] Job in idle for long - condor_write(): Socket failure



On 10/11/06, Rajam S <rajam.suryanarayanan@xxxxxxxxx> wrote:

Hi all,

My Condor job sometimes stays in the idle state for a very long time
even when machines are available. When I checked the logs, I found that
the job is being matched and scheduled on a machine, but the SchedLog
shows the following error.

SchedLog (the job ID is 214.0, the submit machine is 172.31.44.201, and
the target machine is 172.31.44.79):

10/11 16:21:17 (pid:14646) Checking consistency running and runnable jobs
10/11 16:21:17 (pid:14646) Tables are consistent
10/11 16:21:17 (pid:14646) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
10/11 16:21:17 (pid:14646) attempt to connect to <172.31.44.201:38283> failed
10/11 16:21:18 (pid:14646) condor_write(): Socket closed when trying to write buffer, fd is 11, errno=107
10/11 16:21:18 (pid:14646) Buf::write(): condor_write() failed
10/11 16:21:18 (pid:14646) SECMAN: failed to end classad message
10/11 16:21:18 (pid:14646) ERROR: SECMAN:2007:Failed to end classad message
10/11 16:21:18 (pid:14646) Couldn't send REQUEST_CLAIM to startd at <172.31.40.79:43256>
10/11 16:21:18 (pid:14646) attempt to connect to <172.31.44.201:38257> failed
10/11 16:21:38 (pid:14646) Connect failed for 20 seconds; returning FALSE
10/11 16:21:38 (pid:14646) ERROR: SECMAN:2003:TCP connection to <172.31.40.79:43256> failed

10/11 16:21:38 (pid:14646) Sent RELEASE_CLAIM to startd on <172.31.40.79:43256>
10/11 16:21:38 (pid:14646) Match record (<172.31.40.79:43256>, 214, 0) deleted


Could anyone please help me with this? What could be the issue?

Thanks in advance...

--
Regards
Rajam S
Ph: 09986063805


Is there a firewall between them? Check that both machines have the
required ports open. Try using nmap to verify that.
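
If nmap is not installed, a plain TCP connect test gives a rough version
of the same answer. The following is only a minimal sketch: the helper
names port_is_open and report are made up for illustration, and the
addresses you pass in should be the ones from your own SchedLog (the
ones in your log may no longer be valid if the daemons have been
restarted and picked new ports).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability
    and does not speak the Condor protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(targets, timeout=3.0):
    """Return one human-readable status line per (host, port) pair."""
    lines = []
    for host, port in targets:
        state = "reachable" if port_is_open(host, port, timeout) else "NOT reachable"
        lines.append(f"{host}:{port} is {state}")
    return lines
```

For example, running print("\n".join(report([("172.31.40.79", 43256)])))
on the submit machine checks the startd address from the log above. If a
port shows as not reachable from one machine but the daemon is running
on the other, a firewall in between is a likely suspect.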

Regards.
--
Diego Bello Carreño
Thesis Student, Ingeniería Civil Informática
UTFSM, Valparaíso, Chile
User #294897 counter.li.org