[Condor-users] Condor jobs hang



Hi,

I'm running a small pool with 5 machines running SuSE Linux 9.1.
Single job submissions run and complete fine.

When I submit multiple small jobs, the jobs run and, judging by their
output, should be complete (the job output files contain everything one
would expect to see at termination). However, condor_q still lists the
jobs as running hours after the output indicates that they finished.
condor_status also lists the machines as claimed/busy, but the load
average (as reported by both condor_status and top) is approximately
zero.

I am seeing the following two types of error sequences repeat in ShadowLog:

3/11 07:58:37 (7.6) (1655):sys_time = 12 ticks
3/11 07:58:37 (7.6) (1655):condor_read(): recv() returned -1, errno = 104, assuming failure.
3/11 07:58:37 (7.6) (1655):AUTHENTICATE: handshake failed!
3/11 07:58:37 (7.6) (1655):Authentication Error
AUTHENTICATE:1002:Failure performing handshake
3/11 07:58:37 (7.6) (1655):ERROR "Failed to connect to schedd!" at line 1000 in file shadow.C
3/11 07:58:37 (7.6) (1655):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster7.proc6.subproc0.tmp'
3/11 07:58:37 (7.6) (1655):Trying to unlink /home/condor/spool/cluster7.proc6.subproc0.tmp

And also:

3/11 07:59:56 (7.21) (3921):Shadow: Job 7.21 exited, termsig = 0, coredump = 0, retcode = 0
3/11 07:59:56 (7.21) (3921):Shadow: Job exited normally with status 0
3/11 07:59:56 (7.21) (3921):user_time = 15 ticks
3/11 07:59:56 (7.21) (3921):sys_time = 34 ticks
3/11 07:59:56 (7.21) (3921):condor_write(): Socket closed when trying to write buffer
3/11 07:59:56 (7.21) (3921):Buf::write(): condor_write() failed
3/11 07:59:56 (7.21) (3921):AUTHENTICATE: handshake failed!
3/11 07:59:56 (7.21) (3921):Authentication Error
AUTHENTICATE:1002:Failure performing handshake
3/11 07:59:56 (7.21) (3921):ERROR "Failed to connect to schedd!" at line 1000 in file shadow.C
3/11 07:59:56 (7.21) (3921):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster7.proc21.subproc0.tmp'
3/11 07:59:56 (7.21) (3921):Trying to unlink /home/condor/spool/cluster7.proc21.subproc0.tmp
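
In case it helps anyone triaging a similar log, here is a minimal sketch of
how the two sequences above could be summarized per job. It is not part of
Condor. It assumes each log entry starts with "date time (cluster.proc)
(pid):" exactly as in the excerpts, and the names summarize_shadow_log and
stuck_jobs are made up for illustration. On Linux, errno 104 is ECONNRESET
(connection reset by peer).

```python
import re

# Matches entries shaped like the excerpts above, e.g.
#   "3/11 07:59:56 (7.21) (3921):Shadow: Job exited normally with status 0"
# The format is taken from this post only; other versions may differ.
LINE_RE = re.compile(r"^\d+/\d+ \d+:\d+:\d+ \((\d+\.\d+)\) \(\d+\):(.*)$")

SAMPLE_LINES = [
    "3/11 07:58:37 (7.6) (1655):condor_read(): recv() returned -1, errno = 104, assuming failure.",
    '3/11 07:58:37 (7.6) (1655):ERROR "Failed to connect to schedd!" at line 1000 in file shadow.C',
    "3/11 07:59:56 (7.21) (3921):Shadow: Job exited normally with status 0",
    "3/11 07:59:56 (7.21) (3921):condor_write(): Socket closed when trying to write buffer",
    '3/11 07:59:56 (7.21) (3921):ERROR "Failed to connect to schedd!" at line 1000 in file shadow.C',
]


def summarize_shadow_log(lines):
    """Collect, per job id, whether it exited normally, whether the shadow
    then failed to reach the schedd, and any errno values that were logged."""
    summary = {}
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # wrapped continuation lines or unrelated text
        job_id, message = match.groups()
        info = summary.setdefault(
            job_id, {"exited_ok": False, "schedd_failed": False, "errnos": set()}
        )
        if "Job exited normally with status 0" in message:
            info["exited_ok"] = True
        if "Failed to connect to schedd" in message:
            info["schedd_failed"] = True
        errno_match = re.search(r"errno = (\d+)", message)
        if errno_match:
            info["errnos"].add(int(errno_match.group(1)))
    return summary


def stuck_jobs(summary):
    """Jobs that finished cleanly but whose shadow could not report back."""
    jobs = [j for j, info in summary.items()
            if info["exited_ok"] and info["schedd_failed"]]
    # Sort numerically by (cluster, proc) rather than as strings.
    return sorted(jobs, key=lambda j: tuple(int(part) for part in j.split(".")))


if __name__ == "__main__":
    result = summarize_shadow_log(SAMPLE_LINES)
    print("finished but not reported to schedd:", stuck_jobs(result))
```

To run it against a real log, pass an open file object for ShadowLog to
summarize_shadow_log instead of SAMPLE_LINES (the log location depends on
the local configuration).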


Any suggestions?

Thanks in advance,
Russ

-------------------------------------------------------------
Russ Joseph                           Technological Institute
Assistant Professor                   2145 Sheridan Road
Electrical and Computer Engineering   Evanston, IL 60208
Northwestern University               voice: 847-491-3061
rjoseph@xxxxxxxxxxxxxxxxxxxx          fax:   847-467-4144
           http://www.ece.northwestern.edu/~rjoseph
-------------------------------------------------------------