
[Condor-users] dead jobs (even remove doesn't work)



hello

I queued a DAGMan workflow with about 1000 nodes. All of them were queued properly, but 2 of the jobs hang. If I hold them and then release them, they stay idle forever. If I remove them, they stay marked as removed in the queue. Even a condor_restart doesn't help; only a reboot of that machine does.

My ShadowLog has some authentication errors:
12/7 19:47:37 ******************************************************
12/7 19:47:37 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/7 19:47:37 ** $CondorVersion: 6.6.5 May 3 2004 $
12/7 19:47:37 ** $CondorPlatform: I386-LINUX-RH9 $
12/7 19:47:37 ** PID = 10166
12/7 19:47:37 ******************************************************
12/7 19:47:37 Using config file: /opt/condor//condor_config
12/7 19:47:37 Using local config files: /opt/condor/etc/condor_config.local
12/7 19:47:37 DaemonCore: Command Socket at <134.130.4.77:9688>
12/7 19:47:38 (2689.0) (9690): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:38 Initializing a VANILLA shadow
12/7 19:47:38 (2700.0) (10166): Request to run on <137.226.70.92:9615> was ACCEPTED
12/7 19:47:40 (2660.0) (9097): condor_write(): Socket closed when trying to write buffer
12/7 19:47:40 (2660.0) (9097): Buf::write(): condor_write() failed
12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
12/7 19:47:40 (2660.0) (9097): Authentication Error
AUTHENTICATE:1002:Failure performing handshake
12/7 19:47:40 (2660.0) (9097): Failed to update job queue!
12/7 19:47:40 (2660.0) (9097): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:41 (2675.0) (9596): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
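
To see at a glance which jobs are affected, the ShadowLog lines above can be scanned with a small script. This is only a rough sketch: the function name jobs_with_errors, the regular expression, and the choice of error strings are my own assumptions based on the lines shown here, not something Condor provides, and the log layout may differ in other versions.

```python
import re
from collections import defaultdict

# Matches shadow log lines of the form seen in the excerpt, e.g.
#   12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
# Groups: timestamp, job id (cluster.proc), shadow PID, message.
LINE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \((\d+\.\d+)\) \((\d+)\): (.*)$")

# Substrings copied from the excerpt; treating exactly these as the
# interesting errors is an assumption, adjust as needed.
ERROR_MARKERS = (
    "AUTHENTICATE: handshake failed",
    "Authentication Error",
    "Failed to update job queue",
)


def jobs_with_errors(lines):
    """Return {job_id: [messages]} for jobs whose shadow logged an error marker."""
    hits = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # banner lines, continuation lines, etc.
        _timestamp, job_id, _pid, message = match.groups()
        if any(marker in message for marker in ERROR_MARKERS):
            hits[job_id].append(message)
    return dict(hits)


# A few lines taken verbatim from the ShadowLog excerpt.
SAMPLE = """\
12/7 19:47:38 (2689.0) (9690): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:38 (2700.0) (10166): Request to run on <137.226.70.92:9615> was ACCEPTED
12/7 19:47:40 (2660.0) (9097): condor_write(): Socket closed when trying to write buffer
12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
12/7 19:47:40 (2660.0) (9097): Authentication Error
12/7 19:47:40 (2660.0) (9097): Failed to update job queue!
"""

if __name__ == "__main__":
    for job_id, messages in jobs_with_errors(SAMPLE.splitlines()).items():
        print(job_id, len(messages), "error line(s)")
```

To run it against a real log, pass the open file instead of SAMPLE.splitlines(), for example jobs_with_errors(open(path)) with path pointing at the ShadowLog.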


ScheddLog:
12/7 19:47:40 DC_AUTHENTICATE: attempt to open invalid session condor1:2339:1102444850:4316, failing.


What can I do to ensure that all jobs are executed, or that jobs that appear to hang are restarted? Each job takes about 20-40 minutes.

In my pool, one machine is both the submitter and the master; all the other machines are execute-only.

Thanks in advance
Thomas Lisson
RWTH-Grid