
[Condor-users] startd stuck in a loop can not shut the daemon down

Hi all,

I have a faulty submit script. The Condor starter log says it cannot find the executable file, and then, according to the log, it does a fast shutdown of all jobs. When I check the process status, the process now shows as <defunct> and I cannot shut it down. I looked at the logs, and the StartLog output below repeats every ten seconds. It seems the starter has crashed and the startd cannot tell the starter to shut down, so it keeps retrying every 10 seconds. I still cannot figure out how to kill the processes; the Linux kill command has no effect. I always have to reboot when this happens.
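To make the process state concrete, here is a minimal sketch of listing the defunct processes together with their parent PIDs. It assumes a Linux /proc filesystem and that the processes really are in state Z (zombie). The names parse_stat and find_zombies are made up for illustration and are not part of Condor. A zombie has already exited, so kill does nothing to it; its entry goes away only when its parent reaps it or the parent itself exits, which is why the parent PID is the interesting part.

```python
import os

def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    rest = text[rparen + 1:].split()
    state, ppid = rest[0], int(rest[1])
    return pid, comm, state, ppid

def find_zombies(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process whose state is 'Z'."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to report
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process vanished or its stat file was unreadable
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies

if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} ({comm}) parent pid={ppid}")
```

If the parent turns out to be a Condor daemon, that daemon is the process to look at, not the defunct child.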

Condor 6.7.2 Master: RedHat WS 3 Intel
Condor 6.7.2 execute machine: the same machine as the master

CondorStarter output:

11/2 08:51:20 Submitting machine is "Thezorb.atomfx.com"
11/2 08:51:20 File transfer completed successfully.
11/2 08:51:21 Starting a VANILLA universe job with ID: 167.0
11/2 08:51:21 IWD: /opt/condor-6.7.2/local.Thezorb/execute/dir_7640
11/2 08:51:21 Output file: /opt/condor-6.7.2/local.Thezorb/execute/dir_7640/_condor_stdout_167.0
11/2 08:51:21 Error file: /opt/condor-6.7.2/local.Thezorb/execute/dir_7640/_condor_stderr_167.0
11/2 08:51:21 About to exec /opt/condor-6.7.2/local.Thezorb/execute/dir_7640/condor_exec.exe 1 1 /mnt/fileserver/production/shows/sot/comp/jm001/jm001_010/jm001_010_compLinux_v01.shk
11/2 08:51:21 Create_Process: child failed with errno 2 (No such file or directory) before exec()
11/2 08:51:21 ERROR "Create_Process(/opt/condor-6.7.2/local.Thezorb/execute/dir_7640/condor_exec.exe,condor_exec.exe 1 1 /mnt/fileserver/production/shows/sot/comp/jm001/jm001_010/jm001_010_compLinux_v01.shk, ...) failed" at line 403 in file os_proc.C
11/2 08:51:21 ShutdownFast all jobs.

StartLog output (loops over and over):

11/2 09:16:03 Connect failed for 10 seconds; returning FALSE
11/2 09:16:03 ERROR: SECMAN:2003:TCP connection to <> failed

11/2 09:16:03 Send_Signal: ERROR Connect to <> failed.
11/2 09:16:03 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
11/2 09:16:03 State change: Error sending signals to starter
11/2 09:16:03 Can't connect to <>:0, errno = 111
11/2 09:16:03 Will keep trying for 10 seconds...
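For reading the errno values in the logs above, here is a rough sketch that pulls the numbers out of log lines and maps them to their symbolic names with Python's standard errno module. The helper names errnos_in and describe are made up for illustration. The symbolic name for a given number depends on the platform the script runs on; on Linux, 2 is ENOENT and 111 is ECONNREFUSED.

```python
import errno
import re

# Matches both "errno 2" and "errno = 111" as they appear in the logs.
ERRNO_RE = re.compile(r"errno\s*=?\s*(\d+)")

def errnos_in(line):
    """Return every errno number mentioned in one log line."""
    return [int(n) for n in ERRNO_RE.findall(line)]

def describe(code):
    """Return the number with its symbolic name on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{code} {name}"

if __name__ == "__main__":
    sample = [
        "11/2 08:51:21 Create_Process: child failed with errno 2 "
        "(No such file or directory) before exec()",
        "11/2 09:16:03 Can't connect to <>:0, errno = 111",
    ]
    for line in sample:
        for code in errnos_in(line):
            print(describe(code))
```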

Thanks for any ideas, JW