
[Condor-users] IDLE then RUN then IDLE for nothing



Hello,

When I submit Condor jobs, they start running (ST=R) for a few seconds, then return to idle (ST=I) without producing any results. I examined the logs:


The job log on the submit machine:
000 (001.000.000) 06/25 11:59:41 Job submitted from host: <192.168.1.1:54151>
...
007 (001.000.000) 06/25 11:59:58 Shadow exception!
Can no longer talk to condor_starter on execute machine (192.168.1.23)
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job


The StartLog on the execute machine:
6/25 11:59:19 Starter pid 25840 exited with status 4


The StarterLog.vm2 on the execute machine:
6/25 11:59:13 ******************************************************
6/25 11:59:13 ** condor_starter (CONDOR_STARTER) STARTING UP
6/25 11:59:13 ** $CondorVersion: 6.6.5 May 3 2004 $
6/25 11:59:13 ** $CondorPlatform: PPC-DARWIN-6_8 $
6/25 11:59:13 ** PID = 25840
6/25 11:59:13 ******************************************************
6/25 11:59:13 Using config file: /Users/condor/Programmes/condor-6.6.5/etc/condor_config
6/25 11:59:13 Using local config files: /Users/condor/Programmes/condor-6.6.5/local.cluster13/condor_config.local
6/25 11:59:13 DaemonCore: Command Socket at <192.168.1.23:55008>
6/25 11:59:13 Setting resource limits not implemented!
6/25 11:59:13 Starter communicating with condor_shadow <192.168.1.1:54937>
6/25 11:59:13 Submitting machine is "(null)"
6/25 11:59:13 ERROR "Assertion ERROR on (shadow->name())" at line 984 in file jic_shadow.C
6/25 11:59:13 ShutdownFast all jobs.



The ShadowLog on the manager machine:
6/25 11:59:58 ******************************************************
6/25 11:59:58 Using config file: /Users/condor/Programmes/condor-6.6.5/etc/condor_config
6/25 11:59:58 Using local config files: /Users/condor/Programmes/condor-6.6.5/local.E6-Xserve/condor_config.local
6/25 11:59:58 DaemonCore: Command Socket at <192.168.1.1:54955>
6/25 11:59:59 Initializing a VANILLA shadow
6/25 12:00:01 (1.2) (18193): Request to run on <192.168.1.23:54974> was ACCEPTED
6/25 12:00:01 (1.2) (18193): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C
6/25 12:00:01 ******************************************************
6/25 12:00:01 ** condor_shadow (CONDOR_SHADOW) STARTING UP
6/25 12:00:01 ** $CondorVersion: 6.6.5 May 3 2004 $
6/25 12:00:01 ** $CondorPlatform: PPC-DARWIN-6_8 $
6/25 12:00:01 ** PID = 18196
6/25 12:00:01 ******************************************************
6/25 12:00:01 Using config file: /Users/condor/Programmes/condor-6.6.5/etc/condor_config
6/25 12:00:01 Using local config files: /Users/condor/Programmes/condor-6.6.5/local.E6-Xserve/condor_config.local
6/25 12:00:01 DaemonCore: Command Socket at <192.168.1.1:54961>
6/25 12:00:02 Initializing a VANILLA shadow
6/25 12:00:04 (1.0) (18195): Request to run on <192.168.1.23:54974> was REFUSED
6/25 12:00:04 (1.0) (18195): Job 1.0 is being evicted
6/25 12:00:04 (1.0) (18195): logEvictEvent with unknown reason (108), aborting
6/25 12:00:04 (1.0) (18195): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
6/25 12:00:07 (1.1) (18196): Request to run on <192.168.1.23:54974> was REFUSED
6/25 12:00:07 (1.1) (18196): Job 1.1 is being evicted
6/25 12:00:07 (1.1) (18196): logEvictEvent with unknown reason (108), aborting
6/25 12:00:07 (1.1) (18196): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108



It's strange, and I don't understand what is really happening. Could you help?
The computers are running Mac OS X.

Thanks,
Jérôme