[Condor-users] problem submitting jobs



I am setting up Condor at my institution, and some machines have
trouble submitting jobs.  Here are the relevant logs:

1. Submitting machine - ShadowLog keeps repeating:
5/26 18:08:02 ******************************************************
5/26 18:08:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP
5/26 18:08:02 ** /usr/local/condor/sbin/condor_shadow
5/26 18:08:02 ** $CondorVersion: 6.6.11 Mar 23 2006 $
5/26 18:08:02 ** $CondorPlatform: I386-LINUX_RH9 $
5/26 18:08:02 ** PID = 15763
5/26 18:08:02 ******************************************************
5/26 18:08:02 Using config file: /home/condor/condor_config
5/26 18:08:02 Using local config files: /home/condor/condor_config.local
5/26 18:08:02 DaemonCore: Command Socket at <193.140.60.135:1730>
5/26 18:08:03 Initializing a JAVA shadow
5/26 18:08:03 (16.0) (15763): Request to run on <172.18.2.15:1033> was ACCEPTED
5/26 18:11:13 (16.0) (15763): IO: Failed to read packet header
5/26 18:11:13 (16.0) (15763): ERROR "Can no longer talk to condor_starter on execute machine (172.18.2.15)" at line 63 in file NTreceivers.C

2. Submitting machine - SchedLog keeps repeating:

5/26 18:11:13 Started shadow for job 16.0 on "<172.18.2.15:1033>", (shadow pid = 15846)
5/26 18:11:13 Sent ad to central manager for dyuret@xxxxxxxxxxxxxxxxxxxxxx
5/26 18:11:14 IO: Failed to read packet header
5/26 18:14:24 Shadow pid 15846 for job 16.0 exited with status 4
5/26 18:14:24 ERROR: Shadow exited with job exception code!

3. After #2 repeats 5 times, SchedLog says:
5/26 18:14:24 Match for cluster 16 has had 5 shadow exceptions, relinquishing.
5/26 18:14:24 Sent RELEASE_CLAIM to startd on <172.18.2.15:1033>
5/26 18:14:24 Match record (<172.18.2.15:1033>, 16, 0) deleted
5/26 18:16:13 Sent ad to central manager for dyuret@xxxxxxxxxxxxxxxxxxxxxx
5/26 18:18:28 Activity on stashed negotiator socket
5/26 18:18:28 Negotiating for owner: dyuret@xxxxxxxxxxxxxxxxxxxxxx
5/26 18:18:28 Checking consistency running and runnable jobs
5/26 18:18:28 Tables are consistent
5/26 18:18:28 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
5/26 18:18:28 Sent ad to central manager for dyuret@xxxxxxxxxxxxxxxxxxxxxx
5/26 18:18:30 Started shadow for job 16.0 on "<172.18.2.15:1033>", (shadow pid = 16012)
... and the whole thing starts repeating again.

4. Execution Machine - StartLog keeps repeating:

5/26 17:53:40 DaemonCore: Command received via TCP from host <193.140.60.135:1724>
5/26 17:53:40 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
5/26 17:53:40 Got activate_claim request from shadow (<193.140.60.135:1724>)
5/26 17:53:40 Remote job ID is 16.0
5/26 17:53:40 Got universe "JAVA" (10) from request classad
5/26 17:53:40 State change: claim-activation protocol successful
5/26 17:53:40 Changing activity: Idle -> Busy
5/26 17:56:50 Starter pid 8915 exited with status 4
5/26 17:56:50 State change: starter exited
5/26 17:56:50 Changing activity: Busy -> Idle

5. Execution Machine - StarterLog keeps repeating:

5/26 18:10:30 ******************************************************
5/26 18:10:30 ** condor_starter (CONDOR_STARTER) STARTING UP
5/26 18:10:30 ** /usr/local/condor/sbin/condor_starter
5/26 18:10:30 ** $CondorVersion: 6.6.11 Mar 23 2006 $
5/26 18:10:30 ** $CondorPlatform: I386-LINUX_RH9 $
5/26 18:10:30 ** PID = 8986
5/26 18:10:30 ******************************************************
5/26 18:10:30 Using config file: /home/condor/condor_config
5/26 18:10:30 Using local config files: /home/condor/condor_config.local
5/26 18:10:30 DaemonCore: Command Socket at <172.18.2.15:1807>
5/26 18:10:30 Done setting resource limits
5/26 18:10:30 Starter communicating with condor_shadow <193.140.60.135:1739>
5/26 18:10:30 Submitting machine is "intelligence.ku.edu.tr"
5/26 18:10:30 Initialized IO Proxy.
5/26 18:13:39 Can't connect to <193.140.60.135:1739>:0, errno = 110
5/26 18:13:39 Will keep trying for 10 seconds...
5/26 18:13:40 Connect failed for 10 seconds; returning FALSE
5/26 18:13:40 ERROR "Unable to connect to server <193.140.60.135:1739>
" at line 571 in file file_transfer.C
5/26 18:13:40 ShutdownFast all jobs.
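
The StarterLog failure (errno = 110, which is ETIMEDOUT on Linux) can be
checked outside Condor with a plain TCP connection attempt from the
execution machine back to the submitting machine. The following is only a
minimal sketch: the helper name can_connect is hypothetical, and because
the shadow listens on a different port each time it starts, the port has
to be taken from the most recent log. A "connection refused" result would
still show that packets reach the submitting machine, whereas a timeout
would match what the starter reports.

```python
import errno
import socket


def can_connect(host, port, timeout=10.0):
    """Attempt a single TCP connection and report how it ended.

    Returns (True, None) on success, otherwise (False, name), where name
    is a symbolic errno such as "ECONNREFUSED" or "ETIMEDOUT".
    """
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect() call itself.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        # Python's own timeout carries no errno; report it with the same
        # name the kernel uses for errno 110 on Linux.
        return False, "ETIMEDOUT"
    except OSError as exc:
        # Map the numeric errno to its symbolic name where one exists.
        return False, errno.errorcode.get(exc.errno, str(exc))
```

Run on the execution machine, for example as
can_connect("193.140.60.135", 1739), using the address and port from the
StarterLog lines above.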