
Re: [Condor-users] Jobs Still not returning any output



Ok. Sorted my matching problem.

Here is the directory listing after the job ran. All of the output and error files are empty.

-rw-r--r--  1 condor users     0 2005-10-25 00:53 error_0.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 error_1.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 error_2.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 error_3.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 error_4.out
-rw-r--r--  1 condor users   239 2005-10-24 23:17 hello.sub
-rwxr-xr-x  1 condor users 10457 2005-10-11 17:04 helloworld
-rw-r--r--  1 condor users  4450 2005-10-25 00:53 log.out
-rw-r--r--  1 condor users   137 2005-10-11 17:03 Main.cpp
-rw-r--r--  1 condor users     0 2005-10-25 00:53 output_0.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 output_1.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 output_2.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 output_3.out
-rw-r--r--  1 condor users     0 2005-10-25 00:53 output_4.out

Here is the StarterLog from node1, the only X86_64 machine in the pool right now.

10/25 01:51:33 ** condor_starter (CONDOR_STARTER) STARTING UP
10/25 01:51:33 ** /home/condor/release/sbin/condor_starter
10/25 01:51:33 ** $CondorVersion: 6.7.10 Aug 3 2005 $
10/25 01:51:33 ** $CondorPlatform: I386-LINUX_RH9 $
10/25 01:51:33 ** PID = 13889
10/25 01:51:33 ******************************************************
10/25 01:51:33 Using config file: /home/condor/condor_config
10/25 01:51:33 Using local config files: /home/condor/release/etc/node1.local
10/25 01:51:33 DaemonCore: Command Socket at <192.168.1.101:36023>
10/25 01:51:33 Done setting resource limits
10/25 01:51:33 Communicating with shadow <192.168.1.1:60161>
10/25 01:51:33 Submitting machine is "mgmnt.cluster.int"
10/25 01:51:33 File transfer completed successfully.
10/25 01:51:34 Starting a VANILLA universe job with ID: 7.4
10/25 01:51:34 IWD: /home/condor/hosts/node1/execute/dir_13889
10/25 01:51:34 Output file: /home/condor/hosts/node1/execute/dir_13889/output_4.out
10/25 01:51:34 Error file: /home/condor/hosts/node1/execute/dir_13889/error_4.out
10/25 01:51:34 About to exec /home/condor/hosts/node1/execute/dir_13889/condor_exec.exe
10/25 01:51:34 Create_Process succeeded, pid=13891
10/25 01:51:34 Process exited, pid=13891, status=0
10/25 01:51:34 Got SIGQUIT. Performing fast shutdown.
10/25 01:51:34 ShutdownFast all jobs.
10/25 01:51:34 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0


ShadowLog from submission machine (central manager)

10/25 00:53:43 ******************************************************
10/25 00:53:43 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/25 00:53:43 ** /home/condor/release/sbin/condor_shadow
10/25 00:53:43 ** $CondorVersion: 6.7.10 Aug 3 2005 $
10/25 00:53:43 ** $CondorPlatform: I386-LINUX_RH9 $
10/25 00:53:43 ** PID = 17843
10/25 00:53:43 ******************************************************
10/25 00:53:43 Using config file: /home/condor/etc/condor_config
10/25 00:53:43 Using local config files: /home/condor/release/etc/thebeast.local
10/25 00:53:43 DaemonCore: Command Socket at <192.168.1.1:60161>
10/25 00:53:43 Initializing a VANILLA shadow for job 7.4
10/25 00:53:43 (7.4) (17843): Request to run on <192.168.1.101:35998> was ACCEPTED
10/25 00:53:44 (7.4) (17843): Job 7.4 terminated: exited with status 0
10/25 00:53:44 (7.4) (17843): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100


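(There is a clock skew between the two machines: node1 is roughly 58 minutes ahead of thebeast, which is why the StarterLog timestamps look an hour later than the ShadowLog ones.)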

condor@thebeast:~/jobs/helloworld> date
Tue Oct 25 00:58:45 BST 2005
condor@thebeast:~/jobs/helloworld> ssh node1
Last login: Tue Oct 25 01:47:15 2005 from mgmnt.cluster.int
condor@node1:~> date
Tue Oct 25 01:56:41 BST 2005
condor@node1:~>
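
To line up the StarterLog entries with the ShadowLog and ScheddLog ones, the skew can be estimated from the two `date` outputs above. This is only a rough sketch: the helper names are made up for illustration, the year 2005 is assumed for the log timestamps (the logs do not print it), and the result overstates the skew by the few seconds it took to ssh to node1.

```python
from datetime import datetime

def parse_date_output(line):
    """Parse `date` output such as 'Tue Oct 25 00:58:45 BST 2005'.

    The zone token is dropped because both machines report the same zone.
    """
    parts = line.split()
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")

def parse_log_stamp(stamp, year=2005):
    """Parse a Condor log timestamp such as '10/25 01:51:34' (year assumed)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")

# Values copied from the shell session above.
submit_now = parse_date_output("Tue Oct 25 00:58:45 BST 2005")  # thebeast
node_now = parse_date_output("Tue Oct 25 01:56:41 BST 2005")    # node1

# Upper bound on the skew: includes the time spent logging in to node1.
skew = node_now - submit_now

# Shift a StarterLog timestamp onto the submit machine's clock.
starter_stamp = parse_log_stamp("10/25 01:51:34")
on_submit_clock = starter_stamp - skew

print("estimated skew:", skew)
print("StarterLog 01:51:34 on thebeast's clock:", on_submit_clock.time())
```

With that shift, the StarterLog entry for job 7.4 lands within a few seconds of the ShadowLog entries for the same job, so the two logs do describe the same run.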


ScheddLog from submitting machine (central manager)

10/25 00:53:16 (pid:16840) DaemonCore: Command received via UDP from host <192.168.1.1:37278>
10/25 00:53:16 (pid:16840) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
10/25 00:53:16 (pid:16840) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:16 (pid:16840) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:16 (pid:16840) Called reschedule_negotiator()
10/25 00:53:28 (pid:16840) DaemonCore: Command received via TCP from host <192.168.1.1:60128>
10/25 00:53:28 (pid:16840) DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
10/25 00:53:28 (pid:16840) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:28 (pid:16840) Checking consistency running and runnable jobs
10/25 00:53:28 (pid:16840) Tables are consistent
10/25 00:53:28 (pid:16840) Out of servers - 1 jobs matched, 4 jobs idle, 1 jobs rejected
10/25 00:53:30 (pid:16840) Starting add_shadow_birthdate(7.0)
10/25 00:53:30 (pid:16840) Started shadow for job 7.0 on "<192.168.1.101:35998>", (shadow pid = 17811)
10/25 00:53:30 (pid:16840) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:30 (pid:16840) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:31 (pid:16840) Shadow pid 17811 for job 7.0 exited with status 100
10/25 00:53:33 (pid:16840) Starting add_shadow_birthdate(7.1)
10/25 00:53:33 (pid:16840) Started shadow for job 7.1 on "<192.168.1.101:35998>", (shadow pid = 17821)
10/25 00:53:34 (pid:16840) Shadow pid 17821 for job 7.1 exited with status 100
10/25 00:53:35 (pid:16840) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:35 (pid:16840) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:36 (pid:16840) Starting add_shadow_birthdate(7.2)
10/25 00:53:36 (pid:16840) Started shadow for job 7.2 on "<192.168.1.101:35998>", (shadow pid = 17826)
10/25 00:53:38 (pid:16840) Shadow pid 17826 for job 7.2 exited with status 100
10/25 00:53:40 (pid:16840) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:40 (pid:16840) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxx
10/25 00:53:40 (pid:16840) Starting add_shadow_birthdate(7.3)
10/25 00:53:40 (pid:16840) Started shadow for job 7.3 on "<192.168.1.101:35998>", (shadow pid = 17836)
10/25 00:53:41 (pid:16840) Shadow pid 17836 for job 7.3 exited with status 100
10/25 00:53:43 (pid:16840) Starting add_shadow_birthdate(7.4)
10/25 00:53:43 (pid:16840) Started shadow for job 7.4 on "<192.168.1.101:35998>", (shadow pid = 17843)
10/25 00:53:44 (pid:16840) Shadow pid 17843 for job 7.4 exited with status 100
10/25 00:53:44 (pid:16840) match (<192.168.1.101:35998>#1130201383#2) out of jobs (cluster id 7); relinquishing
10/25 00:53:44 (pid:16840) Sent RELEASE_CLAIM to startd on <192.168.1.101:35998>
10/25 00:53:44 (pid:16840) Match record (<192.168.1.101:35998>, 7, -1) deleted
10/25 00:53:45 (pid:16840) DaemonCore: Command received via TCP from host <192.168.1.101:36027>
10/25 00:53:45 (pid:16840) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
10/25 00:53:45 (pid:16840) Got VACATE_SERVICE from <192.168.1.101:36027>
10/25 00:53:45 (pid:16840) Sent owner (0 jobs) ad to 1 collectors
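
For anyone who wants to check a longer ScheddLog for the same pattern, here is a small sketch that pulls the shadow exit status for each job out of lines like the ones above. The function name and the regular expression are my own and assume the log line format shown in this message; other Condor versions may word these lines differently.

```python
import re

# Matches lines such as:
#   "... Shadow pid 17811 for job 7.0 exited with status 100"
SHADOW_EXIT = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)

def shadow_exits(log_text):
    """Return a dict mapping job id (e.g. '7.0') to the shadow's exit status."""
    return {m.group(2): int(m.group(3)) for m in SHADOW_EXIT.finditer(log_text)}

# Three of the lines from the ScheddLog above, used as sample input.
sample = """\
10/25 00:53:31 (pid:16840) Shadow pid 17811 for job 7.0 exited with status 100
10/25 00:53:34 (pid:16840) Shadow pid 17821 for job 7.1 exited with status 100
10/25 00:53:44 (pid:16840) Shadow pid 17843 for job 7.4 exited with status 100
"""

exits = shadow_exits(sample)
for job_id, status in sorted(exits.items()):
    print(job_id, status)
```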


----- Original Message -----
From: "Erik Paulson" <epaulson@xxxxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Sent: Monday, October 24, 2005 6:41 PM
Subject: Re: [Condor-users] Jobs Still not returning any output



On Mon, Oct 24, 2005 at 06:36:01PM +0100, Chris Miles wrote:
StartLog from execute node


Actually, it's the StarterLog that we need to take a look at, not the StartLog. We're also going to need to see one from about the same time as a job was running - the log from below never had a job run on that machine during the 5 minutes the log covers.