
[Condor-users] job running on two hosts?



I'm running a standard universe job, and this is what the log file
says:

000 (024.006.000) 11/15 22:57:33 Job submitted from host: <129.100.75.77:9657>
...
001 (024.006.000) 11/16 00:05:40 Job executing on host: <129.100.75.77:9668>
...
001 (024.006.000) 11/16 02:13:35 Job executing on host: <129.100.75.60:9622>
...
005 (024.006.000) 11/16 02:13:35 Job terminated.
        (1) Normal termination (return value 1)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1169  -  Run Bytes Sent By Job
        1751048  -  Run Bytes Received By Job
        1169  -  Total Bytes Sent By Job
        1751048  -  Total Bytes Received By Job
...

There's no explanation of why the job was rerun on the second host.
[Is this a bug in the logging?]

And when it ran the second time, it apparently started from the
beginning: it tried to open its output file, noticed that the file
already existed, and quit right away.

So somehow things got into a state where the output file was
already created, but the job wasn't restarted from a checkpoint.
Any idea what could have happened?  The same thing happened to
a bunch of my jobs...

I've attached the relevant parts of the log files from the submit
machine, which was also the first execute machine.  I can also find
the log files on the other machines if this would be useful.

Thanks for any help.

Dan


jdc@jdc:/var/log/condor$ grep 24.6 *
SchedLog:11/16 00:05:40 Started shadow for job 24.6 on "<129.100.75.77:9668>", (shadow pid = 8413)
SchedLog:11/16 02:13:35 Started shadow for job 24.6 on "<129.100.75.60:9622>", (shadow pid = 10713)
ShadowLog:11/16 00:05:40 (?.?) (8413):Hostname = "<129.100.75.77:9668>", Job = 24.6
ShadowLog:11/16 00:05:40 (24.6) (8413):Requesting Primary Starter
ShadowLog:11/16 00:05:40 (24.6) (8413):Shadow: Request to run a job was ACCEPTED
ShadowLog:11/16 00:05:40 (24.6) (8413):Shadow: RSC_SOCK connected, fd = 17
ShadowLog:11/16 00:05:40 (24.6) (8413):Shadow: CLIENT_LOG connected, fd = 18
ShadowLog:11/16 00:05:40 (24.6) (8413):My_Filesystem_Domain = "jdc.math.uwo.ca"
ShadowLog:11/16 00:05:40 (24.6) (8413):My_UID_Domain = "math.uwo.ca"
ShadowLog:11/16 00:05:40 (24.6) (8413):	Entering pseudo_get_file_stream
ShadowLog:11/16 00:05:40 (24.6) (8413):	file = "/var/spool/condor/spool/cluster24.ickpt.subproc0"
ShadowLog:11/16 00:05:40 (24.6) (8413):	129.100.75.77
ShadowLog:11/16 00:05:40 (24.6) (8413):	129.100.75.77
ShadowLog:11/16 00:05:40 (24.6) (8413):Reaped child status - pid 8415 exited with status 0
ShadowLog:11/16 00:05:40 (24.6) (8413):Read: User Job - $CondorPlatform: I386-LINUX_RH9 $
ShadowLog:11/16 00:05:40 (24.6) (8413):Read: User Job - $CondorVersion: 6.7.2 Sep 27 2004 $
ShadowLog:11/16 00:05:40 (24.6) (8413):Read: Checkpoint file name is "/var/spool/condor/spool/cluster24.proc6.subproc0"
ShadowLog:11/16 02:13:35 (?.?) (10713):Hostname = "<129.100.75.60:9622>", Job = 24.6
ShadowLog:11/16 02:13:35 (24.6) (10713):Requesting Primary Starter
ShadowLog:11/16 02:13:35 (24.6) (10713):Shadow: Request to run a job was ACCEPTED
ShadowLog:11/16 02:13:35 (24.6) (10713):Shadow: RSC_SOCK connected, fd = 17
ShadowLog:11/16 02:13:35 (24.6) (10713):Shadow: CLIENT_LOG connected, fd = 18
ShadowLog:11/16 02:13:35 (24.6) (10713):My_Filesystem_Domain = "jdc.math.uwo.ca"
ShadowLog:11/16 02:13:35 (24.6) (10713):My_UID_Domain = "math.uwo.ca"
ShadowLog:11/16 02:13:35 (24.6) (10713):	Entering pseudo_get_file_stream
ShadowLog:11/16 02:13:35 (24.6) (10713):	file = "/var/spool/condor/spool/cluster24.ickpt.subproc0"
ShadowLog:11/16 02:13:35 (24.6) (10713):	129.100.75.77
ShadowLog:11/16 02:13:35 (24.6) (10713):	129.100.75.77
ShadowLog:11/16 02:13:35 (24.6) (10713):Reaped child status - pid 10714 exited with status 0
ShadowLog:11/16 02:13:35 (24.6) (10713):Read: User Job - $CondorPlatform: I386-LINUX_RH9 $
ShadowLog:11/16 02:13:35 (24.6) (10713):Read: User Job - $CondorVersion: 6.7.2 Sep 27 2004 $
ShadowLog:11/16 02:13:35 (24.6) (10713):Read: Checkpoint file name is "/var/spool/condor/spool/cluster24.proc6.subproc0"
ShadowLog:11/16 02:13:35 (24.6) (10713):Shadow: Job 24.6 exited, termsig = 0, coredump = 0, retcode = 1
ShadowLog:11/16 02:13:35 (24.6) (10713):Shadow: Job exited normally with status 1
ShadowLog:11/16 02:13:35 (24.6) (10713):user_time = 2 ticks
ShadowLog:11/16 02:13:35 (24.6) (10713):sys_time = 3 ticks
ShadowLog:11/16 02:13:35 (24.6) (10713):Static Policy: removing job because OnExitRemove has become true
ShadowLog:11/16 02:13:35 (24.6) (10713):********** Shadow Exiting(100) **********
StartLog:11/16 00:05:40 vm2: Remote job ID is 24.6
StarterLog.vm2:11/16 00:05:40 	Id: 24.6
StarterLog.vm2:11/16 00:05:40 User Process 24.6 {
StarterLog.vm2:11/16 00:05:40   cur_ckpt = dir_8414/condor_exec.24.6
StarterLog.vm2:11/16 00:05:40 Entering get_file( /var/spool/condor/spool/cluster24.ickpt.subproc0, dir_8414/condor_exec.24.6, 0755 )
StarterLog.vm2:11/16 00:05:40 Fetched orig ckpt file "/var/spool/condor/spool/cluster24.ickpt.subproc0" into "dir_8414/condor_exec.24.6" with 1 attempt
StarterLog.vm2:11/16 00:05:40 Executable 'dir_8414/condor_exec.24.6' is linked with "$CondorVersion: 6.7.2 Sep 27 2004 $" on a "$CondorPlatform: I386-LINUX_RH9 $"
StarterLog.vm2:11/16 00:05:40 Calling execve( "/var/spool/condor/execute/dir_8414/condor_exec.24.6", "condor_exec.24.6", "-_condor_cmd_fd", "14", "-m", "0", "-g", "-s", "5", "-x", "1000000000", "-c", "4.0", "-f", "output-two-fixed-points/0.c4", 0, "CONDOR_VM=vm2", "_condor_LOWPORT=9600", "_condor_HIGHPORT=9700", "CONDOR_SCRATCH_DIR=/var/spool/condor/execute/dir_8414", 0 )