
Re: [Condor-users] job running on two hosts?



Dan Christensen <jdc@xxxxxx> writes:

> I'm running a standard universe job, and this is what the log file
> says:
>
> 000 (024.006.000) 11/15 22:57:33 Job submitted from host: <129.100.75.77:9657>
> ...
> 001 (024.006.000) 11/16 00:05:40 Job executing on host: <129.100.75.77:9668>
> ...
> 001 (024.006.000) 11/16 02:13:35 Job executing on host: <129.100.75.60:9622>
> ...
> 005 (024.006.000) 11/16 02:13:35 Job terminated.
>         (1) Normal termination (return value 1)
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         1169  -  Run Bytes Sent By Job
>         1751048  -  Run Bytes Received By Job
>         1169  -  Total Bytes Sent By Job
>         1751048  -  Total Bytes Received By Job
> ...
>
> There's no explanation of why the job was rerun on the second host.
> [Is this a bug in the logging?]
>
> And when it ran the second time, it seemed to start at the beginning,
> because it tried to open its output file, and it noticed that it
> already existed and quit right away.
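That "open, notice it exists, quit" behavior is typical of a job that creates its output exclusively and refuses to overwrite. Since Condor can re-run a standard universe job from the beginning after a crash, output handling has to tolerate a leftover file from the interrupted first run. A minimal sketch (hypothetical filenames, not your actual code) of a restart-tolerant alternative:

```python
import os

def open_output(path):
    """Open an output file so a restarted job overwrites stale
    results from an earlier, interrupted run."""
    # A job doing exclusive create -- open(path, "x") -- fails with
    # FileExistsError on re-execution, matching the "noticed it
    # already existed and quit" behavior described above.
    # Truncating instead makes the re-run idempotent.
    return open(path, "w")

# The second execution simply replaces the partial first-run output.
with open_output("result.out") as f:
    f.write("computed result\n")
```

Of course, whether overwriting is safe depends on the job; the point is just that any job Condor may restart needs a deliberate policy for pre-existing output.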

Here's another clue I just found: I got an e-mail from Condor saying
that condor_schedd on 129.100.75.77 died due to a SEGV.  Since the
schedd is what writes the user log, that would explain the missing
event: it crashed before it could record why the job left the first
host.

> Date: Tue, 16 Nov 2004 02:11:34 -0500
> 
> "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
> Condor will automatically restart this process in 10 seconds.

But now the question is, why did it die?

Dan