
Re: [Condor-users] job running on two hosts?



Dan Christensen <jdc@xxxxxx> writes:

> I'm running a standard universe job, and this is what the log file
> says:
>
> 000 (024.006.000) 11/15 22:57:33 Job submitted from host: <129.100.75.77:9657>
> ...
> 001 (024.006.000) 11/16 00:05:40 Job executing on host: <129.100.75.77:9668>
> ...
> 001 (024.006.000) 11/16 02:13:35 Job executing on host: <129.100.75.60:9622>
> ...
> 005 (024.006.000) 11/16 02:13:35 Job terminated.
>         (1) Normal termination (return value 1)
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         1169  -  Run Bytes Sent By Job
>         1751048  -  Run Bytes Received By Job
>         1169  -  Total Bytes Sent By Job
>         1751048  -  Total Bytes Received By Job
> ...
>
> There's no explanation of why the job was rerun on the second host.
> [Is this a bug in the logging?]
>
> And when it ran the second time, it seemed to start at the beginning,
> because it tried to open its output file, and it noticed that it
> already existed and quit right away.
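That "open, notice it exists, quit" behavior is typical of a job that creates its output exclusively and refuses to overwrite. Since Condor can re-run a standard universe job from the beginning after a crash, output handling has to tolerate a leftover file from the interrupted first run. A minimal sketch (hypothetical filenames, not your actual code) of a restart-tolerant alternative:

```python
import os

def open_output(path):
    """Open an output file so a restarted job overwrites stale
    results from an earlier, interrupted run."""
    # A job doing exclusive create -- open(path, "x") -- fails with
    # FileExistsError on re-execution, matching the "noticed it
    # already existed and quit" behavior described above.
    # Truncating instead makes the re-run idempotent.
    return open(path, "w")

# The second execution simply replaces the partial first-run output.
with open_output("result.out") as f:
    f.write("computed result\n")
```

Of course, whether overwriting is safe depends on the job; the point is just that any job Condor may restart needs a deliberate policy for pre-existing output.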

Here's another clue I just found: I got an e-mail from Condor saying
that condor_schedd on 129.100.75.77 died due to a SEGV.  Since the
schedd is what writes the user log, that would explain the missing
event: it crashed before it could record why the job left the first
host.

> Date: Tue, 16 Nov 2004 02:11:34 -0500
> 
> "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
> Condor will automatically restart this process in 10 seconds.

But now the question is, why did it die?

Dan