
Re: [Condor-users] Problems with checkpointing.



Another thing... the user whose logs I'm just starting to check has told me 
that his failing jobs were condor_compile'ed under 6.7.3, and have been 
failing on 6.7.6.  I haven't heard back from the user whose snippets are 
listed earlier in the thread.

Would the jobs having been condor_compiled under 6.7.3 make a difference?
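
To see whether every job that checkpoints goes on to fail after restarting, a rough Python sketch like the one below could tally a user log. It assumes the event layout seen in the snippets quoted further down (a three-digit event code, the job ID in parentheses, a date and time, then indented detail lines). The function names are made up for illustration; this is not a Condor tool.

```python
import re
from collections import defaultdict

# Event header, e.g. "004 (12450.023.000) 04/25 17:38:10 Job was evicted."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) \d\d/\d\d \d\d:\d\d:\d\d (.*)$"
)
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def summarize(lines):
    """Return {job_id: {"checkpointed": bool, "signal": int or None}}."""
    jobs = defaultdict(lambda: {"checkpointed": False, "signal": None})
    code = None  # event code of the event currently being read
    job = None   # job ID of the event currently being read
    for raw in lines:
        line = raw.rstrip("\n")
        header = EVENT_RE.match(line)
        if header:
            code, job = header.group(1), header.group(2)
            jobs[job]  # make sure the job is recorded even with no details
            continue
        if job is None:
            continue  # detail line before any event header; ignore
        text = line.strip()
        if code == "004" and "Job was checkpointed" in text:
            jobs[job]["checkpointed"] = True
        elif code == "005":
            sig = SIGNAL_RE.search(text)
            if sig:
                jobs[job]["signal"] = int(sig.group(1))
    return dict(jobs)


def failed_after_checkpoint(jobs):
    """Job IDs that were checkpointed and later died on a signal."""
    return sorted(
        job_id
        for job_id, info in jobs.items()
        if info["checkpointed"] and info["signal"] is not None
    )


# Lines taken from the log snippet quoted in this thread.
SAMPLE = [
    "001 (12450.023.000) 04/25 16:40:32 Job executing on host: <129.89.200.36:32774>",
    "...",
    "004 (12450.023.000) 04/25 17:38:10 Job was evicted.",
    "\t(1) Job was checkpointed.",
    "...",
    "005 (12450.023.000) 04/25 19:45:49 Job terminated.",
    "        (0) Abnormal termination (signal 11)",
    "...",
]

if __name__ == "__main__":
    print(failed_after_checkpoint(summarize(SAMPLE)))
```

Since termination is the last thing logged for a run, any job flagged as both checkpointed and signalled was checkpointed first. Comparing that list against the jobs that were checkpointed and still finished normally would show whether the failure is universal.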

Thanks!
Paul


On Fri, 29 Apr 2005, Paul Armor wrote:

> Hi,
> here's a typical failure.  This is from one user's logs for 1600 jobs 
> submitted, of which 976 failed after restarting following a 
> checkpoint/eviction.  I'm just starting to go through the other users' 
> logs.  I'm not sure whether all jobs that checkpoint/evict fail after 
> restarting, but I don't believe they do.
> 
> Snippets from the user's log:
> 
> 000 (12450.023.000) 04/25 16:36:10 Job submitted from host: <129.89.201.232:57084>
> 001 (12450.023.000) 04/25 16:40:32 Job executing on host: <129.89.200.36:32774>
> 006 (12450.023.000) 04/25 17:38:09 Image size of job updated: 52448
> ...
> 004 (12450.023.000) 04/25 17:38:10 Job was evicted.
> 	(1) Job was checkpointed.
> 		Usr 0 00:44:47, Sys 0 00:00:11  -  Run Remote Usage
> 		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
> 		49975312  -  Run Bytes Sent By Job
> 		4201851  -  Run Bytes Received By Job
> ...
> 001 (12450.023.000) 04/25 19:45:45 Job executing on host: <129.89.201.56:32803>
> 005 (12450.023.000) 04/25 19:45:49 Job terminated.
>         (0) Abnormal termination (signal 11)
>         (0) No core file
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                 Usr 0 00:44:47, Sys 0 00:00:11  -  Total Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>         304  -  Run Bytes Sent By Job
>         53707404  -  Run Bytes Received By Job
>         0  -  Total Bytes Sent By Job
>         0  -  Total Bytes Received By Job
> ...
> 009 (12450.023.000) 04/25 19:45:49 Job was aborted by the user.
> 
> 
> Thanks for your help!
> Paul
> 
> 
> 
> On Fri, 29 Apr 2005, Paul Armor wrote:
> 
> > Hi Alan,
> > 
> > > I would think that these would be identical RPMs, since we don't distribute 
> > > different binaries for RedHat 9, Fedora Core 1, or Fedora Core 3: We build 
> > > it on RedHat 9 and it just works on the Fedora Core 1-3. I know that the 
> > > download web page lists them separately--this is to make it clear what to 
> > > download. But they are identical.
> > 
> > OK, I was feeling "superstitious" ;-)
> > 
> > > I'm also a bit confused--you're installing the checkpoint server on all the 
> > > execution computers?
> > 
> > Yes, I inherited the spec file and process, so...  (P.S. we're installing 
> > the same RPM on all nodes, using same condor_config, using different 
> > condor_config.local)
> > 
> > > Can you be more specific about the errors you are getting?
> > 
> > OK, I was waiting for more details from users... I'll attach a bunch of 
> > stuff below, trying to show the lifecycle of jobs, but here's a typical log 
> > entry when a job dies...  I know this job was condor_compiled on an RH9 
> > box; I don't know where it initially ran, but here it dies on an RH9 box:
> > 
> > 001 (12450.852.000) 04/27 17:08:09 Job executing on host: <129.89.200.78:51017>
> > ...
> > 005 (12450.852.000) 04/27 17:08:14 Job terminated.
> >         (0) Abnormal termination (signal 11)
> >         (0) No core file
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
> >                 Usr 0 01:30:00, Sys 0 00:00:32  -  Total Remote Usage
> >                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
> >         304  -  Run Bytes Sent By Job
> >         58917520  -  Run Bytes Received By Job
> >         0  -  Total Bytes Sent By Job
> >         0  -  Total Bytes Received By Job
> > ...
> > 
> > > Yeah--these are the same binaries. Sorry for the confusion. :(
> > 
> > No worries, I still would have probably become superstitious ;-)
> > 
> > > I think we need to see some log files to better help you.
> > 
> > Actually, what's the preferred method of overwhelming you with logs?  
> > Shall I throw them up so as to be http-able?  Or would you prefer email?
> > 
> > Cheers,
> > Paul
> > 
> > 
> > 
> 
> 

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator        parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462                                                            +
+ U. of W. - Milwaukee                                                   +
+ PO Box 413                                                414-229-2677 +
+ Milwaukee, WI 53201                                   fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++