Re: [Condor-users] Condor checkpointing problems?



Will Andrews <andrewsw@xxxxxxxxxxxxxxxxxxxxx> writes:

> At Purdue University we've recently installed a new cluster
> running RedHat ES3.  It's running Condor 6.6.5 alongside Debian
> clusters running the same version.  About a week ago a user
> reported seeing "shadow exception" errors in the logs.  The jobs
> were unable to checkpoint.  At first we thought it applied to all
> the nodes, but now we've narrowed it down to only nodes in the
> new RedHat cluster.

I don't know if it's related, but we had problems with a node running
FC1.  Upgrading seems to have fixed it.  The funny thing is, the
problem depended on a peculiar combination of where the job was
compiled and where it was run.  Here's a message I sent to
condor-users in June (message id <87brk0dc2k.fsf@xxxxxx>).  I don't
recall getting any responses.

Dan

Dan Christensen wrote:

> Alain Roy <roy@xxxxxxxxxxx> writes:
> 
> > Richard O'Shaughnessy wrote:
> >> We recently rebuilt our cluster using Fedora Core 2.  But while job
> >> output seems to work (at least, I can see output on some
> >> jobs), checkpointing doesn't seem to be working correctly:
> >
> > I'm not surprised--Fedora Core 2 uses a newer Linux kernel version
> > (2.6) than we have worked with in Condor.
> 
> On our Condor cluster we're having trouble with checkpointing on the
> one machine which runs Fedora Core 1 (1, not 2).  That machine uses
> glibc-2.3.2-101.4 with kernel 2.4.22-1.2188.nptlsmp.
> 
> The situation is a bit complicated.  Our pool runs a mix of Linuxes.
> Several machines run Debian testing, several run various versions
> of RedHat 7.x and 8.0, and just the one above runs FC1.
> 
> Almost everything seems to work fine, except that jobs compiled using
> condor_compile on the Debian machines or on the FC1 machine don't
> checkpoint when run on the FC1 machine.  They checkpoint on the other
> RedHat machines and on the Debian machines.  And if I compile my jobs
> on any of the other RedHat machines, they checkpoint everywhere.
> 
> The Debian machines on which I run condor_compile have libc6 2.3.2.
> And I've tried gcc 3.2.3 and 3.3.3, and both have the same problem.
> I also tried gcc 2.95, and compilation failed.
> 
> The RedHat machines (besides the FC1 machine) have libc6 2.2.5 and gcc
> "2.96".
> 
> All the machines run Condor 6.6.3.
> 
> Any thoughts?  We don't see anything useful in the log files.  What
> debugging options would give more information?
> 
> Dan
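
The compile-host/run-host combinations reported in the quoted message can be
written down as a small table.  The following is only a sketch of the
behaviour as described there, not a diagnosis.  The group labels ("debian",
"fc1", "redhat") and the function names are made up for illustration, and
"redhat" stands for the RedHat 7.x/8.0 machines other than the FC1 box.

```python
# Sketch of the reported checkpoint behaviour, keyed by where a job was
# built with condor_compile and where it later ran.
# The group labels below are hypothetical names, not Condor terms.
GROUPS = ("debian", "fc1", "redhat")


def checkpoints(compiled_on, run_on):
    """Return True if the report says this pairing checkpoints."""
    if compiled_on not in GROUPS or run_on not in GROUPS:
        raise ValueError("unknown machine group")
    # The only failing case reported: built on Debian or FC1, run on FC1.
    return not (run_on == "fc1" and compiled_on in ("debian", "fc1"))


def matrix():
    """Build the full compiled-on x run-on table as a dict."""
    return {(c, r): checkpoints(c, r) for c in GROUPS for r in GROUPS}


if __name__ == "__main__":
    print("compiled on / run on: " + "  ".join(GROUPS))
    for c in GROUPS:
        cells = ["ok" if checkpoints(c, r) else "FAIL" for r in GROUPS]
        print(c + ": " + "  ".join(cells))
```

Laid out this way, the failures are confined to the FC1 column, and only for
binaries built against the newer libc (2.3.2) on the Debian or FC1 machines.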