
Re: [Condor-users] Condor checkpointing problems?



On Mon, Nov 22, 2004 at 12:03:05PM -0500, Dan Christensen wrote:
> Will Andrews <andrewsw@xxxxxxxxxxxxxxxxxxxxx> writes:
> 
> > At Purdue University we've recently installed a new cluster
> > running RedHat ES3.  It's running Condor 6.6.5 alongside Debian
> > clusters running the same version.  About a week ago a user
> > reported seeing "shadow exception" errors in the logs.  The jobs
> > were unable to checkpoint.  At first we thought it applied to all
> > the nodes, but now we've narrowed it down to only nodes in the
> > new RedHat cluster.
> 
> I don't know if it's related, but we had problems with a node running
> FC1.  Upgrading seems to have fixed it.  The funny thing is, the
> problem depended on a peculiar combination of where the job was
> compiled and where it was run.  Here's a message I sent to
> condor-users in June (message id <87brk0dc2k.fsf@xxxxxx>).  I don't
> recall getting any responses.
> 

We had previously believed that we checkpointed OK on FC1 machines - we
now know that Condor does NOT checkpoint on FC1 machines without some
changes to the machine. 

I _think_ all that you need to do is disable exec-shield:

echo 0 > /proc/sys/kernel/exec-shield

That removes the address-space randomization that is giving us
some trouble.
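
If it helps, here is a slightly more careful way to do the same thing.
This is only a sketch: the function name disable_exec_shield and its
optional path argument are made up for illustration and are not part of
Condor. It assumes the kernel exposes the setting at
/proc/sys/kernel/exec-shield, as in the command above, and does nothing
on kernels that lack that file.

```sh
#!/bin/sh
# Hypothetical helper (not part of Condor): turn off exec-shield if the
# kernel exposes it. The optional first argument overrides the path to the
# setting; it defaults to the /proc entry used in the command above.
disable_exec_shield() {
    knob="${1:-/proc/sys/kernel/exec-shield}"

    # Kernels without the exec-shield patch have no such file.
    if [ ! -e "$knob" ]; then
        echo "no exec-shield knob at $knob; nothing to do"
        return 0
    fi

    echo "exec-shield before: $(cat "$knob")"
    # Writing 0 disables exec-shield; this needs root on a real system.
    echo 0 > "$knob" || return 1
    echo "exec-shield after: $(cat "$knob")"
}
```

Run disable_exec_shield as root on each affected execute node. A value
written under /proc does not survive a reboot, so it has to be reapplied
at boot time.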

We're working on handling the new address space layouts; disabling
exec-shield system-wide is not a long-term solution.

-Erik

> Dan
> 
> Dan Christensen wrote:
> 
> > Alain Roy <roy@xxxxxxxxxxx> writes:
> > 
> > > Richard O'Shaughnessy wrote:
> > >> We recently rebuilt our cluster using fedora core 2.  But while job
> > >> output seems to work (at least, I can see output on some
> > >> jobs), checkpointing doesn't seem to be working correctly:
> > >
> > > I'm not surprised--Fedora Core 2 uses a newer Linux kernel version
> > > (2.6) than we have worked with in Condor.
> > 
> > On our Condor cluster we're having trouble with checkpointing on the
> > one machine which runs Fedora Core 1 (1, not 2).  That machine uses
> > glibc-2.3.2-101.4 with kernel 2.4.22-1.2188.nptlsmp.
> > 
> > The situation is a bit complicated.  Our pool runs a mix of Linuxes.
> > Several machines run Debian testing, several run various versions
> > of RedHat 7.x and 8.0, and just the one above runs FC1.
> > 
> > Almost everything seems to work fine, except that jobs compiled using
> > condor_compile on the Debian machines or on the FC1 machine don't
> > checkpoint when run on the FC1 machine.  They checkpoint on the other
> > RedHat machines and on the Debian machines.  And if I compile my jobs
> > on any of the other RedHat machines, they checkpoint everywhere.
> > 
> > The Debian machines on which I run condor_compile have libc6 2.3.2.
> > And I've tried gcc 3.2.3 and 3.3.3, and both have the same problem.
> > I also tried gcc 2.95, and compilation failed.
> > 
> > The RedHat machines (besides the FC1 machine) have libc6 2.2.5 and gcc
> > "2.96".
> > 
> > All the machines run Condor 6.6.3.
> > 
> > Any thoughts?  We don't see anything useful in the log files.  What
> > debugging options would give more information?
> > 
> > Dan