[Condor-users] Problems with checkpointing.



Hi,
We're having problems with jobs that have checkpointed and been evicted:
they fail when they restart.

Our setup is a little strange, but we've been able to "get away with" a
similar migration plan in the past...

We have a mixed environment of RH9 and FC3 machines.  The submit machines,
the central manager, and the majority of the worker nodes are RH9.

We've upgraded some of our worker nodes to FC3, and initially we had the 
following:

I built an RPM that we've installed on all of the RH9 boxes which consists 
of the condor-6.7.6-dynamic tarball for RH9 and chkpt_server-6.7.0.

I built an RPM that we've installed on the FC3 worker nodes, which 
consists of condor-6.7.6-dynamic for FC1 and chkpt_server-6.7.0.

This didn't work: both individually submitted jobs and jobs from DAGs
would fail (signal 11, IIRC) when trying to restart after having
checkpointed and been evicted.


I then tried building an RPM for the FC3 nodes which consists of the 
condor-6.7.6-dynamic for RH9 and chkpt_server-6.7.0, and am seeing the 
same failures.

I'm waiting on users to help dig through their log files to get a better
understanding of where failed jobs ran initially and where they died, but
until I get that feedback, does anyone have any suggestions?
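
In case it helps anyone doing the same log digging, here is a rough Python
sketch of how the "where did it run, where did it die" question might be
answered from a job's user log. It assumes the usual event header layout
"NNN (cluster.proc.subproc) MM/DD HH:MM:SS text", with 001 marking "Job
executing on host: <addr:port>" and an indented "Abnormal termination
(signal N)" line under the termination event. The function name and the
sample log are made up for illustration and are not taken from our real
logs, so adjust the patterns to whatever your logs actually contain.

```python
import re
from collections import defaultdict

# Assumed header layout: "NNN (cluster.proc.subproc) MM/DD HH:MM:SS text"
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+) (.*)$")
HOST_RE = re.compile(r"<([^>]+)>")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def summarize(log_text):
    """Map "cluster.proc" to the hosts it ran on and its fatal signal, if any."""
    jobs = defaultdict(lambda: {"hosts": [], "signal": None})
    current = None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current = f"{int(header.group(2))}.{int(header.group(3))}"
            entry = jobs[current]
            if header.group(1) == "001":  # assumed code for "Job executing"
                host = HOST_RE.search(header.group(5))
                if host:
                    entry["hosts"].append(host.group(1))
            continue
        # Detail lines belong to the most recent event header.
        sig = SIGNAL_RE.search(line)
        if sig and current is not None:
            jobs[current]["signal"] = int(sig.group(1))
    return dict(jobs)


# Illustrative sample only; addresses and times are invented.
SAMPLE_LOG = """\
001 (042.000.000) 03/10 12:00:00 Job executing on host: <10.0.0.5:32791>
...
004 (042.000.000) 03/10 13:00:00 Job was evicted.
    (1) Job was checkpointed.
...
001 (042.000.000) 03/10 13:05:00 Job executing on host: <10.0.0.9:32802>
...
005 (042.000.000) 03/10 13:05:02 Job terminated.
    (0) Abnormal termination (signal 11)
...
"""

if __name__ == "__main__":
    for job, info in summarize(SAMPLE_LOG).items():
        print(job, "ran on", info["hosts"], "signal", info["signal"])
```

The first host listed for a job is where it started and the last one is
where it was running when it died, which should be enough to tell whether
the failures line up with an RH9-to-FC3 (or FC3-to-RH9) restart.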

Thanks!
Paul Armor