Re: [Condor-users] Problems with checkpointing.




I built an RPM that we've installed on all of the RH9 boxes which consists
of the condor-6.7.6-dynamic tarball for RH9 and chkpt_server-6.7.0.

I built an RPM that we've installed on the FC3 worker nodes, which
consists of condor-6.7.6-dynamic for FC1 and chkpt_server-6.7.0.

I would expect these to be identical RPMs, since we don't distribute different binaries for Red Hat 9, Fedora Core 1, or Fedora Core 3: we build Condor on Red Hat 9 and it just works on Fedora Core 1 through 3. I know the download web page lists them separately, but that is only to make it clear what to download. The binaries are identical.


I'm also a bit confused--you're installing the checkpoint server on all the execution computers?

This didn't work, and both individually submitted jobs and jobs from DAGs
would fail (signal 11, IIRC) when trying to restart after having been
checkpointed and evicted.

Can you be more specific about the errors you are getting?

I then tried building an RPM for the FC3 nodes which consists of the
condor-6.7.6-dynamic tarball for RH9 and chkpt_server-6.7.0, and am seeing
the same failures.

Yeah--these are the same binaries. Sorry for the confusion. :(

I'm waiting for users to help dig through their log files to get a better
understanding of where failed jobs ran initially and where they died; but
until I get such feedback, does anyone have any suggestions?

I think we need to see some log files to better help you.

-alain