
Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1



On 11/10/2015 11:03 AM, Feldt, Andrew N. wrote:


Todd,

We have now reverted to condor-8.2.10-345812 for our production
HTCondor pool.  This is allowing our jobs to properly vacate as
needed.  (This is from the htcondor-previous repo.)  I will be
interested in future updates to the 8.4 series which may address the
checkpoint-restart problem.

Andy


Hi Andy,

We think we now know what is happening and how to fix it.

I am guessing that your v8.4 attempt was using binaries from the RPM package?

Our thinking is that the v8.4 binaries contained in the tarball would work, but the v8.4 binaries in the RPM packages would fail (with respect to standard universe restart). This is because our tarball binaries are built with cmake directly, while our RPM packages are built via rpmbuild calling out to cmake. The issue is that rpmbuild sneaks in a bunch of additional, undesired compiler flags. We are working to fix this issue for the upcoming HTCondor v8.4.2 release. You can follow progress and see details at:
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5382

Thank you for bringing this to our attention!!

Also, I am always curious how folks are using standard universe... could you share a brief description of the sort of jobs (e.g. what application, what scientific domain, etc.) that are using standard universe at the University of Oklahoma?

best regards
Todd