Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1



> On Nov 5, 2015, at 2:22 PM, Feldt, Andrew N. <afeldt@xxxxxx> wrote:
> 
>> 
>> On Nov 5, 2015, at 2:07 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>> 
>> On 11/4/2015 1:15 PM, Feldt, Andrew N. wrote:
>> 
>>>> 
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Opened "/var/lib/condor/spool/8/0/cluster8.proc0.subproc0" via file stream
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[0](DATA) OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[1](STACK) OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Found a DATA block, increasing heap from 0x887000 to 0x986000
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: About to overwrite 1789952 bytes starting at 0x7d1000(DATA)
>>>> 11/02/15 11:18:33 (8.0) (2889688):Reaped child status - pid 2889690 exited with status 0
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes uninitialized stack frame ***: condor_exec.8.0 terminated
>>>> 
>> 
>> I think "longjmp causes uninitialized stack frame" is coming from the fortified longjmp check in glibc, which is enabled by GCC's fortify source (_FORTIFY_SOURCE) compiler options.
>> 
>> So you are running universe=standard jobs on HTCondor v8.4.1 on RHEL 6.7.  Some questions -
>> 
>> - Is this failure on restart happening at your site for ALL standard universe jobs?  Or just consistently for certain jobs?  Or only occasionally?  If the latter, ~ how many jobs get stuck on restart - 5%, 50%, 90%, or?
>> 
>> - Where did you get your HTCondor binaries from?  Options include RPMs downloaded from htcondor.org, RPMs from EPEL, self-compiled from source, or?
>> 
>> - Could you send along the output from condor_version ?
>> 
>> thanks
>> Todd
>> 
> 
> Todd,
> 
> Yes, universe=standard jobs on HTCondor v8.4.1 on RHEL 6.7
> 
> 1 - This happens for ALL standard universe jobs which get vacated.
> 2 - The HTCondor binaries are from the repo at http://www.cs.wisc.edu/condor/yum/stable/rhel6
> 3 -
> $CondorVersion: 8.4.1 Oct 26 2015 BuildID: 346648 $
> $CondorPlatform: X86_64-RedHat_6.7 $
> 
> Note that I have now turned off all configuration for vacating jobs and no longer run condor_kbdd, so that the faculty member running parallel jobs can run them uninterrupted (they run for 3-4 months).
> 
> I can make this happen by submitting a job which I have compiled with the current condor_compile and forcing it to run on an unused system and then vacating it.  It dies instead of moving to another system.
> 
> Andy

Todd,

We have now reverted to condor-8.2.10-345812 for our production HTCondor pool.  This is allowing our jobs to properly vacate as needed.  (This is from the htcondor-previous repo.)  I will be interested in future updates to the 8.4 series which may address the checkpoint-restart problem.

Andy
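
For anyone trying to reproduce this, the steps described in the quoted message (compile with condor_compile, force the job onto an unused machine, then vacate it) could be sketched roughly as the submit description below. This is only an illustration under assumptions, not the files actually used: the names myjob.c, myjob.condor, repro.sub, and idle-node.example.edu are hypothetical placeholders.

```
# repro.sub: hypothetical submit description for reproducing the restart failure.
# Build the relinked executable first (myjob.c is a placeholder source file):
#   condor_compile gcc -o myjob.condor myjob.c
universe     = standard
executable   = myjob.condor
# Force the job onto one otherwise unused machine (placeholder hostname).
requirements = (Machine == "idle-node.example.edu")
output       = repro.out
error        = repro.err
log          = repro.log
queue
```

After `condor_submit repro.sub`, and once the job is running, `condor_vacate idle-node.example.edu` should make the job checkpoint and leave the machine. On an affected 8.4.1 pool the restart would then presumably fail with the longjmp message shown in the log above. Note that the requirements line as written keeps the job on a single host, so it would need to be relaxed if the goal is to watch the job move to another system.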