Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1



On 11/4/2015 1:15 PM, Feldt, Andrew N. wrote:


11/02/15 11:18:33 (8.0) (2889688):Read: Opened "/var/lib/condor/spool/8/0/cluster8.proc0.subproc0" via file stream
11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK
11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[0](DATA) OK
11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[1](STACK) OK
11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK
11/02/15 11:18:33 (8.0) (2889688):Read: Found a DATA block, increasing heap from 0x887000 to 0x986000
11/02/15 11:18:33 (8.0) (2889688):Read: About to overwrite 1789952 bytes starting at 0x7d1000(DATA)
11/02/15 11:18:33 (8.0) (2889688):Reaped child status - pid 2889690 exited with status 0
11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes uninitialized stack frame ***: condor_exec.8.0 terminated


I think the "longjmp causes uninitialized stack frame" message comes from the checked longjmp that glibc uses when a binary is built with GCC's fortify source option (_FORTIFY_SOURCE).

So you are running universe=standard jobs on HTCondor v8.4.1 on RHEL 6.7. Some questions:

- Is this failure on restart happening at your site for ALL standard universe jobs? Or consistently for certain jobs only? Or just occasionally? If only occasionally, roughly what fraction of jobs get stuck on restart: 5%, 50%, 90%?

- Where did you get your HTCondor binaries from? RPMs downloaded from htcondor.org, RPMs from EPEL, compiled from source yourself, or somewhere else?

- Could you send along the output of condor_version?
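
If it helps with the first question, here is a rough Python sketch for tallying how many jobs hit the abort. It assumes the log lines have the same shape as the excerpt quoted above (date, time, job ID in parentheses, pid in parentheses, then the message). The function name and the sample line for job 9.0 are made up for illustration; only the 8.0 lines come from the excerpt.

```python
import re

# Assumed line shape, taken from the quoted excerpt:
#   MM/DD/YY HH:MM:SS (cluster.proc) (pid):message
LINE_RE = re.compile(r"^\S+ \S+ \((\d+\.\d+)\) \(\d+\):(.*)$")
FAILURE_TEXT = "longjmp causes uninitialized stack frame"


def restart_failure_share(lines):
    """Return (failed, seen): sets of 'cluster.proc' job IDs."""
    seen, failed = set(), set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed shape
        job_id, message = match.groups()
        seen.add(job_id)
        if FAILURE_TEXT in message:
            failed.add(job_id)
    return failed, seen


sample = [
    "11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK",
    "11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes "
    "uninitialized stack frame ***: condor_exec.8.0 terminated",
    # Hypothetical line for a second job, not from the real log:
    "11/02/15 11:20:01 (9.0) (2889701):Read: Read headers OK",
]

failed, seen = restart_failure_share(sample)
print(f"{len(failed)} of {len(seen)} jobs hit the longjmp abort")
```

To run it against a real log, pass the open file object instead of the sample list; a job counts as failed if any of its lines contains the message.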

thanks
Todd