
Re: [Condor-users] [CondorLIGO] Condor at AEI -- revisited, now with segfault!



I have also observed that condor_dagman will segfault under Linux
when strace attaches to the process, e.g., Condor ticket 17526.

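If you still want to see which system calls preceded the signal in
strace.err, here is a rough sketch in Python. It assumes the file is
ordinary strace output, in which a delivered signal appears on a line
beginning with "--- SIGSEGV". The function name and the context size
are only illustrative and are not part of Condor or strace.

```python
import re

# A signal delivery in strace output looks like
#   --- SIGSEGV (Segmentation fault) @ 0 (0) ---        (older strace)
#   --- SIGSEGV {si_signo=SIGSEGV, ...} ---             (newer strace)
# optionally prefixed by a pid when strace follows child processes.
SIGNAL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?--- (SIG[A-Z0-9]+)\b")


def last_calls_before_signal(lines, signal="SIGSEGV", context=5):
    """Return up to `context` trace lines preceding the first `signal` line.

    Hypothetical helper for illustration; returns [] if the signal
    never appears in the trace.
    """
    for i, line in enumerate(lines):
        match = SIGNAL_LINE.match(line)
        if match and match.group(1) == signal:
            start = max(0, i - context)
            return [l.rstrip("\n") for l in lines[start:i]]
    return []


# Example use (assumes strace.err is in the current directory):
#   with open("strace.err") as f:
#       for call in last_calls_before_signal(f.readlines()):
#           print(call)
```

Given the behaviour described above, the calls it prints may reflect
strace's own interaction with the process rather than the original
crash, so treat them as a hint only.
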
Thanks.

On Tue, Apr 08, 2008 at 05:53:47PM +0200, Lucia Santamaria wrote:
> 
> Hi everybody,
> 
> thanks a lot for your answer regarding my evicted jobs; indeed, I should 
> have been more careful and sent the logs to a local directory instead of 
> writing to NFS.
> 
> Now we're facing another problem with Condor on deepthought, which 
> prevents us from even getting to the point where the previous problem was 
> happening. The DAGs now die with an unexplained signal 11 about 2-4 min 
> after submission.
> 
> I have located the first job that dies and run it under strace after 
> setting
> environment = _CONDOR_DAGMAN_LOG=ihope.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
> 
> The call for strace is:
> 
> lucia@deepthought:/scratch/tmp/lucia/play_local/857232370-859651570$ 
> strace /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile ihope.dag.lock 
> -AutoRescue 1 -DoRescueFrom 0 -Condorlog mylog.log -Dag ihope.dag 
> >strace.out 2>strace.err
> 
> which produces an empty mylog.log file, an empty strace.out file and a 
> non-empty strace.err with a SEGFAULT (aha!).
> You can find it here:
> http://pandora.aei.mpg.de/~lucia/strace.err
> 
> Also the corresponding ihope.dag.dagman.out is here:
> http://pandora.aei.mpg.de/~lucia/ihope.dag.dagman.out
> and you can see that there's no error message at the end; it simply stops.
> 
> Also I must add that:
> 
> lucia@deepthought:~$ ldd /usr/bin/condor_dagman
>         libdl.so.2 => /lib/libdl.so.2 (0x00002b3137ed9000)
>         libcrypt.so.1 => /lib/libcrypt.so.1 (0x00002b3137fdd000)
>         libresolv.so.2 => /lib/libresolv.so.2 (0x00002b3138111000)
>         libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0x00002b3138226000)
>         libm.so.6 => /lib/libm.so.6 (0x00002b3138403000)
>         libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00002b3138585000)
>         libc.so.6 => /lib/libc.so.6 (0x00002b3138692000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002b3137dc1000)
> 
> This is the 7.10-pre one, unchanged from Kent's upload,
> dated Mar 19, 5230376 bytes
> 
> I'd like to track down this segfault myself, but you might understand that 
> the output in strace.err scares me a bit.
> 
> Thank you very much for any insight you can provide.
> Lucia
> 
> -- 
> --------------------------------------------
> Lucia Santamaria
> Max-Planck-Institut fuer Gravitationsphysik
> Albert-Einstein-Institut
> Am Muehlenberg 1, 17746 Golm, Germany
> Office: +49(0)331-567-7181
> ---------------------------------------------
> _______________________________________________
> Condorligo mailing list
> Condorligo@xxxxxxxxxx
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

-- 
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson