
Re: [Condor-users] [CondorLIGO] Condor at AEI -- revisited, now with segfault!




Hi Stuart,

I have also observed that condor_dagman will segfault under Linux
when strace attaches to the process, e.g., Condor ticket 17526.

I have now read the ticket you refer to; it is very enlightening.
Does this mean that I cannot debug the job that is causing me problems?
Surely there is some tool other than strace that I can use to look at my killed jobs?

Thanks,
Lucia


Thanks.

On Tue, Apr 08, 2008 at 05:53:47PM +0200, Lucia Santamaria wrote:

Hi everybody,

thanks a lot for your answer regarding my evicted jobs; indeed, I should
have been more careful and sent the logs to a local directory instead of
writing to NFS.

Now we're facing another problem with Condor on deepthought, which
prevents us from even getting to the point where the previous problem was
happening. The DAGs now die with an unexplained signal 11 about 2-4 minutes
after submission.
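For readers unsure what "signal 11" refers to: on Linux it is SIGSEGV, a segmentation fault. The following is only a minimal sketch, not something taken from this thread, of how a wrapper script could report that a child process was killed by a signal. The names describe_exit and run_and_describe are hypothetical; the sketch relies on the documented behaviour of Python's subprocess module on POSIX, where a negative return code means the child died from that signal.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as -signum.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run a command (a list of strings) and describe how it ended."""
    result = subprocess.run(argv)
    return describe_exit(result.returncode)
```

Calling run_and_describe with the command line of the failing job, given as a list of strings, would then return the description of how it ended.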

I have located the first job that dies and I have run it with strace after
setting
environment = _CONDOR_DAGMAN_LOG=ihope.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0

The call for strace is:

lucia@deepthought:/scratch/tmp/lucia/play_local/857232370-859651570$
strace /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile ihope.dag.lock
-AutoRescue 1 -DoRescueFrom 0 -Condorlog mylog.log -Dag ihope.dag
> strace.out 2> strace.err

which produces an empty mylog.log file, an empty strace.out file and a
non-empty strace.err with a SEGFAULT (aha!).
You can find it here:
http://pandora.aei.mpg.de/~lucia/strace.err

Also the corresponding ihope.dag.dagman.out is here:
http://pandora.aei.mpg.de/~lucia/ihope.dag.dagman.out
and you can see that there is no error message at the end; it simply stops.

Also I must add that:

lucia@deepthought:~$ ldd /usr/bin/condor_dagman
        libdl.so.2 => /lib/libdl.so.2 (0x00002b3137ed9000)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x00002b3137fdd000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x00002b3138111000)
        libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0x00002b3138226000)
        libm.so.6 => /lib/libm.so.6 (0x00002b3138403000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00002b3138585000)
        libc.so.6 => /lib/libc.so.6 (0x00002b3138692000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b3137dc1000)

This is the 7.10-pre one, unchanged from Kent's upload,
dated Mar 19, 5230376 bytes

I'd like to track down this segfault myself, but you might understand that
the output in strace.err scares me a bit.
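A long strace log is usually easier to read when narrowed down to the few system calls just before the fault. The sketch below is only an illustration and makes assumptions: it assumes the signal is reported on a line containing the text "SIGSEGV" (the exact layout differs between strace versions), and context_before_signal is a hypothetical helper name.

```python
def context_before_signal(lines, marker="SIGSEGV", context=5):
    """Return the first line containing `marker`, preceded by up to
    `context` lines of the log that came immediately before it."""
    for i, line in enumerate(lines):
        if marker in line:
            start = max(0, i - context)
            return lines[start:i + 1]
    # Marker not found: nothing to show.
    return []


# Hypothetical usage against the file from this message:
#   with open("strace.err") as f:
#       for line in context_before_signal(f.read().splitlines(), context=20):
#           print(line)
```

The last few calls before the marker (for example a failed open or mmap) are often the most useful starting point, although they do not always point to the actual cause.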

Thank you very much for any insight you can provide.
Lucia

--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------
_______________________________________________
Condorligo mailing list
Condorligo@xxxxxxxxxx
http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

--
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson

