
[Condor-users] Condor at AEI -- revisited, now with segfault!




Hi everybody,

thanks a lot for your answer regarding my evicted jobs; indeed, I should have been more careful and sent the logs to a local directory instead of writing to NFS.

Now we're facing another problem with Condor on deepthought, which prevents us from even getting to the point where the previous problem was happening: the DAGs die with an unexplained signal 11 about 2-4 min after submission.

I have located the first job that dies and I have run it with strace after setting environment = _CONDOR_DAGMAN_LOG=ihope.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0

The call for strace is:

lucia@deepthought:/scratch/tmp/lucia/play_local/857232370-859651570$ strace /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile ihope.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Condorlog mylog.log -Dag ihope.dag > strace.out 2>strace.err

which produces an empty mylog.log file, an empty strace.out file and a non-empty strace.err with a SEGFAULT (aha!).
You can find it here:
http://pandora.aei.mpg.de/~lucia/strace.err
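
For anyone who wants to skim a trace like this without reading all of it, here is a minimal sketch of how the system calls just before the crash could be pulled out. It assumes that strace marks the signal with a line containing the text SIGSEGV; the function name context_before_signal and the keep parameter are made up for illustration and are not part of strace or Condor.

```python
from collections import deque


def context_before_signal(lines, signal="SIGSEGV", keep=5):
    """Return the last `keep` lines seen before the first line that
    mentions `signal`, followed by that line itself.

    Returns an empty list if the signal never appears in `lines`.
    """
    recent = deque(maxlen=keep)  # rolling window of the most recent lines
    for line in lines:
        line = line.rstrip("\n")
        if signal in line:
            return list(recent) + [line]
        recent.append(line)
    return []
```

Passing an open file handle for strace.err as `lines` returns the few calls leading up to the segfault, which is usually a more manageable place to start reading than the top of the file.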

Also the corresponding ihope.dag.dagman.out is here:
http://pandora.aei.mpg.de/~lucia/ihope.dag.dagman.out
and you can see that there's no error message at the end; it simply stops.

Also I must add that:

lucia@deepthought:~$ ldd /usr/bin/condor_dagman
        libdl.so.2 => /lib/libdl.so.2 (0x00002b3137ed9000)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x00002b3137fdd000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x00002b3138111000)
        libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0x00002b3138226000)
        libm.so.6 => /lib/libm.so.6 (0x00002b3138403000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00002b3138585000)
        libc.so.6 => /lib/libc.so.6 (0x00002b3138692000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b3137dc1000)

This is the 7.10-pre one, unchanged from Kent's upload,
dated Mar 19, 5230376 bytes

I'd like to track down this segfault myself, but you can probably understand that the output in strace.err scares me a bit.

Thank you very much for any insight you can provide.
Lucia

--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------