[HTCondor-users] starter segfault

Hello,

Has anybody had a problem with the condor_starter segfaulting while
executing jobs in the standard universe?

The job is a simple C program compiled with condor_compile and gcc.
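For reference, the compile step is along these lines (the source file
name here is illustrative):

condor_compile gcc -o armstrong_number_finder armstrong_number_finder.c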

During compilation I receive: 

warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared
libraries from the glibc version used for linking

My submit file specifies the same OS and Arch, so the warning above
shouldn't be a problem.

Executable              = /homes/mjb04/Condor/standard_universe_test/armstrong_number_finder
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Universe                = standard
Output                  = cputest.$(Process).out
Error                   = cputest.$(Process).err
Log                     = cputest.$(Process).log

Requirements = DoC_OS_Distribution == "Ubuntu" && \
               DoC_OS_Release == "12.04" && \
               Arch == "X86_64"

Queue 5
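Machines evidently do match this expression (the jobs briefly start);
it can also be checked directly against the pool with condor_status:

condor_status -constraint 'DoC_OS_Distribution == "Ubuntu" && DoC_OS_Release == "12.04" && Arch == "X86_64"'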

When I run the jobs, condor_q shows them as idle. Every once in a
while (I assume when the matchmaker does its negotiation cycle), the
jobs run for a second or two and are then put back into the idle state.

When I do condor_q -analyse, it reports:

024.000:  Request has not yet been considered by the matchmaker.

When I look at one of the job logs, e.g. cputest.0.log, it reports:

...
001 (024.000.000) 12/06 22:16:17 Job executing on host: <123.123.5.37:50497?sock=18345_244b_3>
...
007 (024.000.000) 12/06 22:16:17 Shadow exception!
        Unable to talk to job: disconnected

        86  -  Run Bytes Sent By Job
        48  -  Run Bytes Received By Job


I look at /var/log/condor/ShadowLog on the submit machine and see:

12/07/12 14:09:22 (pid:11817) (24.1) (11817):FileLock object is updating timestamp on: /tmp/condorLocks/74/21/363588223312040.lockc
12/07/12 14:09:22 (pid:11817) (24.1) (11817):UserLog = /homes/mjb04/Condor/standard_universe_test/cputest.1.log
12/07/12 14:09:22 (pid:11817) (24.1) (11817):My_Filesystem_Domain = "doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817):My_UID_Domain = "doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817):HandleSyscalls: about to chdir(/homes/mjb04/Condor/standard_universe_test)
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Shadow: Starting to field syscall requests
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall -34 <CONDOR_register_fs_domain>
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   FS_Domain = "doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall -33 <CONDOR_register_uid_domain>
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   UID_Domain = "doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall -80 <CONDOR_register_ckpt_platform>
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   len = 30
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall -58 <CONDOR_register_ckpt_server>
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall -59 <CONDOR_register_arch>
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall -60 <CONDOR_register_opsys>
12/07/12 14:09:22 (pid:11817) (24.1) (11817):   ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):condor_read(): Socket closed when trying to read 5 bytes from
12/07/12 14:09:22 (pid:11817) (24.1) (11817):IO: EOF reading packet header
12/07/12 14:09:22 (pid:11817) (24.1) (11817):ERROR "Unable to talk to job: disconnected" at line 135 in file /slots/01/dir_16105/userdir/src/condor_syscall_lib/receivers.cpp
12/07/12 14:09:22 (pid:11817) (24.1) (11817):FileLock::obtain(1) - @1354889362.732862 lock on /tmp/condorLocks/74/21/363588223312040.lockc now WRITE

...

12/07/12 14:09:22 (pid:11817) (24.1) (11817):Shadow: Entered DoCleanup()
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Shadow: DoCleanup: unlinking TmpCkpt '/var/spool/condor/24/1/cluster24.proc1.subproc0.tmp'
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Trying to unlink /var/spool/condor/24/1/cluster24.proc1.subproc0.tmp
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Can't get address for checkpoint server host (NULL): Success
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Remove from ckpt server returns -1

Then I ssh to the executing host and find no mention of the
submitting computer's hostname or IP address in any of the
StarterLogs under /var/log/condor/.
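A search along these lines finds nothing, where SUBMIT_HOSTNAME
stands in for our submit machine's name and IP:

grep -i 'SUBMIT_HOSTNAME' /var/log/condor/StarterLog*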

Then, I look at the output of dmesg on the executing host, and it
reads:

[939511.146750] condor_starter.[13907]: segfault at 0 ip 00007ff364b6abde sp 00007fff4350fc80 error 4 in libcondor_utils_7_8_6.so[7ff3649cd000+34c000]

repeated line after line after line.

It appears that the starter is segfaulting before it has a chance to
write to its logs.
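For what it's worth, the faulting offset within the library works out
to 0x7ff364b6abde - 0x7ff3649cd000 = 0x19dbde, so someone with a
debug build of 7.8.6 could presumably name the crashing function with
something like this (the library path is a guess for a typical install):

addr2line -f -e /usr/lib/libcondor_utils_7_8_6.so 0x19dbde

Presumably setting CREATE_CORE_FILES = True in the execute node's
Condor configuration would also make the starter leave a core file to
inspect.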

So, can anyone tell from the information I have supplied here what is
causing this error? Is it a problem with the starter, or is it a
configuration problem? One thought I have had is that the starter
cannot communicate with the submitter's shadow, and segfaults as a
result.

Jobs submitted using the vanilla universe execute with no problems;
it is only the standard universe that has these problems.
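For comparison, the vanilla test is just the same program built with
plain gcc instead of condor_compile and submitted with
Universe = vanilla, roughly:

gcc -o armstrong_number_finder armstrong_number_finder.c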

Any help or suggestions would be appreciated.

cheers,

Michael Breza