[HTCondor-users] starter segfault
- Date: Fri, 7 Dec 2012 15:17:58 +0000
- From: Michael John Breza <mjb04@xxxxxxxxxxxx>
- Subject: [HTCondor-users] starter segfault
Hello,
Has anybody had a problem with the starter segfaulting while executing
jobs in the standard universe?
The job is a simple C program compiled with condor_compile and gcc.
During compilation I receive:
warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared
libraries from the glibc version used for linking
My submit file specifies the same OS and Arch, so the warning above
shouldn't be a problem.
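(As a sanity check — this command is my addition, not something from the thread — the glibc warning only matters if the versions differ, which can be verified by printing the glibc version on both the submit host and the execute host and comparing:)

```shell
# Run on both the submit host (where condor_compile linked the binary)
# and the execute host; the statically linked binary still loads NSS
# shared libraries at run time for calls like getaddrinfo, so the two
# glibc versions should match.
getconf GNU_LIBC_VERSION
```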
Executable = /homes/mjb04/Condor/standard_universe_test/armstrong_number_finder
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Universe = standard
Output = cputest.$(Process).out
Error = cputest.$(Process).err
Log = cputest.$(Process).log
Requirements = DoC_OS_Distribution == "Ubuntu" && \
DoC_OS_Release == "12.04" && \
Arch == "X86_64"
Queue 5
When I run the job, condor_q shows the jobs as idle. Every once in a
while (I assume when the matchmaker does its thing) the jobs run for a
second or two and are then put back into idle.
When I do condor_q -analyze it reports:
024.000: Request has not yet been considered by the matchmaker.
When I look at a log file, e.g. cputest.0.log, it reports:
...
001 (024.000.000) 12/06 22:16:17 Job executing on host:
<123.123.5.37:50497?sock=18345_244b_3>
...
007 (024.000.000) 12/06 22:16:17 Shadow exception!
Unable to talk to job: disconnected
86 - Run Bytes Sent By Job
48 - Run Bytes Received By Job
I look under /var/log/condor/ShadowLog and see:
12/07/12 14:09:22 (pid:11817) (24.1) (11817):FileLock object is
updating timestamp on: /tmp/condorLocks/74/21/363588223312040.lockc
12/07/12 14:09:22 (pid:11817) (24.1) (11817):UserLog =
/homes/mjb04/Condor/standard_universe_test/cputest.1.log
12/07/12 14:09:22 (pid:11817) (24.1) (11817):My_Filesystem_Domain =
"doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817):My_UID_Domain =
"doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817):HandleSyscalls: about to
chdir(/homes/mjb04/Condor/standard_universe_test)
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Shadow: Starting to field
syscall requests
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall
-34 <CONDOR_register_fs_domain>
12/07/12 14:09:22 (pid:11817) (24.1) (11817): FS_Domain =
"doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817): ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall
-33 <CONDOR_register_uid_domain>
12/07/12 14:09:22 (pid:11817) (24.1) (11817): UID_Domain =
"doc.ic.ac.uk"
12/07/12 14:09:22 (pid:11817) (24.1) (11817): ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall
-80 <CONDOR_register_ckpt_platform>
12/07/12 14:09:22 (pid:11817) (24.1) (11817): len = 30
12/07/12 14:09:22 (pid:11817) (24.1) (11817): ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall
-58 <CONDOR_register_ckpt_server>
12/07/12 14:09:22 (pid:11817) (24.1) (11817): ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall
-59 <CONDOR_register_arch>
12/07/12 14:09:22 (pid:11817) (24.1) (11817): ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Got request for syscall
-60 <CONDOR_register_opsys>
12/07/12 14:09:22 (pid:11817) (24.1) (11817): ret_val = 0, errno = 0
12/07/12 14:09:22 (pid:11817) (24.1) (11817):condor_read(): Socket
closed when trying to read 5 bytes from
12/07/12 14:09:22 (pid:11817) (24.1) (11817):IO: EOF reading packet
header
12/07/12 14:09:22 (pid:11817) (24.1) (11817):ERROR "Unable to talk to
job: disconnected" at line 135 in file
/slots/01/dir_16105/userdir/src/condor_syscall_lib/receivers.cpp
12/07/12 14:09:22 (pid:11817) (24.1) (11817):FileLock::obtain(1) -
@1354889362.732862 lock on /tmp/condorLocks/74/21/363588223312040.lockc now WRITE
...
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Shadow: Entered
DoCleanup()
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Shadow: DoCleanup:
unlinking TmpCkpt
'/var/spool/condor/24/1/cluster24.proc1.subproc0.tmp'
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Trying to unlink
/var/spool/condor/24/1/cluster24.proc1.subproc0.tmp
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Can't get address for
checkpoint server host (NULL): Success
12/07/12 14:09:22 (pid:11817) (24.1) (11817):Remove from ckpt server
returns -1
Then I ssh to the executing host and find no mention of the
submitting computer's hostname or IP address in any of the
/var/log/condor/StarterLog(s).
Then, I look at the output of dmesg on the executing host, and it
reads:
[939511.146750] condor_starter.[13907]: segfault at 0 ip
00007ff364b6abde sp 00007fff4350fc80 error 4 in
libcondor_utils_7_8_6.so[7ff3649cd000+34c000]
This line is repeated over and over.
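(Not from the original message, but the dmesg line carries enough information to locate the crash site: subtracting the library's load base from the faulting instruction pointer gives the offset inside libcondor_utils_7_8_6.so, which debug symbols could then map to a function. The library path in the comment is a guess:)

```shell
# From the dmesg line: ip = 0x7ff364b6abde, load base = 0x7ff3649cd000.
# Their difference is the faulting instruction's offset inside
# libcondor_utils_7_8_6.so.
printf 'offset: 0x%x\n' $(( 0x7ff364b6abde - 0x7ff3649cd000 ))
# prints: offset: 0x19dbde

# With debug symbols for the library installed on the execute node, the
# offset can be mapped to a function and source line (path hypothetical):
#   addr2line -f -e /usr/lib/condor/libcondor_utils_7_8_6.so 0x19dbde
```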
It appears that the starter is segfaulting before it has a chance to
write to its logs.
So, does anyone know what is causing this error from the information I
have supplied here? Is it a problem with the starter, or is it a
configuration problem? One thought I have had is that the starter
cannot communicate with the submitter's shadow and so segfaults.
Jobs submitted using the vanilla universe execute with no problems. It
is only the standard universe that has these problems.
Any help or suggestions would be appreciated.
cheers,
Michael Breza