
[Condor-users] Segmentation fault problem in Vanilla Universe - Program runs fine outside condor...



Hi Guys,

These jobs run fine when started manually, on both x86_64 and INTEL architectures and on both RHEL4 and RedHat7. However, they fail with a core dump when run under Condor.
Logs enclosed below:

I refer specifically to:

6/26 17:20:40 Create_Process succeeded, pid=7907
6/26 17:21:49 Process exited, pid=7907, status=174
6/26 17:21:49 Got SIGQUIT.  Performing fast shutdown.

and

6/26 17:21:49 Error: can't find resource with capability (<172.16.50.10:9801>#1994453280)

Anyone have any ideas?

Any help is MUCH appreciated :-D
Many thanks
Jon Rea



From the Run: The problem ... The application output is of course truncated at the point of the segmentation fault...

In file: 16PK__111_6.err

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC        Routine   Line      Source
condor_exec.exe    082A0B99  Unknown   Unknown   Unknown

Stack trace terminated abnormally.
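
For what it's worth, the 174 in the forrtl message matches the status=174 reported in the starter log, so the exit status seems to come from the Fortran runtime's error handler rather than from Condor itself. Below is a minimal Python sketch for pulling the pid and exit status out of starter log lines, so the two can be compared across several runs. The function name find_exit_statuses is hypothetical, and the sample text is copied from the starter log in this message.

```python
import re

# Matches starter-log lines such as:
#   6/26 17:21:49 Process exited, pid=7907, status=174
EXIT_LINE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")


def find_exit_statuses(log_text):
    """Return a list of (pid, status) tuples found in starter log text."""
    return [(int(pid), int(status)) for pid, status in EXIT_LINE.findall(log_text)]


# Lines copied from the starter log in this message.
sample = """\
6/26 17:20:40 Create_Process succeeded, pid=7907
6/26 17:21:49 Process exited, pid=7907, status=174
6/26 17:21:49 Got SIGQUIT.  Performing fast shutdown.
"""

print(find_exit_statuses(sample))  # prints [(7907, 174)]
```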




From Condor:

Here is the Job file:

universe = vanilla
should_transfer_files = YES
WhenToTransferOutput = ON_EXIT_OR_EVICT
requirements = (OpSys == "LINUX" && Arch == "INTEL" && Machine == "morticia")
Executable = /shared_mount/plop/plop
Log = log.out
transfer_input_files=../sg/16PK__111_6.con
Input = ../sg/16PK__111_6.path
Output = 16PK__111_6.out
Error = 16PK__111_6.err
Queue

Master log

6/26 17:19:48 ******************************************************
6/26 17:19:48 ** condor_master (CONDOR_MASTER) STARTING UP
6/26 17:19:48 ** /opt/condor-6.6.10/sbin/condor_master
6/26 17:19:48 ** $CondorVersion: 6.6.10 Jun 13 2005 $
6/26 17:19:48 ** $CondorPlatform: I386-LINUX_RH9 $
6/26 17:19:48 ** PID = 7892
6/26 17:19:48 ******************************************************
6/26 17:19:48 Using config file: /condor/condor_config
6/26 17:19:48 Using local config files: /condor/condor_config.local
6/26 17:19:48 DaemonCore: Command Socket at <172.16.50.10:9746>
6/26 17:19:48 Started DaemonCore process "/opt/condor-6.6.10/sbin/condor_startd", pid and pgroup = 7893
6/26 17:19:48 Started DaemonCore process "/opt/condor-6.6.10/sbin/condor_schedd", pid and pgroup = 7895

Schedd log

6/26 17:19:48 ******************************************************
6/26 17:19:48 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
6/26 17:19:48 ** /opt/condor-6.6.10/sbin/condor_schedd
6/26 17:19:48 ** $CondorVersion: 6.6.10 Jun 13 2005 $
6/26 17:19:48 ** $CondorPlatform: I386-LINUX_RH9 $
6/26 17:19:48 ** PID = 7895
6/26 17:19:48 ******************************************************
6/26 17:19:48 Using config file: /condor/condor_config
6/26 17:19:48 Using local config files: /condor/condor_config.local
6/26 17:19:48 DaemonCore: Command Socket at <172.16.50.10:9839>
6/26 17:20:26 DaemonCore: Command received via UDP from host <172.16.50.10:9980>
6/26 17:20:26 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
6/26 17:20:26 Sent ad to central manager for jr0407@xxxxxxxxxxxxxx
6/26 17:20:26 Called reschedule_negotiator()
6/26 17:20:36 DaemonCore: Command received via TCP from host <172.16.50.11:9702>
6/26 17:20:36 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
6/26 17:20:36 Negotiating for owner: jr0407@xxxxxxxxxxxxxx
6/26 17:20:36 Checking consistency running and runnable jobs
6/26 17:20:36 Tables are consistent
6/26 17:20:36 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
6/26 17:20:36 Sent ad to central manager for jr0407@xxxxxxxxxxxxxx
6/26 17:20:38 Started shadow for job 32.0 on "<172.16.50.10:9801>", (shadow pid = 7904)
6/26 17:20:41 Sent ad to central manager for jr0407@xxxxxxxxxxxxxx
6/26 17:21:49 Shadow pid 7904 for job 32.0 exited with status 100
6/26 17:21:49 match (<172.16.50.10:9801>#1994453280) out of jobs (cluster id 32); relinquishing
6/26 17:21:49 Sent RELEASE_CLAIM to startd on <172.16.50.10:9801>
6/26 17:21:49 Match record (<172.16.50.10:9801>, 32, -1) deleted
6/26 17:21:49 DaemonCore: Command received via TCP from host <172.16.50.10:9956>
6/26 17:21:49 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
6/26 17:21:49 Got VACATE_SERVICE from <172.16.50.10:9956>
6/26 17:25:41 Sent owner (0 jobs) ad to central manager

Shadow log

6/26 17:20:38 ******************************************************
6/26 17:20:38 ** condor_shadow (CONDOR_SHADOW) STARTING UP
6/26 17:20:38 ** /opt/condor-6.6.10/sbin/condor_shadow
6/26 17:20:38 ** $CondorVersion: 6.6.10 Jun 13 2005 $
6/26 17:20:38 ** $CondorPlatform: I386-LINUX_RH9 $
6/26 17:20:38 ** PID = 7904
6/26 17:20:38 ******************************************************
6/26 17:20:38 Using config file: /condor/condor_config
6/26 17:20:38 Using local config files: /condor/condor_config.local
6/26 17:20:38 DaemonCore: Command Socket at <172.16.50.10:9980>
6/26 17:20:39 Initializing a VANILLA shadow
6/26 17:20:39 (32.0) (7904): Request to run on <172.16.50.10:9801> was ACCEPTED
6/26 17:21:49 (32.0) (7904): Job 32.0 terminated: exited with status 174
6/26 17:21:49 (32.0) (7904): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

Starter log

6/26 17:20:39 ******************************************************
6/26 17:20:39 ** condor_starter (CONDOR_STARTER) STARTING UP
6/26 17:20:39 ** /opt/condor-6.6.10/sbin/condor_starter
6/26 17:20:39 ** $CondorVersion: 6.6.10 Jun 13 2005 $
6/26 17:20:39 ** $CondorPlatform: I386-LINUX_RH9 $
6/26 17:20:39 ** PID = 7905
6/26 17:20:39 ******************************************************
6/26 17:20:39 Using config file: /condor/condor_config
6/26 17:20:39 Using local config files: /condor/condor_config.local
6/26 17:20:39 DaemonCore: Command Socket at <172.16.50.10:9883>
6/26 17:20:39 Done setting resource limits
6/26 17:20:39 Starter communicating with condor_shadow <172.16.50.10:9980>
6/26 17:20:39 Submitting machine is "morticia.clust"
6/26 17:20:39 File transfer completed successfully.
6/26 17:20:40 Starting a VANILLA universe job with ID: 32.0
6/26 17:20:40 IWD: /condor/execute/dir_7905
6/26 17:20:40 Input file: /condor/execute/dir_7905/16PK__111_6.path
6/26 17:20:40 Output file: /condor/execute/dir_7905/16PK__111_6.out
6/26 17:20:40 Error file: /condor/execute/dir_7905/16PK__111_6.err
6/26 17:20:40 About to exec /condor/execute/dir_7905/condor_exec.exe
6/26 17:20:40 Create_Process succeeded, pid=7907
6/26 17:21:49 Process exited, pid=7907, status=174
6/26 17:21:49 Got SIGQUIT.  Performing fast shutdown.
6/26 17:21:49 ShutdownFast all jobs.
6/26 17:21:49 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
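
The starter log reports "Done setting resource limits" before the exec, so one thing that could be compared between a manual run and a run under Condor is the set of limits the process actually sees. Below is a minimal Python sketch, assuming Python is available on the execute machine. The choice of limits and the name current_limits are illustrative only, and nothing here comes from the Condor logs.

```python
import resource

# Limits chosen for illustration; this selection is an assumption, not
# something taken from the Condor configuration or logs.
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "core": resource.RLIMIT_CORE,
    "data": resource.RLIMIT_DATA,
    "address_space": resource.RLIMIT_AS,
}


def current_limits():
    """Return {name: (soft, hard)} for the calling process.

    A value equal to resource.RLIM_INFINITY means "unlimited".
    """
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(name, soft, hard)
```

Running this once from an interactive shell and once as a vanilla universe job on the same machine would show whether the two environments differ in any of these limits.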

Startd log

6/26 17:19:48 ******************************************************
6/26 17:19:48 ** condor_startd (CONDOR_STARTD) STARTING UP
6/26 17:19:48 ** /opt/condor-6.6.10/sbin/condor_startd
6/26 17:19:48 ** $CondorVersion: 6.6.10 Jun 13 2005 $
6/26 17:19:48 ** $CondorPlatform: I386-LINUX_RH9 $
6/26 17:19:48 ** PID = 7893
6/26 17:19:48 ******************************************************
6/26 17:19:48 Using config file: /condor/condor_config
6/26 17:19:48 Using local config files: /condor/condor_config.local
6/26 17:19:48 DaemonCore: Command Socket at <172.16.50.10:9801>
6/26 17:19:49 New machine resource allocated
6/26 17:19:49 About to run initial benchmarks.
6/26 17:19:54 Completed initial benchmarks.
6/26 17:19:54 State change: IS_OWNER is false
6/26 17:19:54 Changing state: Owner -> Unclaimed
6/26 17:20:36 DaemonCore: Command received via UDP from host <172.16.50.11:9858>
6/26 17:20:36 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
6/26 17:20:36 match_info called
6/26 17:20:36 Received match <172.16.50.10:9801>#1994453280
6/26 17:20:36 State change: match notification protocol successful
6/26 17:20:36 Changing state: Unclaimed -> Matched
6/26 17:20:36 DaemonCore: Command received via TCP from host <172.16.50.10:9757>
6/26 17:20:36 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
6/26 17:20:36 Request accepted.
6/26 17:20:36 Remote owner is jr0407@xxxxxxxxxxxxxx
6/26 17:20:36 State change: claiming protocol successful
6/26 17:20:36 Changing state: Matched -> Claimed
6/26 17:20:39 DaemonCore: Command received via TCP from host <172.16.50.10:9793>
6/26 17:20:39 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
6/26 17:20:39 Got activate_claim request from shadow (<172.16.50.10:9793>)
6/26 17:20:39 Remote job ID is 32.0
6/26 17:20:39 Got universe "VANILLA" (5) from request classad
6/26 17:20:39 State change: claim-activation protocol successful
6/26 17:20:39 Changing activity: Idle -> Busy
6/26 17:21:49 DaemonCore: Command received via TCP from host <172.16.50.10:9945>
6/26 17:21:49 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
6/26 17:21:49 Called deactivate_claim_forcibly()
6/26 17:21:49 Starter pid 7905 exited with status 0
6/26 17:21:49 State change: starter exited
6/26 17:21:49 Changing activity: Busy -> Idle
6/26 17:21:49 DaemonCore: Command received via UDP from host <172.16.50.10:9850>
6/26 17:21:49 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
6/26 17:21:49 State change: received RELEASE_CLAIM command
6/26 17:21:49 Changing state and activity: Claimed/Idle -> Preempting/Vacating
6/26 17:21:49 State change: No preempting claim, returning to owner
6/26 17:21:49 Changing state and activity: Preempting/Vacating -> Owner/Idle
6/26 17:21:49 State change: IS_OWNER is false
6/26 17:21:49 Changing state: Owner -> Unclaimed
6/26 17:21:49 DaemonCore: Command received via UDP from host <172.16.50.10:9999>
6/26 17:21:49 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
6/26 17:21:49 Error: can't find resource with capability (<172.16.50.10:9801>#1994453280)