[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] shadow exception of few dagman node jobs



Hello,

we've found few failing dagman node jobs with the message "Shadow execption!".


This is what we found in the users condor logs:

:

001 (8766221.000.000) 07/21 14:21:40 Job executing on host: <10.10.0.62:50850>

:

007 (8766221.000.000) 07/21 15:19:38 Shadow exception!
        Unable to talk to job: Connection timed out

        269  -  Run Bytes Sent By Job
        11200862  -  Run Bytes Received By Job



Then I was looking onto the execution host. It seems that parts or all of the
STDERR output moved into the system logs of the execution host. I can't
interprete the end of the log block but it might point to a connectivity
problem. Do you see something else?
Do you need additional log or configuration information?


Thank you and cheers,
Henning

<snip>
:
07/21/13 14:21:39 ********** STARTER starting up ***********
07/21/13 14:21:39 ** $CondorVersion: 7.8.6 Oct 24 2012 BuildID: 73238 $
07/21/13 14:21:39 ** $CondorPlatform: x86_64_deb_6.0 $
07/21/13 14:21:39 ******************************************
07/21/13 14:21:39 Submitting machine is "atlas3.atlas.local"
07/21/13 14:21:39 EventHandler {
07/21/13 14:21:39 	func = 0x413ff0
07/21/13 14:21:39 	mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
07/21/13 14:21:39 }
07/21/13 14:21:39 my_popenv failed
07/21/13 14:21:39 Done setting resource limits
07/21/13 14:21:39 	*FSM* Transitioning to state "GET_PROC"
07/21/13 14:21:39 	*FSM* Executing state func "get_proc()" [  ]
07/21/13 14:21:39 Entering get_proc()
07/21/13 14:21:39 Entering get_job_info()
07/21/13 14:21:39 Startup Info:
07/21/13 14:21:39 	Version Number: 1
07/21/13 14:21:39 	Id: 8766221.0
07/21/13 14:21:39 	JobClass: STANDARD
07/21/13 14:21:39 	Uid: 5201
07/21/13 14:21:39 	Gid: 5000
07/21/13 14:21:39 	VirtPid: -1
07/21/13 14:21:39 	SoftKillSignal: 20
07/21/13 14:21:39 	Cmd: "/home/egoetz/opt/lscsoft/bin/lalapps_TwoSpect"
07/21/13 14:21:39 	Args: "--config=/home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708/in/5087"
07/21/13 14:21:39 	Env: ""
07/21/13 14:21:39 	Iwd: "/home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708"
07/21/13 14:21:39 	Ckpt Wanted: TRUE
07/21/13 14:21:39 	Is Restart: FALSE
07/21/13 14:21:39 	Core Limit Valid: TRUE
07/21/13 14:21:39 	Coredump Limit 0
07/21/13 14:21:39 User uid set to 5201
07/21/13 14:21:39 User uid set to 5000
07/21/13 14:21:39 User Process 8766221.0 {
07/21/13 14:21:39   cmd = /home/egoetz/opt/lscsoft/bin/lalapps_TwoSpect
07/21/13 14:21:39   args = --config=/home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708/in/5087
07/21/13 14:21:39   env = _CONDOR_SLOT=slot1_1 CONDOR_SCRATCH_DIR=/local/condor/execute/dir_14199 _condor_BIND_ALL_INTERFACES=TRUE
07/21/13 14:21:39   local_dir = dir_14199
07/21/13 14:21:39   cur_ckpt = dir_14199/condor_exec.8766221.0
07/21/13 14:21:39   core_name = (either 'core' or 'core.<pid>')
07/21/13 14:21:39   uid = 5201, gid = 5000
07/21/13 14:21:39   v_pid = -1
07/21/13 14:21:39   pid = (NOT CURRENTLY EXECUTING)
07/21/13 14:21:39   exit_status_valid = FALSE
07/21/13 14:21:39   exit_status = (NEVER BEEN EXECUTED)
07/21/13 14:21:39   ckpt_wanted = TRUE
07/21/13 14:21:39   coredump_limit_exists = TRUE
07/21/13 14:21:39   coredump_limit = 0
07/21/13 14:21:39   soft_kill_sig = 20
07/21/13 14:21:39   job_class = STANDARD
07/21/13 14:21:39   state = NEW
07/21/13 14:21:39   new_ckpt_created = FALSE
07/21/13 14:21:39   ckpt_transferred = FALSE
07/21/13 14:21:39   core_created = FALSE
07/21/13 14:21:39   core_transferred = FALSE
07/21/13 14:21:39   exit_requested = FALSE
07/21/13 14:21:39   image_size = -1 blocks
07/21/13 14:21:39   user_time = 0
07/21/13 14:21:39   sys_time = 0
07/21/13 14:21:39   guaranteed_user_time = 0
07/21/13 14:21:39   guaranteed_sys_time = 0
07/21/13 14:21:39 }
07/21/13 14:21:39 	*FSM* Transitioning to state "GET_EXEC"
07/21/13 14:21:39 	*FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
07/21/13 14:21:39 Entering get_exec()
07/21/13 14:21:39 Executable is located on submitting host
07/21/13 14:21:39 Expanded executable name is "/local/condor/spool/6221/cluster8766221.ickpt.subproc0"
07/21/13 14:21:39 Going to try 3 attempts at getting the initial executable
07/21/13 14:21:39 Entering get_file( /local/condor/spool/6221/cluster8766221.ickpt.subproc0, dir_14199/condor_exec.8766221.0, 0755 )
07/21/13 14:21:39 Opened "/local/condor/spool/6221/cluster8766221.ickpt.subproc0" via file stream
07/21/13 14:21:40 Get_file() transferred 11199984 bytes, 14915848 bytes/second
07/21/13 14:21:40 Fetched orig ckpt file "/local/condor/spool/6221/cluster8766221.ickpt.subproc0" into "dir_14199/condor_exec.8766221.0" with 1 attempt
07/21/13 14:21:40 Executable 'dir_14199/condor_exec.8766221.0' is linked with "$CondorVersion: 7.8.6 Oct 24 2012 BuildID: 73238 $" on a "$CondorPlatform: x86_64_deb_6.0 $"
07/21/13 14:21:40 	*FSM* Executing transition function "spawn_all"
07/21/13 14:21:40 Pipe built
07/21/13 14:21:40 New pipe_fds[14,1]
07/21/13 14:21:40 cmd_fd = 14
07/21/13 14:21:40 Calling execve( "/local/condor/execute/dir_14199/condor_exec.8766221.0", "condor_exec.8766221.0", "-_condor_cmd_fd", "14", "--config=/home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708/in/5087", 0, "_CONDOR_SLOT=slot1_1", "CONDOR_SCRATCH_DIR=/local/condor/execute/dir_14199", "_condor_BIND_ALL_INTERFACES=TRUE", 0 )
07/21/13 14:21:40 Started user job - PID = 14259
07/21/13 14:21:40 cmd_fp = 0x1102dd0
07/21/13 14:21:40 end
07/21/13 14:21:40 	*FSM* Transitioning to state "SUPERVISE"
07/21/13 14:21:40 	*FSM* Got asynchronous event "CHILD_EXIT"
07/21/13 14:21:40 	*FSM* Executing transition function "reaper"
07/21/13 14:21:40 	*FSM* Aborting transition function "reaper"
07/21/13 14:21:40 	*FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]
07/21/13 15:19:41 	*FSM* Got asynchronous event "VACATE"
07/21/13 15:19:41 	*FSM* Executing transition function "req_vacate"
07/21/13 15:19:41 req_ckpt_exit_all: Proc -1 in state EXECUTING
07/21/13 15:19:41 Requesting Exit on proc #-1
07/21/13 15:19:41 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to user job 14259
07/21/13 15:19:41 UserProc::send_sig(): Sent signal SIGTSTP to user job 14259
07/21/13 15:19:41 	*FSM* Transitioning to state "TERMINATE"
07/21/13 15:19:41 	*FSM* Executing state func "terminate_all()" [  ]
07/21/13 15:19:41 	*FSM* Transitioning to state "TERMINATE_WAIT"
07/21/13 15:19:41 	*FSM* Executing state func "asynch_wait()" [ SUSPEND ALARM DIE CHILD_EXIT  ]
Program lalapps_TwoSpect 1.1.27 executed on Sun Jul 21 12:21:40 2013
%% LAL: 6.10.0.1 (CLEAN 73a202fb18603ed626a567394b8370fb9f6c0c14)
%% LALFrame: 1.0.10.1 (CLEAN 73a202fb18603ed626a567394b8370fb9f6c0c14)
%% LALPulsar: 1.7.0.1 (CLEAN 73a202fb18603ed626a567394b8370fb9f6c0c14)
%% LALApps: 6.12.0.1 (CLEAN 73a202fb18603ed626a567394b8370fb9f6c0c14)

Input parameters file: /home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708/in/5087
Input SFT file: /atlas/user/atlas3/egoetz/twospect/LHO/1800s_sfts/393-396Hz/*_v2.sft
Output directory: /home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708/output/5087
NOTE: random seed value 715637 is being used
WARNING: Marking bad SFTs
WARNING! Adjusting input minimum modulation depth to 1/2 a frequency bin!
sftType = standard
IFO = H1
Sky file = /home/egoetz/TwoSpect/S6/394-395Hz_H1/20130708/in/skygrid.5087
FAR for templates = 1e-18
Loading in SFTs... Duty factor = 0.409158
TF before weighting, mean subtraction: mean = 0.0404386, std. dev. = 0.347621
done
Assessing background... done
WARNING: 1 line(s) found.
Maximum row width to be searched = 361
Starting TwoSpect analysis...
:
.......A lot of STDERR output....
:
Candidates found in IHS step = 0
Initial stage done with candidates = 0
Sky location: RA = 5.88546, DEC = 0.757225
07/21/13 15:19:42 	*FSM* Got asynchronous event "CHILD_EXIT"
07/21/13 15:19:42 	*FSM* Executing transition function "reaper"
07/21/13 15:19:42 Process 14259 killed by signal 11
07/21/13 15:19:42 Process exited abnormally
07/21/13 15:19:42 	*FSM* Transitioning to state "TERMINATE"
07/21/13 15:19:42 	*FSM* Executing state func "terminate_all()" [  ]
07/21/13 15:19:42 No core file to send - probably ran out of disk
07/21/13 15:19:42 	*FSM* Transitioning to state "SEND_STATUS_ALL"
07/21/13 15:19:42 	*FSM* Executing state func "dispose_all()" [  ]
07/21/13 15:19:42 Sending final status for process 8766221.0
07/21/13 15:19:42 STATUS encoded as ABNORMAL, NO CORE
07/21/13 15:19:42 User time = 3345.310000 seconds
07/21/13 15:19:42 System time = 12.680000 seconds
07/21/13 15:19:42 condor_write(): Socket closed when trying to write 165 bytes to , fd is 17
07/21/13 15:19:42 StdUnivBuf::write(): condor_write() failed
Stack dump for process 14199 at timestamp 1374412782 (13 frames)
/usr/lib/condor/libcondor_utils_7_8_6.so(dprintf_dump_stack+0x131)[0x7fec7c2dae61]
/usr/lib/condor/libcondor_utils_7_8_6.so(+0x1dc4f2)[0x7fec7c3194f2]
/lib/libpthread.so.0(+0xeff0)[0x7fec77e29ff0]
/lib/libc.so.6(gsignal+0x35)[0x7fec77aeb1b5]
/lib/libc.so.6(abort+0x180)[0x7fec77aedfc0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7fec77ae4301]
condor_starter(REMOTE_CONDOR_reallyexit+0x1db)[0x4331ae]
condor_starter(_Z17send_final_statusP8UserProc+0x5f)[0x41263f]
condor_starter(_Z11dispose_allv+0x30)[0x413010]
condor_starter(_ZN12StateMachine7executeEv+0x210)[0x4142d0]
condor_starter(main+0x93)[0x412d43]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7fec77ad7c8d]
condor_starter[0x40cc79]
:
</snip>