
[Condor-users] Flocked standard universe jobs failing



Hi,

We're seeing a strange problem where certain standard universe jobs fail between flocked pools, and we'd welcome any help. Consider two pools, call them A and B. Machines in A can successfully send standard universe jobs to other machines in A, and machines in B can likewise send them to other machines in B. With the two pools flocked, machines in pool A can successfully send such jobs to machines in B, but submissions from B to A fail. We've checked that the FLOCK_TO and FLOCK_FROM settings are correct and that, as far as we can tell, no intervening firewall is blocking traffic. The logs show the following behavior. First, the ShadowLog from the failing submit host:
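In case it helps, the flocking settings in each pool have roughly this shape (the hostnames below are placeholders, not our real machine names):

# On the pool B submit hosts: flock out to pool A's central manager
FLOCK_TO = cm.poolA.example

# On the pool A side: admit the pool B machines (referenced from our ALLOW settings)
FLOCK_FROM = *.poolB.example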

07/20/12 21:11:07 (?.?) (25282):******* Standard Shadow starting up *******
07/20/12 21:11:07 (?.?) (25282):** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
07/20/12 21:11:07 (?.?) (25282):** $CondorPlatform: x86_64_rhap_6.1-updated $
07/20/12 21:11:07 (?.?) (25282):*******************************************
07/20/12 21:11:07 (?.?) (25282):uid=0, euid=206, gid=0, egid=206
07/20/12 21:11:07 (?.?) (25282):Hostname = "<172.24.89.129:9821>", Job = 55.0
07/20/12 21:11:07 (55.0) (25282):Requesting Primary Starter
07/20/12 21:11:07 (55.0) (25282):Shadow: Request to run a job was ACCEPTED
07/20/12 21:11:07 (55.0) (25282):Shadow: RSC_SOCK connected, fd = 17
07/20/12 21:11:07 (55.0) (25282):Shadow: CLIENT_LOG connected, fd = 18
07/20/12 21:11:07 (55.0) (25282):My_Filesystem_Domain = "the.canonical.hostname"
07/20/12 21:11:07 (55.0) (25282):My_UID_Domain = "the.canonical.hostname"
07/20/12 21:11:07 (55.0) (25282):       Entering pseudo_get_file_stream
07/20/12 21:11:07 (55.0) (25282):       file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"
07/20/12 21:11:28 (55.0) (25282):       Entering pseudo_get_file_stream
07/20/12 21:11:28 (55.0) (25282):       file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"
07/20/12 21:11:49 (55.0) (25282):       Entering pseudo_get_file_stream
07/20/12 21:11:49 (55.0) (25282):       file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"
07/20/12 21:12:10 (55.0) (25282):Shadow: Job 55.0 exited, termsig = 9, coredump = 0, retcode = 0
07/20/12 21:12:10 (55.0) (25282):Shadow: Job was kicked off without a checkpoint
07/20/12 21:12:10 (55.0) (25282):Shadow: DoCleanup: unlinking TmpCkpt '/var/lib/condor/spool/55/0/cluster55.proc0.subproc0.tmp'
07/20/12 21:12:10 (55.0) (25282):Trying to unlink /var/lib/condor/spool/55/0/cluster55.proc0.subproc0.tmp
07/20/12 21:12:10 (55.0) (25282):user_time = 0 ticks
07/20/12 21:12:10 (55.0) (25282):sys_time = 0 ticks
07/20/12 21:12:10 (55.0) (25282):********** Shadow Exiting(107) **********
07/20/12 21:16:07 (55.0) (25283):connect_file_stream: accept failed (-2), errno = 98
07/20/12 21:16:28 (55.0) (25306):connect_file_stream: accept failed (-2), errno = 98

And the corresponding StarterLog fragment from the would-be execute host:

07/20/12 21:11:07   env = _condor_LOWPORT=9600 _CONDOR_SLOT=slot1 CONDOR_SCRATCH_DIR=/home/condor/execute/dir_2690 _condor_BIND_ALL_INTERFACES=FALSE _condor_HIGHPORT=9900
07/20/12 21:11:07   local_dir = dir_2690
07/20/12 21:11:07   cur_ckpt = dir_2690/condor_exec.55.0
07/20/12 21:11:07   core_name = (either 'core' or 'core.<pid>')
07/20/12 21:11:07   uid = 3385, gid = 3385
07/20/12 21:11:07   v_pid = -1
07/20/12 21:11:07   pid = (NOT CURRENTLY EXECUTING)
07/20/12 21:11:07   exit_status_valid = FALSE
07/20/12 21:11:07   exit_status = (NEVER BEEN EXECUTED)
07/20/12 21:11:07   ckpt_wanted = TRUE
07/20/12 21:11:07   coredump_limit_exists = TRUE
07/20/12 21:11:07   coredump_limit = 0
07/20/12 21:11:07   soft_kill_sig = 20
07/20/12 21:11:07   job_class = STANDARD
07/20/12 21:11:07   state = NEW
07/20/12 21:11:07   new_ckpt_created = FALSE
07/20/12 21:11:07   ckpt_transferred = FALSE
07/20/12 21:11:07   core_created = FALSE
07/20/12 21:11:07   core_transferred = FALSE
07/20/12 21:11:07   exit_requested = FALSE
07/20/12 21:11:07   image_size = -1 blocks
07/20/12 21:11:07   user_time = 0
07/20/12 21:11:07   sys_time = 0
07/20/12 21:11:07   guaranteed_user_time = 0
07/20/12 21:11:07   guaranteed_sys_time = 0
07/20/12 21:11:07 }
07/20/12 21:11:07       *FSM* Transitioning to state "GET_EXEC"
07/20/12 21:11:07       *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
07/20/12 21:11:07 Entering get_exec()
07/20/12 21:11:07 Executable is located on submitting host
07/20/12 21:11:07 Expanded executable name is "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"
07/20/12 21:11:07 Going to try 3 attempts at getting the initial executable
07/20/12 21:11:07 Entering get_file( /var/lib/condor/spool/55/cluster55.ickpt.subproc0, dir_2690/condor_exec.55.0, 0755 )
07/20/12 21:11:28 connect() failed - errno = 110
07/20/12 21:11:28 open_tcp_stream() failed
07/20/12 21:11:28 Failed to open "/var/lib/condor/spool/55/cluster55.ickpt.subproc0" remotely, errno = 110
07/20/12 21:11:28 Failed to fetch orig ckpt file "/var/lib/condor/spool/55/cluster55.ickpt.subproc0" into "dir_2690/condor_exec.55.0", attempt = 1, errno = 110
07/20/12 21:11:28 Entering get_file( /var/lib/condor/spool/55/cluster55.ickpt.subproc0, dir_2690/condor_exec.55.0, 0755 )
07/20/12 21:11:49 connect() failed - errno = 110
07/20/12 21:11:49 open_tcp_stream() failed
07/20/12 21:11:49 Failed to open "/var/lib/condor/spool/55/cluster55.ickpt.subproc0" remotely, errno = 110
07/20/12 21:11:49 Failed to fetch orig ckpt file "/var/lib/condor/spool/55/cluster55.ickpt.subproc0" into "dir_2690/condor_exec.55.0", attempt = 2, errno = 110
07/20/12 21:11:49 Entering get_file( /var/lib/condor/spool/55/cluster55.ickpt.subproc0, dir_2690/condor_exec.55.0, 0755 )
07/20/12 21:12:10 connect() failed - errno = 110
07/20/12 21:12:10 open_tcp_stream() failed
07/20/12 21:12:10 Failed to open "/var/lib/condor/spool/55/cluster55.ickpt.subproc0" remotely, errno = 110
07/20/12 21:12:10 Failed to fetch orig ckpt file "/var/lib/condor/spool/55/cluster55.ickpt.subproc0" into "dir_2690/condor_exec.55.0", attempt = 3, errno = 110
07/20/12 21:12:10       *FSM* Executing transition function "dispose_one"
07/20/12 21:12:10 Sending final status for process 55.0
07/20/12 21:12:10 STATUS encoded as CKPT, *NOT* TRANSFERRED
07/20/12 21:12:10 User time = 0.000000 seconds
07/20/12 21:12:10 System time = 0.000000 seconds
07/20/12 21:12:10 Can't unlink "dir_2690/condor_exec.55.0" - errno = 2
07/20/12 21:12:10 Removing directory "dir_2690"

That errno 110 in the StarterLog is ETIMEDOUT, which suggests the starter's connection attempts back to the submit host are timing out, yet with tcptraceroute we can confirm an open path from the execute host to ports on the submit host. Is there something else we should be checking? For what it's worth, the submit host has two NICs, and Condor runs on the one that does not correspond to its canonical hostname; we set NETWORK_INTERFACE to select that interface and BIND_ALL_INTERFACES to False.
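In case it's useful, this is roughly the sort of probe we can run from the execute host in pool A back to the submit host in pool B (the hostname is a placeholder, and the 9600-9900 range is taken from the execute-side _condor_LOWPORT/_condor_HIGHPORT in the StarterLog, on the assumption that the submit host uses the same range):

# Check whether each port in the assumed shadow port range accepts a TCP connection
for p in $(seq 9600 9900); do
    nc -z -w 5 submit.poolB.example "$p" || echo "port $p unreachable"
done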

TIA,
Bob