Re: [HTCondor-users] Strange Condor Behavior - Possible Bug
- Date: Thu, 01 Oct 2015 12:16:52 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Strange Condor Behavior - Possible Bug
I have a guess about what is happening that explains all the pieces you
have observed so far... see below....
On 9/30/2015 4:03 PM, Deck, William wrote:
> Some follow-up information for this issue.
> Looking at the Starterlog for one of the problem jobs I see the following:
> 09/30/15 08:34:53 Create_Process succeeded, pid=5601
> 09/30/15 09:37:17 Process exited, pid=5601, status=0
Is there anything in the Starter log between the job exiting at 9:37 and
the startd killing it at 10:33?
> It appears to me that the job is hung on transferring output for an hour
> after running the job to completion. Then after an hour the condor
> daemon copying the data is determined to be hung and is killed. However
> we see the output file transferred to the schedd. Similar behavior is
> observed on all the jobs that don’t “finish”. The behavior only seems
> to appear in longer running jobs as all of the jobs are setup in the
Here is my theory, along with some background.
First, the background. When the starter is born, it has a TCP socket
going back to the shadow process on the submit host. This TCP socket is
kept open for the entire lifetime of the job. If the condor_shadow
process on the submit side sees it get closed, it thinks the execute
machine disappeared, which is how you get the "Job disconnected,
attempting to reconnect" messages in the job event log. The starter also
uses this socket to send back the exit status when the job finishes, but
it does not use it to transfer back files; a new file transfer TCP
socket is created for that.
So my theory - Between your submit host and your execute host there is a
piece of network equipment (likely a VPN, firewall, NAT box...) that is
closing this TCP socket between the starter and shadow because it has
not seen any traffic on it in over an hour. That would explain why the
file transfer works but then sending the exit status does not, which is
why the job is stuck in "Running" state (waiting to get the exit status).
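As a rough check on the timing, the StarterLog lines quoted above put the
job's run time at just over an hour, which fits a device that drops
connections after about an hour of silence. Here is a minimal Python
sketch of that arithmetic. The helper names are made up for illustration,
and it assumes the starter-to-shadow socket was quiet for the whole run,
which the log lines alone do not prove.

```python
from datetime import datetime

# Timestamp layout of the StarterLog lines quoted in this thread.
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def parse_log_time(line):
    """Return the timestamp at the start of a 'MM/DD/YY HH:MM:SS msg' line."""
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(first_line, second_line):
    """Elapsed seconds between two log lines (hypothetical helper)."""
    delta = parse_log_time(second_line) - parse_log_time(first_line)
    return delta.total_seconds()


created = "09/30/15 08:34:53 Create_Process succeeded, pid=5601"
exited = "09/30/15 09:37:17 Process exited, pid=5601, status=0"

print(seconds_between(created, exited))  # 3744.0 seconds, about 62 minutes
```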
In HTCondor v8.1.4 and above, a TCP socket KEEP_ALIVE option is used to
try and keep firewalls etc from thinking the quiet socket is dead (see
https://goo.gl/R8EA1G ). But cheap or misconfigured firewalls/NATs/VPNs
etc may ignore this....
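For readers unfamiliar with TCP keepalive, here is a minimal Python
sketch of turning it on for a socket. This is not HTCondor's code, just
an illustration of the operating system option involved. The function
name and the timing values are assumptions picked for the example, and
the three tuning options are only set where the platform exposes them
(they are available on Linux).

```python
import socket


def enable_keepalive(sock, idle=300, interval=60, count=5):
    """Turn on TCP keepalive for sock (illustrative values, not HTCondor's)."""
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket tuning; guarded because not every OS defines these.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_on)
sock.close()
```

The idle value has to be shorter than the timeout of whatever device sits
in the path, otherwise the connection is dropped before the first probe
goes out. A device that ignores keepalive probes will still drop the
connection regardless.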
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685