
Re: [HTCondor-users] Strange Condor Behavior - Possible Bug



I have a guess about what is happening that explains all the pieces you have observed so far... see below....

On 9/30/2015 4:03 PM, Deck, William wrote:
Some follow-up information for this issue.

Looking at the Starterlog for one of the problem jobs I see the following:

[snip]

09/30/15 08:34:53 Create_Process succeeded, pid=5601

09/30/15 09:37:17 Process exited, pid=5601, status=0


Is there anything in the Starter log between the job exiting at 9:37 and the startd killing it at 10:33?



It appears to me that the job hangs on transferring output for an hour
after running to completion.  After that hour, the condor daemon copying
the data is determined to be hung and is killed.  However, we do see the
output file transferred to the schedd.  Similar behavior is observed on
all the jobs that don’t “finish”.  The behavior only seems to appear in
longer-running jobs, even though all of the jobs are set up in the same
way.


Here is my theory, along with some background.

First the background - When the starter is born, it has a TCP socket going back to the shadow process on the submit host. This TCP socket is kept open for the entire lifetime of the job. If the condor_shadow process on the submit side sees it get closed, it thinks the execute machine disappeared, which is how you get the "Job disconnected, attempting to reconnect" messages in the job event log. The starter also uses this socket to send back the exit status when the job finishes, but it is not used to transfer back files; a new file transfer TCP socket is created for that.

So my theory - Between your submit host and your execute host there is a piece of network equipment (likely a VPN, firewall, NAT box...) that is closing this TCP socket between the starter and shadow because it has not seen any traffic on it for over an hour. That would explain why the file transfer works (it uses a fresh socket) but sending the exit status does not, which is why the job is stuck in "Running" state (the shadow is still waiting to get the exit status).

In HTCondor v8.1.4 and above, a TCP socket KEEP_ALIVE option is used to try to keep firewalls and similar devices from deciding the quiet socket is dead (see https://goo.gl/R8EA1G ). But cheap or misconfigured firewalls/NATs/VPNs etc. may ignore this....
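
For anyone unfamiliar with what a TCP keep-alive option looks like at the socket level, here is a minimal, generic Python sketch. It is not HTCondor's code; the helper names (enable_keepalive, keepalive_enabled) and the timing values are made up for illustration, and the tuning constants are platform-specific, so the sketch only sets the ones the local Python build exposes.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on TCP keep-alive probes for an otherwise quiet socket.

    idle_s, interval_s and probes are illustrative values only,
    not defaults taken from HTCondor.
    """
    # Ask the kernel to send keep-alive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional tuning knobs; not every platform defines all of them,
    # so skip any constant this build of Python does not expose.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

The point of the probes is that an idle connection still produces occasional packets, so a middlebox with an idle timeout longer than the probe interval has traffic to see. A device that ignores or drops those probes will still time the connection out.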

Todd



Thanks.

--

Will Deck





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685