Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Strange Condor Behavior - Possible Bug

Date: Thu, 01 Oct 2015 12:16:52 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Strange Condor Behavior - Possible Bug

I have a guess about what is happening that explains all the pieces you have observed so far... see below.


On 9/30/2015 4:03 PM, Deck, William wrote:

Some follow-up information for this issue.

Looking at the Starterlog for one of the problem jobs I see the following:

[snip]


09/30/15 08:34:53 Create_Process succeeded, pid=5601

09/30/15 09:37:17 Process exited, pid=5601, status=0

Is there anything in the Starter log after the job exited at 9:37 and before the startd killed it at 10:33?


It appears to me that the job is hung on transferring output for an hour
after running the job to completion.  Then after an hour the condor
daemon copying the data is determined to be hung and is killed.  However
we see the output file transferred to the schedd.  Similar behavior is
observed on all the jobs that don’t “finish”. The behavior only seems
to appear in longer-running jobs, as all of the jobs are set up in the
same way.


Here is my theory, along with some background.

First the background - When the starter is born, it has a TCP socket going back to the shadow process on the submit host. This TCP socket is kept open for the entire lifetime of the job, and if the condor_shadow process on the submit side sees it get closed, it thinks the execute machine disappeared, which is how you get the "Job disconnected, attempting to reconnect" messages in the job event log. This socket is also used by the starter to send back the exit status when the job finishes, but it is not used to transfer back files, as a new file transfer TCP socket is created for that.
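To illustrate how one end of a long-lived TCP-style connection notices that the other end has gone away, here is a minimal Python sketch. It is not HTCondor code: `shadow_end` and `starter_end` are hypothetical names for a local socket pair standing in for the shadow and starter. It only shows the clean-close case, where a read returns an empty result. A network device that silently forgets the connection may not produce a clean close at all, and the endpoints might only find out on their next send.

```python
import socket

# Hypothetical stand-ins for the two ends of the long-lived connection.
# socketpair() gives two connected stream sockets on the local machine.
shadow_end, starter_end = socket.socketpair()

# While the connection is healthy, a message sent by one end arrives at the other.
starter_end.sendall(b"exit status: 0")
message = shadow_end.recv(1024)
print("shadow received:", message)

# When the peer closes the connection, recv() returns an empty bytes object.
# A process watching the socket would treat this as "the other side is gone".
starter_end.close()
after_close = shadow_end.recv(1024)
print("peer closed:", after_close == b"")

shadow_end.close()
```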

So my theory - Between your submit host and your execute host there is a piece of network equipment (likely a VPN, firewall, NAT box...) that is closing this TCP socket between the starter and shadow because it has not seen any traffic on it in over an hour. That would explain why the file transfer works but then sending the exit status does not, which is why the job is stuck in "Running" state (waiting to get the exit status).
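As a rough sanity check of the "over an hour" part of this theory, the two StarterLog lines quoted earlier in this message can be compared directly. The sketch below is only an illustration: `parse_log_time` is a hypothetical helper, the one-hour idle limit is an assumed value for whatever device sits on the path, and it assumes the socket carried no traffic while the job ran, which the log excerpt does not confirm.

```python
from datetime import datetime, timedelta

# Timestamp layout seen in the quoted StarterLog lines: MM/DD/YY HH:MM:SS
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def parse_log_time(line):
    """Return the timestamp at the start of a log line (hypothetical helper)."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


started = parse_log_time("09/30/15 08:34:53 Create_Process succeeded, pid=5601")
exited = parse_log_time("09/30/15 09:37:17 Process exited, pid=5601, status=0")

# Time between job start and job exit, during which the socket may have been quiet.
quiet_period = exited - started

# Assumption: the network device drops connections idle for more than one hour.
idle_limit = timedelta(hours=1)

print("job ran for:", quiet_period)
print("longer than assumed idle limit:", quiet_period > idle_limit)
```

With these two log lines the job ran for 1:02:24, slightly longer than the assumed one-hour limit, which would be consistent with only the longer-running jobs being affected.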

In HTCondor v8.1.4 and above, a TCP socket KEEP_ALIVE option is used to try to keep firewalls, etc. from thinking the quiet socket is dead (see https://goo.gl/R8EA1G ). But cheap or misconfigured firewalls/NATs/VPNs, etc. may ignore this....
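For readers unfamiliar with TCP keepalive, the following Python sketch shows what enabling it on a socket looks like in general. It is not how HTCondor implements it: `enable_keepalive` is a hypothetical helper, and the timing values are arbitrary assumptions chosen for illustration. The finer-grained tuning options are platform-specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_seconds=360, interval_seconds=60, probe_count=5):
    """Turn on TCP keepalive for sock (hypothetical helper, assumed values)."""
    # Ask the OS to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional tuning knobs; availability depends on the platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Time between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

Keepalive probes carry no application data, so a device that only counts payload traffic as activity could still expire the connection, which matches the caveat above.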


Todd

Thanks.

--

Will Deck


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
