Re: [Condor-users] SchedLog: job submission timed out....port problem?
- Date: Fri, 10 Sep 2010 13:52:42 -0600
- From: "Burnett, Ben" <ben.burnett@xxxxxxxx>
Hi Rob:
Does Condor fail in the same way if the firewall software on the Windows machine is disabled?
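One way to narrow this down, with the firewall enabled and then disabled, is to try a plain TCP connection from the Linux machine to the port the startd advertises. The following is only a rough sketch, assuming Python 3 is available on the Linux side; `can_connect` is a hypothetical helper name and not part of Condor.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    A timeout or any other socket error (for example "connection refused")
    is reported as False.
    """
    try:
        # create_connection performs the TCP handshake; the "with" block
        # closes the socket again as soon as the connection is established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("115.145.228.20", 1048)` uses the address and port from the SchedLog excerpt quoted below. If it returns False while the firewall is on and True once it is off, that would be consistent with the firewall dropping the schedd's connection. The startd may be listening on a different port by the time you test, so check the current address first.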
-B
On 2010-09-10, at 7:32 AM, Rob wrote:
>
> Hi,
>
> I'm baffled!
>
> A job has not run for days, although the negotiator matches
> the job to the specified machine (which is in the Unclaimed
> state). The apparent reason is broken communication;
> I suspected a firewall problem (see my earlier message below).
>
> Then suddenly days later the job does start running. SchedLog:
>
> 09/10 10:31:15 (pid:2109) attempt to connect to <115.145.228.20:1048> failed:
> Connection timed out (connect errno = 110).
> 09/10 10:31:15 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@2-4-1
> <115.145.228.20:1048> for user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to
> startd slot1@2-4-1 <115.145.228.20:1048> for user@xxxxxxxxxxxxxx failed.
> 09/10 10:31:15 (pid:2109) Match record (slot1@2-4-1 <115.145.228.20:1048> for
> user@xxxxxxxxxxxxxx, 250.0) deleted
> 09/10 10:31:40 (pid:2109) Completed REQUEST_CLAIM to startd slot1@2-4-1
> <115.145.228.20:4961> for user@xxxxxxxxxxxxxx
> 09/10 10:31:40 (pid:2109) Started shadow for job 250.0 on slot1@2-4-1
> <115.145.228.20:4961> for user@xxxxxxxxxxxxxx, (shadow pid = 3739)
> 09/10 15:19:01 (pid:2109) match (slot1@2-4-1 <115.145.228.20:4961> for
> user@xxxxxxxxxxxxxx) out of jobs; relinquishing
> 09/10 15:19:01 (pid:2109) Completed RELEASE_CLAIM to startd at
> <115.145.228.20:4961>
> 09/10 15:19:01 (pid:2109) Match record (slot1@2-4-1 <115.145.228.20:4961> for
> user@xxxxxxxxxxxxxx, 250.-1) deleted
>
>
> Why is communication with this Unclaimed machine blocked for days, and
> then the job submission suddenly works?
>
> The "Failed to send REQUEST_CLAIM" happened with ports 1053 and 1048:
>
> Failed to send REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:1053> for
> user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@2-4-1
> <115.145.228.20:1053> for user@xxxxxxxxxxxxxx failed.
> Failed to send REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:1048> for
> user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@2-4-1
> <115.145.228.20:1048> for user@xxxxxxxxxxxxxx failed.
>
>
> The "Completed REQUEST_CLAIM" happened with port 4961:
>
> Completed REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:4961> for
> user@xxxxxxxxxxxxxx
>
>
> What conclusion should I draw from this?
> Any suggestions?
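To see the pattern at a glance, the claim attempts can be tallied per address and port. This is a hedged sketch, not a Condor tool: `summarize_claims` and `CLAIM_RE` are made-up names, and the pattern assumes each SchedLog entry sits on a single line, as in the log file itself (the excerpts quoted here have been wrapped by the mail client).

```python
import re
from collections import defaultdict

# Matches the two REQUEST_CLAIM outcomes seen in the quoted SchedLog and
# captures the startd address that follows the slot name.
CLAIM_RE = re.compile(
    r"(?P<outcome>Failed to send|Completed) REQUEST_CLAIM to startd "
    r"(?P<slot>\S+) <(?P<ip>[\d.]+):(?P<port>\d+)>"
)


def summarize_claims(log_text):
    """Count failed and completed REQUEST_CLAIM entries per (ip, port)."""
    summary = defaultdict(lambda: {"failed": 0, "completed": 0})
    for line in log_text.splitlines():
        match = CLAIM_RE.search(line)
        if match is None:
            continue  # not a REQUEST_CLAIM entry
        key = (match.group("ip"), int(match.group("port")))
        if match.group("outcome") == "Completed":
            summary[key]["completed"] += 1
        else:
            summary[key]["failed"] += 1
    return dict(summary)
```

On the entries quoted above, this would report failures on ports 1048 and 1053 and a success on port 4961, all for the same address. One possible reading is that the startd was listening on a different port by the time the claim succeeded, but the log alone does not say why.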
>
>
> Thanks,
> Rob.
>
>
> ----------------------------------------------------------------
> On Wed, 8 Sep 2010 Rob wrote:
>
> Hi,
>
> I use a Linux master PC.
> I have a Windows pool PC (ip = 115.145.228.26 or name = "3-4")
> which is in the Unclaimed state.
> All are running Condor 7.4.3.
>
> When I submit a Vanilla job, the NegotiatorLog tells me that the match is OK.
>
> The SchedLog then has the following entries:
>
> 09/09 12:54:25 (pid:2109) attempt to connect to <115.145.228.26:1042> failed:
> Connection timed out (connect errno = 110). Will keep trying for 45 total
> seconds (24 to go).
> 09/09 12:54:50 (pid:2109) attempt to connect to <115.145.228.26:1042> failed:
> Connection timed out (connect errno = 110).
> 09/09 12:54:50 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@3-4
> <115.145.228.26:1042> for user@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to
> startd slot1@3-4 <115.145.228.26:1042> for user@xxxxxxxxxxxxxxx failed.
> 09/09 12:54:50 (pid:2109) Match record (slot1@3-4 <115.145.228.26:1042> for
> user@xxxxxxxxxxxxxxx, 247.0) deleted
>
> Apparently the network communication is not working.
> Can somebody tell me what communication or firewall rule
> is actually missing from these messages in SchedLog?
>
>
> The (linux) master does get the status info and it can
> get the Windows log files with condor_fetchlog.
>
> The firewall on the Windows PC is a commercial Korean product
> (V3 from Ahnlab). I have allowed as firewall exceptions:
> condor_dagman.exe
> condor_kbdd.exe
> condor_master.exe
> condor_startd.exe
> condor_starter.exe
> condor_vm-gahp.exe
> condor_preen.exe
>
> It seems that this is not enough to allow full Condor communication.
>
> Thanks.
> Rob.
>
>
>
>
--
Ben Burnett
Department of Math & Computer Science
Optimization Research Group
University of Lethbridge
http://optimization.cs.uleth.ca
"I am against religion because it teaches us to be satisfied with not understanding the world."
- Richard Dawkins