Re: [Condor-users] SchedLog: job submission timed out....port problem?



Hi Rob:

Does Condor fail in the same way if the firewall software on the Windows machine is disabled?
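
If it helps, one quick way to check from the Linux submit machine whether the startd's port is reachable at all is a plain TCP connect, independent of Condor. A minimal sketch in Python; the function name port_reachable and the 5-second timeout are my own choices, not anything Condor provides:

```python
import socket

def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    A timeout (as in the SchedLog 'errno = 110' lines) or a refused
    connection both count as not reachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False

def report(host, ports, timeout=5.0):
    """Print one line per port saying whether the TCP connect worked."""
    for port in ports:
        ok = port_reachable(host, port, timeout)
        print(f"{host}:{port} {'reachable' if ok else 'blocked or closed'}")
```

For example, report("115.145.228.20", [1048, 1053, 4961]) would try the three ports that appear in your SchedLog. Running it once with the firewall enabled and once with it disabled should show whether the firewall is what drops the connection. Note that the startd may be listening on a different port by the time you test, so check the current address in condor_status first.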

-B

On 2010-09-10, at 7:32 AM, Rob wrote:

> 
> Hi,
> 
> I'm baffled!
> 
> A job did not run for days, although the negotiator matched
> the job to the specified machine (which was in the Unclaimed
> state). The apparent reason is broken communication;
> I suspected a firewall problem (see my earlier message below).
> 
> Then suddenly, days later, the job did start running. SchedLog:
> 
> 09/10 10:31:15 (pid:2109) attempt to connect to <115.145.228.20:1048> failed: 
> Connection timed out (connect errno = 110).
> 09/10 10:31:15 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@2-4-1 
> <115.145.228.20:1048> for user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to 
> startd slot1@2-4-1 <115.145.228.20:1048> for user@xxxxxxxxxxxxxx failed.
> 09/10 10:31:15 (pid:2109) Match record (slot1@2-4-1 <115.145.228.20:1048> for 
> user@xxxxxxxxxxxxxx, 250.0) deleted
> 09/10 10:31:40 (pid:2109) Completed REQUEST_CLAIM to startd slot1@2-4-1 
> <115.145.228.20:4961> for user@xxxxxxxxxxxxxx
> 09/10 10:31:40 (pid:2109) Started shadow for job 250.0 on slot1@2-4-1 
> <115.145.228.20:4961> for user@xxxxxxxxxxxxxx, (shadow pid = 3739)
> 09/10 15:19:01 (pid:2109) match (slot1@2-4-1 <115.145.228.20:4961> for 
> user@xxxxxxxxxxxxxx) out of jobs; relinquishing
> 09/10 15:19:01 (pid:2109) Completed RELEASE_CLAIM to startd at 
> <115.145.228.20:4961>
> 09/10 15:19:01 (pid:2109) Match record (slot1@2-4-1 <115.145.228.20:4961> for 
> user@xxxxxxxxxxxxxx, 250.-1) deleted
> 
> 
> Why is communication with this Unclaimed machine blocked for days, and
> then suddenly the job submission works?
> 
> The "Failed to send REQUEST_CLAIM" happened with ports 1053 and 1048:
> 
> Failed to send REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:1053> for
>   user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@2-4-1
>   <115.145.228.20:1053> for user@xxxxxxxxxxxxxx failed.
> Failed to send REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:1048> for
>   user@xxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@2-4-1
>   <115.145.228.20:1048> for user@xxxxxxxxxxxxxx failed.
> 
> 
> The "Completed REQUEST_CLAIM" happened with port 4961:
> 
> Completed REQUEST_CLAIM to startd slot1@2-4-1 <115.145.228.20:4961> for 
> user@xxxxxxxxxxxxxx
> 
> 
> What conclusion should I draw from this?
> Any suggestions?
> 
> 
> Thanks,
> Rob.
> 
> 
> ----------------------------------------------------------------
> On Wed, 8 Sep 2010 Rob wrote:
> 
> Hi,
> 
> I use a Linux master PC.
> I have a Windows pool PC (ip = 115.145.228.26 or name = "3-4")
> which is in the Unclaimed state.
> All are running Condor 7.4.3.
> 
> When I submit a vanilla job, the NegotiatorLog tells me that the match is OK.
> 
> The SchedLog then has the following entries:
> 
> 09/09 12:54:25 (pid:2109) attempt to connect to <115.145.228.26:1042> failed: 
> Connection timed out (connect errno = 110).  Will keep trying for 45 total 
> seconds (24 to go).
> 09/09 12:54:50 (pid:2109) attempt to connect to <115.145.228.26:1042> failed: 
> Connection timed out (connect errno = 110).
> 09/09 12:54:50 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@3-4 
> <115.145.228.26:1042> for user@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to 
> startd slot1@3-4 <115.145.228.26:1042> for user@xxxxxxxxxxxxxxx failed.
> 09/09 12:54:50 (pid:2109) Match record (slot1@3-4 <115.145.228.26:1042> for 
> user@xxxxxxxxxxxxxxx, 247.0) deleted
> 
> Apparently the network communication is not working.
> Can somebody tell me, from these SchedLog messages, which
> communication or firewall rule is actually missing?
> 
> 
> The (Linux) master does get the status info, and it can
> fetch the Windows log files with condor_fetchlog.
> 
> The firewall on the Windows PC is a commercial Korean product
> (V3 from Ahnlab). I have allowed as firewall exceptions:
>  condor_dagman.exe
>  condor_kbdd.exe
>  condor_master.exe
>  condor_startd.exe
>  condor_starter.exe
>  condor_vm-gahp.exe
>  condor_preen.exe
> 
> It seems that this is not enough to allow full Condor communication.
> 
> Thanks.
> Rob.
> 
> 
> 
> 

--
Ben Burnett
Department of Math & Computer Science
Optimization Research Group
University of Lethbridge
http://optimization.cs.uleth.ca

"I am against religion because it teaches us to be satisfied with not understanding the world."
- Richard Dawkins