
RE: [Condor-users] HIGHPORT and LOWPORT



To follow up on this issue:

Something else could be limiting the number of running jobs; I have
stumbled on it as well.
For example:
I have HIGHPORT = 29000 and LOWPORT = 9600 on the Condor master server
(RH9, 6.7.2) and on the submit hosts (Windows XP SP1, SP2, RH9). About
100 workstations are online at the moment (Windows XP SP1, Condor 6.7.2),
of which about 60 are idle and unclaimed. And yet no more than about 20
jobs can run on the whole pool at a time. If I or another user submit,
say, 100 jobs, Condor quickly matches the first 20 of them and then
reports (NegotiatorLog):
...
1/15 18:46:51       Matched 303.19 Kaliazia@xxxxxxxxxxx <134.151.145.3:9666> preempting none <134.151.149.134:9609>
1/15 18:46:51       Successfully matched with cs-357pc04.aston.ac.uk
1/15 18:46:51     Got NO_MORE_JOBS;  done negotiating
1/15 18:46:51 ---------- Finished Negotiation Cycle ----------

(This happens even though about 80 more jobs are apparently still waiting.)
After that, the negotiator matches new jobs only to replace those that
have finished successfully.

After some investigation I found several workstations in the pool that
the negotiator was persistently trying to connect to, without success.
As soon as I rebooted one of those machines, the negotiator would happily
go ahead, only to stumble on the next one. After I had restarted all of
those rogue PCs, the negotiator easily managed to match all available
resources.

All those PCs are running Condor for Windows, versions 6.7.2 and 6.7.3,
and have identical config files. My only suspicion about what went wrong
on those machines is that their ports were outside the allowed range
(which is 9600-9700). Here is an example from NegotiatorLog:

1/15 19:47:17     Request 00182.00010:
1/15 19:50:26 Can't connect to <134.151.149.224:9595>:0, errno = 110
1/15 19:50:26 Will keep trying for 10 seconds...
1/15 19:50:27 Connect failed for 10 seconds; returning FALSE
1/15 19:50:27 ERROR: SECMAN:2003:TCP connection to <134.151.149.224:9595> failed

Port 9595 is clearly out of range, but I have no idea how or why this
happens.
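
To find such machines without reading the log by eye, something like the
following rough Python sketch could help. It assumes the log lines look
like the excerpts above ("Can't connect to <ip:port>"); the function name
and the sample list are my own for illustration and are not part of Condor.

```python
import re

# Matches addresses written as <ip:port>, e.g. <134.151.149.224:9595>
ADDR_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")

def out_of_range_addresses(log_lines, lowport, highport):
    """Return sorted (ip, port) pairs taken from "Can't connect" lines
    whose port lies outside the inclusive range [lowport, highport]."""
    offenders = set()
    for line in log_lines:
        # Only failed connection attempts are of interest here.
        if "Can't connect to" not in line:
            continue
        for ip, port in ADDR_RE.findall(line):
            port = int(port)
            if not lowport <= port <= highport:
                offenders.add((ip, port))
    return sorted(offenders)

# Two lines copied from the NegotiatorLog excerpts in this message.
sample = [
    "1/15 18:46:51       Matched 303.19 Kaliazia@xxxxxxxxxxx "
    "<134.151.145.3:9666> preempting none <134.151.149.134:9609>",
    "1/15 19:50:26 Can't connect to <134.151.149.224:9595>:0, errno = 110",
]
print(out_of_range_addresses(sample, 9600, 9700))
# prints [('134.151.149.224', 9595)]
```

In practice one would pass the lines of the real NegotiatorLog instead of
the sample list. The sketch only reports the addresses; it does not explain
why a daemon ended up on such a port.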

Could the Condor developers please shed some light on this?
Is it possible to force Condor daemons to use fixed ports?
And if this situation occurs again, how can I make the negotiator skip
the rogue node and try another one?

Thanks,

Andrey Kaliazin, Computer Officer
Computer Science, Aston University, Birmingham, UK


> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Se-Chang Son
> Sent: Friday, January 14, 2005 8:21 PM
> To: Condor-Users Mail List
> Cc: Karen Miller
> Subject: Re: [Condor-users] HIGHPORT and LOWPORT
> 
> Masao Fujinaga wrote:
> > Does setting the port range using
> > HIGHPORT = 9700
> > LOWPORT = 9600
> > limit the number of jobs that can be running? With the above limits, I
> > could only get about 40 jobs to run. With the limits removed, I was able
> > to run more (62, the number of machines that I had available).
> 
> Yes. Each job requires two addresses, plus a small fixed number of
> addresses per submit machine. Therefore, you pretty much hit the wall
> at about 40 jobs. Karen Miller is adding a section to the manual that
> explains how big the address range must be.
> 
> > 
> > Masao
> > 
> > -- 
> > Masao Fujinaga | Research Computing Support
> > fujinaga@xxxxxxxxxxx | Computing and Network Services
> > Tel.: (780) 492-2117 | University of Alberta
> > Fax.: (780) 492-1729 | Edmonton, Alberta, CANADA
> > 
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
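
On the port arithmetic in the quoted reply: a rough Python sketch of the
estimate follows. It assumes the LOWPORT..HIGHPORT range is inclusive and
that each running job takes two ports on the submit machine, as stated
above. The fixed overhead per submit machine is not given in this thread,
so fixed_ports is a hypothetical parameter, and the value 21 is chosen
only to reproduce the "about 40 jobs" figure; it is not a documented
Condor number.

```python
def max_jobs(lowport, highport, ports_per_job=2, fixed_ports=0):
    """Estimate how many jobs fit into an inclusive port range.

    fixed_ports is a hypothetical per-submit-machine overhead; the
    thread says such an overhead exists but does not give its size.
    """
    usable = (highport - lowport + 1) - fixed_ports
    return max(usable // ports_per_job, 0)

# 9600..9700 inclusive is 101 ports.
print(max_jobs(9600, 9700))                  # prints 50 (no overhead at all)
print(max_jobs(9600, 9700, fixed_ports=21))  # prints 40 (assumed overhead)
```

By the same estimate, running jobs on all 62 machines from one submit host
would need at least 124 ports plus whatever the fixed overhead really is.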