
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3



Your restricted port range will primarily be an issue on the machine running the schedd and shadows (i.e. the submit machine).

 http://www.cs.wisc.edu/condor/manual/v7.5/3_7Networking_includes.html#SECTION00471000000000000000

 https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageLargeCondorPools
   Each running job requires two, occasionally three, network ports on the submit machine. In 2.6 Linux, the ephemeral port range is typically 32768 through 61000, so from a single submit machine this limits you to 14000 simultaneously running jobs (more realistically 12000 in my experience). In Linux, you can increase the ephemeral port range via /proc/sys/net/ipv4/ip_local_port_range.
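
For what it's worth, a minimal sketch of checking and widening the ephemeral range on Linux (the numbers here are just an example; run as root, and persist the change in /etc/sysctl.conf if you want it to survive a reboot):

   # show the current ephemeral port range
   cat /proc/sys/net/ipv4/ip_local_port_range
   # widen it, effective immediately
   echo "10000 61000" > /proc/sys/net/ipv4/ip_local_port_range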

All you likely need is to set SHADOW.LOWPORT/SHADOW.HIGHPORT to a wider range, and of course adjust your firewalls accordingly.
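
Roughly, that is just two lines in the submit machine's local config (the particular range below is only an example; pick whatever is wide enough for ~2-3 ports per running job and allowed through your firewall):

   # restrict only the condor_shadow processes to this (wider) range
   SHADOW.LOWPORT  = 40000
   SHADOW.HIGHPORT = 45000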

You might also be interested in the shared port functionality in 7.5, discussed this year at Condor Week.
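
If you want to experiment with it, the configuration is roughly the following (this is from memory, so double-check the 7.5 manual for your exact release):

   # funnel incoming connections through one daemon listening on a single port
   USE_SHARED_PORT = True
   DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT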

Best,


matt

On 07/13/2010 11:09 AM, Philip Papadopoulos wrote:
> 
> 
> On Tue, Jul 13, 2010 at 11:08 AM, Philip Papadopoulos
> <philip.papadopoulos@xxxxxxxxx <mailto:philip.papadopoulos@xxxxxxxxx>>
> wrote:
> 
>     Gary sent me what he was doing and I was able to reproduce it locally -- I
>     believe I have figured out the problem, but need some verification from
>     Condor folks. New in this Condor roll is a setting for the range of ports
>     that Condor uses; the range defaults to 40000 - 40050. It looks like, with
>     many jobs started at once, various elements of Condor were running out of
>     ports, e.g. in the ShadowLog file:
>     07/13 10:07:57 (5.0) (10432): Request to run on compute-0-31-0.local
>     <10.1.255.218:40037> was ACCEPTED
>     07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any
>     port within (40000 ~ 40050)
>     07/13 10:07:59 (5.0) (10432): RemoteResource::killStarter(): Could
>     not send command to startd
>     07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any
>     port within (40000 ~ 40050)
>     07/13 10:07:59 (5.0) (10432): Can't connect to queue manager:
>     CEDAR:6001:Failed to connect to <198.202.88.76:40040>
> 
>     Removing the port range restriction completely on both the collector
>     and worker nodes seems to resolve the issue.
>     (Gary, you can comment out the PORTHIGH and PORTLOW commands in your
>     /opt/condor/etc/condor.config.local 
> 
> That should be LOWPORT and HIGHPORT in condor_config.local
> -P
> 
>     files and restart Condor everywhere to see if it resolves your
>     issue). (I also need to fix something in the Rocks roll to make
>     this work a little more intelligently.)
> 
>     Condor Team,
>     What would be a reasonable minimum size for the Condor port range?
> 
>     -P
> 
> 
> 
>     On Tue, Jul 13, 2010 at 9:00 AM, Philip Papadopoulos
>     <philip.papadopoulos@xxxxxxxxx
>     <mailto:philip.papadopoulos@xxxxxxxxx>> wrote:
> 
>         Gary,
>         Can you put up a tar file/instructions for me to download the same
>         thing you are doing? I will try to run it on a cluster here in
>         San Diego to see if I see the same results.
> 
>         Thanks,
>         Phil
> 
> 
> 
>         On Tue, Jul 13, 2010 at 8:56 AM, Gary Orser <garyorser@xxxxxxxxx
>         <mailto:garyorser@xxxxxxxxx>> wrote:
> 
>             Nope, added HOSTALLOW_READ, same symptom.
>             On the head node: /etc/init.d/rocks-condor restart
> 
> 
>             for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done
> 
>             Submitting job(s).
>             Logging submit event(s).
>             1 job(s) submitted to cluster 106.
>             .
>             .
>             .
>             Logging submit event(s).
>             1 job(s) submitted to cluster 137.
> 
>             WARNING: File
>             /home/orser/tests/results/ncbi++_blastp.sub.137.0.err is not
>             writable by condor.
> 
>             WARNING: File
>             /home/orser/tests/results/ncbi++_blastp.sub.137.0.out is not
>             writable by condor.
> 
>             Can't send RESCHEDULE command to condor scheduler
>             Submitting job(s)
>             ERROR: Failed to connect to local queue manager
>             CEDAR:6001:Failed to connect to <153.90.184.186:40031>
> 
> 
> 
>             On Tue, Jul 13, 2010 at 9:46 AM, Gary Orser
>             <garyorser@xxxxxxxxx <mailto:garyorser@xxxxxxxxx>> wrote:
> 
>                 [root@bugserv1 ~]# condor_config_val -dump | grep ALL
>                 ALL_DEBUG =
>                 ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
>                 ALLOW_NEGOTIATOR = $(CONDOR_HOST)
>                 ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST),
>                 $(FLOCK_NEGOTIATOR_HOSTS)
>                 ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
>                 ALLOW_READ = *
>                 ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
>                 ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
>                 ALLOW_WRITE = $(HOSTALLOW_WRITE)
>                 ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
>                 ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
>                 HOSTALLOW_WRITE = bugserv1.core.montana.edu, *.local, *.local
>                 SMALLJOB = (TARGET.ImageSize <  (15 * 1024))
> 
>                 Looks like HOSTALLOW_READ is not set.  
>                 Is that the same as ALLOW_READ?
> 
>                 On Mon, Jul 12, 2010 at 6:16 PM, Mag Gam
>                 <magawake@xxxxxxxxx <mailto:magawake@xxxxxxxxx>> wrote:
> 
>                     Can you type:
> 
>                     condor_config_val -dump ?
> 
>                     It looks like a security issue:
> 
>                     http://www.cs.wisc.edu/condor/manual/v7.4/3_6Security.html#sec:Host-Security
> 
>                     What do HOSTALLOW_READ and HOSTALLOW_WRITE look like?
> 
> 
> 
> 
> 
> 
>                     On Mon, Jul 12, 2010 at 1:12 PM, Dan Bradley
>                     <dan@xxxxxxxxxxxx <mailto:dan@xxxxxxxxxxxx>> wrote:
>                     > Gary,
>                     >
>                     > It may help to look in SchedLog to see what is
>                     happening to your
>                     > condor_schedd.
>                     >
>                     > --Dan
>                     >
>                     > Gary Orser wrote:
>                     >>
>                     >> Trying to send again ...
>                     >>
>                     >> On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser
>                     <garyorser> wrote:
>                     >>
>                     >>    Hi all,
>                     >>
>                     >>    I just upgraded my cluster from Rocks 5.1 to 5.3.
>                     >>    This upgraded Condor from 7.2.? to 7.4.2.
>                     >>
>                     >>    I've got everything running, but it won't stay up.
>                     >>    (I have had the previous configuration running with Condor for
>                     >>    years; it has done millions of hours of compute.)
>                     >>
>                     >>    I have a good repeatable test case.
>                     >>    (each job runs for a couple of minutes)
>                     >>
>                     >>    [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done
>                     >>    Submitting job(s).
>                     >>    Logging submit event(s).
>                     >>    1 job(s) submitted to cluster 24.
>                     >>    Submitting job(s).
>                     >>    Logging submit event(s).
>                     >>    1 job(s) submitted to cluster 25.
>                     >>    Submitting job(s).
>                     >>    .
>                     >>    .
>                     >>    .
>                     >>    Submitting job(s).
>                     >>    Logging submit event(s).
>                     >>    1 job(s) submitted to cluster 53.
>                     >>
>                     >>    WARNING: File
>                     /home/orser/tests/results/ncbi++_blastp.sub.53.0.err
>                     >>    is not writable by condor.
>                     >>
>                     >>    WARNING: File
>                     /home/orser/tests/results/ncbi++_blastp.sub.53.0.out
>                     >>    is not writable by condor.
>                     >>    Can't send RESCHEDULE command to condor scheduler
>                     >>    Submitting job(s)
>                     >>    ERROR: Failed to connect to local queue manager
>                     >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>                     >>    Submitting job(s)
>                     >>    ERROR: Failed to connect to local queue manager
>                     >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>                     >>    Submitting job(s)
>                     >>
>                     >>    [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
>                     >>    ####################################
>                     >>    ## run distributed blast          ##
>                     >>    ## Condor submit description file ##
>                     >>    ####################################
>                     >>    getenv      = True
>                     >>    universe    = Vanilla
>                     >>    initialdir  = /home/orser/tests
>                     >>    executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
>                     >>    input       = /dev/null
>                     >>    output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
>                     >>    WhenToTransferOutput = ON_EXIT_OR_EVICT
>                     >>    error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
>                     >>    log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
>                     >>    notification = Error
>                     >>
>                     >>    arguments   = "-db /share/data/db/nt -query /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5 -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis -outfmt 5"
>                     >>
>                     >>    queue
>                     >>
>                     >>    [root@bugserv1 etc]# condor_q
>                     >>
>                     >>    -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
>                     >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>                     >>
>                     >>
>                     >>    I can restart the head node with:
>                     >>    /etc/init.d/rocks-condor stop
>                     >>    rm -f /tmp/condor*/*
>                     >>    /etc/init.d/rocks-condor start
>                     >>
>                     >>    and the jobs that got submitted do run.
>                     >>
>                     >>    I have trawled through the archives, but
>                     haven't found anything
>                     >>    that might be useful.
>                     >>
>                     >>    I've looked at the logs, but I'm not finding any clues there.
>                     >>    I can provide them if that might be useful.
>                     >>
>                     >>    The changes from a stock install are minor.
>                     >>    (I just brought the cluster up this week)
>                     >>
>                     >>    [root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
>                     >>    20c20
>                     >>    < LOCAL_DIR = /mnt/system/condor
>                     >>    ---
>                     >>    > LOCAL_DIR = /var/opt/condor
>                     >>    27,29c27
>                     >>    < PREEMPT = True
>                     >>    < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
>                     >>    <          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
>                     >>    ---
>                     >>    > PREEMPT = False
>                     >>
>                     >>    Just a bigger volume, and an 8-hour preemption quantum.
>                     >>
>                     >>    Ideas?
>                     >>
>                     >>    --     Cheers, Gary
>                     >>    Systems Manager, Bioinformatics
>                     >>    Montana State University
>                     >>
>                     >>
>                     >>
>                     >>
>                     >> --
>                     >> Cheers, Gary
>                     >>
>                     >>
>                     >>
> 
> 
> 
> 
>                 -- 
>                 Cheers, Gary
> 
> 
> 
> 
> 
>             -- 
>             Cheers, Gary
> 
> 
> 
> 
> 
> 
> 
>         -- 
>         Philip Papadopoulos, PhD
>         University of California, San Diego
>         858-822-3628 (Ofc)
>         619-331-2990 (Fax)
> 
> 
> 
> 
>     -- 
>     Philip Papadopoulos, PhD
>     University of California, San Diego
>     858-822-3628 (Ofc)
>     619-331-2990 (Fax)
> 
> 
> 
> 
> -- 
> Philip Papadopoulos, PhD
> University of California, San Diego
> 858-822-3628 (Ofc)
> 619-331-2990 (Fax)
> 
> 
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/