
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3




There is a discrepancy between the schedd port requirements that I quoted and what Matt quoted. Ironically, he was quoting me!

The manual basically says you need 5 * running_jobs ports. For long-running jobs (around 6 hours), I have found the port requirements to be less than 3 * running_jobs, and that is what I wrote on the wiki page. For short jobs, however, the churn of condor_shadow processes coming and going leaves a non-negligible number of ports sitting in the TIME_WAIT state at any given time, so it may be better to be conservative and use the Condor manual's higher estimate of 5 * running_jobs. I'll update the wiki to explain this.
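To make the arithmetic concrete, using the default 40000 - 40050 range from Phil's message below:

    ports available   = 40050 - 40000 + 1 = 51
    manual's estimate : 5 + (5 * 9)  = 50   (fits)
                        5 + (5 * 10) = 55   (does not fit)

So a 51-port range supports at most about 9 concurrently running jobs by the manual's formula, and only about 17 even by the looser 3 * running_jobs figure.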

Fortunately, in Condor 7.5.3, this issue of short jobs effectively chewing up more ports has been largely removed, because the shadow process is reused to run multiple jobs.

--Dan

Dan Bradley wrote:

In 7.5.0, there is a new feature condor_shared_port, which can be used to reduce incoming port usage down to a port or two. Otherwise, the number of ports required is quite large. The following page in the manual gives recommendations:

http://www.cs.wisc.edu/condor/manual/v7.4/3_7Networking_includes.html#SECTION00471400000000000000

The most important statement is this:

"The Submit machines (those machines running a condor_schedd daemon) require 5 + (5 * MAX_JOBS_RUNNING) ports."

If you are only concerned about restricting the incoming (or outgoing) port range, then I'd recommend setting IN/OUT_LOWPORT and IN/OUT_HIGHPORT rather than LOWPORT and HIGHPORT.
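For example, here is a minimal sketch of what that might look like on a submit machine. The numbers are only illustrative, sized for MAX_JOBS_RUNNING = 200 using the formula above; check the manual for the exact settings in your version:

    # restrict only the incoming port range; leave outgoing ports unrestricted
    IN_LOWPORT  = 40000
    IN_HIGHPORT = 41100    # 1101 ports > 5 + (5 * 200) = 1005
    # optionally restrict outgoing ports with a separate range
    #OUT_LOWPORT  = 42000
    #OUT_HIGHPORT = 43100

In 7.5.x, the condor_shared_port daemon mentioned above can shrink the incoming side further, down to a port or two; again only as a sketch, see the manual for details:

    USE_SHARED_PORT = True
    DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT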

--Dan

Philip Papadopoulos wrote:
Gary sent me what he was doing and I was able to see it locally -- I believe I have figured out the
problem, but need some verification from Condor folks.
New to this Condor roll is setting the range of ports that Condor uses; the range defaults to 40000 - 40050. It looks like, with many jobs started at once, various elements of Condor were running out of ports.
E.g., in the ShadowLog file:
07/13 10:07:57 (5.0) (10432): Request to run on compute-0-31-0.local <10.1.255.218:40037> was ACCEPTED
07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any port within (40000 ~ 40050)
07/13 10:07:59 (5.0) (10432): RemoteResource::killStarter(): Could not send command to startd
07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any port within (40000 ~ 40050)
07/13 10:07:59 (5.0) (10432): Can't connect to queue manager: CEDAR:6001:Failed to connect to <198.202.88.76:40040>

Removing the port range restriction completely on both the collector and worker nodes seems to resolve the issue. (Gary, you can comment out the LOWPORT and HIGHPORT settings in your /opt/condor/etc/condor_config.local files and restart Condor everywhere to see if that resolves your issue.) (I also need to fix something in the Rocks roll to make this work a little more intelligently.)
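A sketch of that edit, assuming the roll uses Condor's standard LOWPORT/HIGHPORT settings (the 200-job sizing below is just an illustration):

    # /opt/condor/etc/condor_config.local on every node -- either remove the
    # restriction by commenting out LOWPORT/HIGHPORT entirely, or keep a
    # restricted but much larger range, sized from the manual's
    # 5 + (5 * MAX_JOBS_RUNNING) estimate (~200 running jobs needs > 1005 ports):
    LOWPORT  = 40000
    HIGHPORT = 41100

followed by a Condor restart on each node (e.g. /etc/init.d/rocks-condor restart).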

Condor Team,
What seems to be a reasonable minimal size of the Condor Port range?

-P


On Tue, Jul 13, 2010 at 9:00 AM, Philip Papadopoulos <philip.papadopoulos@xxxxxxxxx> wrote:

    Gary,
    Can you put up a tar file and instructions for me to download the same
    thing you are doing? I will try to run it on a cluster here in
    San Diego to see if I see the same results.

    Thanks,
    Phil



    On Tue, Jul 13, 2010 at 8:56 AM, Gary Orser <garyorser@xxxxxxxxx> wrote:

        Nope, I added HOSTALLOW_READ and see the same symptom
        after running /etc/init.d/rocks-condor restart on the head node.


        for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done

        Submitting job(s).
        Logging submit event(s).
        1 job(s) submitted to cluster 106.
        .
        .
        .
        Logging submit event(s).
        1 job(s) submitted to cluster 137.

        WARNING: File
        /home/orser/tests/results/ncbi++_blastp.sub.137.0.err is not
        writable by condor.

        WARNING: File
        /home/orser/tests/results/ncbi++_blastp.sub.137.0.out is not
        writable by condor.

        Can't send RESCHEDULE command to condor scheduler
        Submitting job(s)
        ERROR: Failed to connect to local queue manager
        CEDAR:6001:Failed to connect to <153.90.184.186:40031>



        On Tue, Jul 13, 2010 at 9:46 AM, Gary Orser <garyorser@xxxxxxxxx> wrote:

            [root@bugserv1 ~]# condor_config_val -dump | grep ALL
            ALL_DEBUG =
            ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
            ALLOW_NEGOTIATOR = $(CONDOR_HOST)
            ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
            ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
            ALLOW_READ = *
            ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
            ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
            ALLOW_WRITE = $(HOSTALLOW_WRITE)
            ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
            ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
            HOSTALLOW_WRITE = bugserv1.core.montana.edu, *.local, *.local
            SMALLJOB = (TARGET.ImageSize <  (15 * 1024))

Looks like HOSTALLOW_READ is not set. Is that the same as ALLOW_READ?

            On Mon, Jul 12, 2010 at 6:16 PM, Mag Gam <magawake@xxxxxxxxx> wrote:

                can you type,

                condor_config_val -dump ?

                Looks like a security issue,

http://www.cs.wisc.edu/condor/manual/v7.4/3_6Security.html#sec:Host-Security

                What do HOSTALLOW_READ and HOSTALLOW_WRITE look like?






                On Mon, Jul 12, 2010 at 1:12 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
                > Gary,
                >
                > It may help to look in SchedLog to see what is
                happening to your
                > condor_schedd.
                >
                > --Dan
                >
                > Gary Orser wrote:
                >>
                >> Trying sending again ...
                >>
                >> On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser
                <garyorser> wrote:
                >>
                >>    Hi all,
                >>
                >>    I just upgraded my cluster from Rocks 5.1 to 5.3.
                >>    This upgraded Condor from 7.2.? to 7.4.2.
                >>
                >>    I've got everything running, but it won't stay up.
                >>    (I have had the previous configuration running
                with condor for
                >>    years, done millions of hours of compute)
                >>
                >>    I have a good repeatable test case.
                >>    (each job runs for a couple of minutes)
                >>
                >>    [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done
                >>    Submitting job(s).
                >>    Logging submit event(s).
                >>    1 job(s) submitted to cluster 24.
                >>    Submitting job(s).
                >>    Logging submit event(s).
                >>    1 job(s) submitted to cluster 25.
                >>    Submitting job(s).
                >>    .
                >>    .
                >>    .
                >>    Submitting job(s).
                >>    Logging submit event(s).
                >>    1 job(s) submitted to cluster 53.
                >>
                >>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err is not writable by condor.
                >>
                >>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out is not writable by condor.
                >>    Can't send RESCHEDULE command to condor scheduler
                >>    Submitting job(s)
                >>    ERROR: Failed to connect to local queue manager
                >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
                >>    Submitting job(s)
                >>    ERROR: Failed to connect to local queue manager
                >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
                >>    Submitting job(s)
                >>
                >>    [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
                >>    ####################################
                >>    ## run distributed blast          ##
                >>    ## Condor submit description file ##
                >>    ####################################
                >>    getenv      = True
                >>    universe    = Vanilla
                >>    initialdir  = /home/orser/tests
                >>    executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
                >>    input       = /dev/null
                >>    output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
                >>    WhenToTransferOutput = ON_EXIT_OR_EVICT
                >>    error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
                >>    log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
                >>    notification = Error
                >>
                >>    arguments   = "-db /share/data/db/nt -query /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5 -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis -outfmt 5"
                >>
                >>    queue
                >>
                >>    [root@bugserv1 etc]# condor_q
                >>
                >>    -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
                >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
                >>
                >>
                >>    I can restart the head node with:
                >>    /etc/init.d/rocks-condor stop
                >>    rm -f /tmp/condor*/*
                >>    /etc/init.d/rocks-condor start
                >>
                >>    and the jobs that got submitted do run.
                >>
                >>    I have trawled through the archives, but haven't
                found anything
                >>    that might be useful.
                >>
                >>    I've looked at the logs, but not finding any
                clues there.
                >>    I can provide them if that might be useful.
                >>
                >>    The changes from a stock install are minor.
                >>    (I just brought the cluster up this week)
                >>
                >>    [root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
                >>    20c20
                >>    < LOCAL_DIR = /mnt/system/condor
                >>    ---
                >>    > LOCAL_DIR = /var/opt/condor
                >>    27,29c27
                >>    < PREEMPT = True
                >>    < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
                >>    <          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
                >>    ---
                >>    > PREEMPT = False
                >>
                >>    Just a bigger volume, and an 8-hour preemption quantum.
                >>
                >>    Ideas?
                >>
                >>    --
                >>    Cheers, Gary
                >>    Systems Manager, Bioinformatics
                >>    Montana State University
                >>
                >>
                >>
                >>
                >> --
                >> Cheers, Gary
                >>
                >>
                >>




            --
            Cheers, Gary





        --
        Cheers, Gary







    --
    Philip Papadopoulos, PhD
    University of California, San Diego
    858-822-3628 (Ofc)
    619-331-2990 (Fax)




--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
------------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/