
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3




There is a discrepancy between the schedd port requirements that I quoted and what Matt quoted. Ironically, he was quoting me!

The manual basically says you need 5 * running_jobs ports. For long-running jobs (around 6 hours), I have found the port requirements to be less than 3 * running_jobs, and that is what I wrote on the wiki page. For short jobs, however, the churn of condor_shadow processes coming and going leaves a non-negligible number of ports sitting in the TIME_WAIT state at any given time, so it may be better to be conservative and use the Condor manual's higher estimate of 5 * running_jobs. I'll update the wiki to explain this.
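To make the arithmetic concrete, using the default 40000 - 40050 range from Phil's message below:

    ports available   = 40050 - 40000 + 1 = 51
    manual's estimate : 5 + (5 * 9)  = 50   (fits)
                        5 + (5 * 10) = 55   (does not fit)

So a 51-port range supports at most about 9 concurrently running jobs by the manual's formula, and only about 17 even by the looser 3 * running_jobs figure.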

Fortunately, in Condor 7.5.3, this issue of short jobs effectively chewing up more ports has been largely removed, because the shadow process is reused to run multiple jobs.

--Dan

Dan Bradley wrote:

In 7.5.0, there is a new feature condor_shared_port, which can be used to reduce incoming port usage down to a port or two. Otherwise, the number of ports required is quite large. The following page in the manual gives recommendations:

http://www.cs.wisc.edu/condor/manual/v7.4/3_7Networking_includes.html#SECTION00471400000000000000

The most important statement is this:

"The Submit machines (those machines running a condor_schedd daemon) require 5 + (5 * MAX_JOBS_RUNNING) ports."

If you are only concerned about restricting the incoming (or outgoing) port range, then I'd recommend setting IN/OUT_LOWPORT and IN/OUT_HIGHPORT rather than LOWPORT and HIGHPORT.
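For example, here is a minimal sketch of what that might look like on a submit machine. The numbers are only illustrative, sized for MAX_JOBS_RUNNING = 200 using the formula above; check the manual for the exact settings in your version:

    # restrict only the incoming port range; leave outgoing ports unrestricted
    IN_LOWPORT  = 40000
    IN_HIGHPORT = 41100    # 1101 ports > 5 + (5 * 200) = 1005
    # optionally restrict outgoing ports with a separate range
    #OUT_LOWPORT  = 42000
    #OUT_HIGHPORT = 43100

In 7.5.x, the condor_shared_port daemon mentioned above can shrink the incoming side further, down to a port or two; again only as a sketch, see the manual for details:

    USE_SHARED_PORT = True
    DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT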

--Dan

Philip Papadopoulos wrote:
Gary sent me what he was doing and I was able to see it locally -- I believe I have figured out the
problem, but need some verification from Condor folks.
New to this Condor roll is setting the range of ports that Condor uses; the range defaults to 40000 - 40050. It looks like, with many jobs started at once, various elements of Condor were running out of ports.
E.g., in the ShadowLog file:
07/13 10:07:57 (5.0) (10432): Request to run on compute-0-31-0.local <10.1.255.218:40037> was ACCEPTED
07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any port within (40000 ~ 40050)
07/13 10:07:59 (5.0) (10432): RemoteResource::killStarter(): Could not send command to startd
07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any port within (40000 ~ 40050)
07/13 10:07:59 (5.0) (10432): Can't connect to queue manager: CEDAR:6001:Failed to connect to <198.202.88.76:40040>

Removing the port range restriction completely on both the collector and worker nodes seems to resolve the issue. (Gary, you can comment out the LOWPORT and HIGHPORT settings in your /opt/condor/etc/condor_config.local files and restart Condor everywhere to see if that resolves your issue.) (I also need to fix something in the Rocks roll to make this work a little more intelligently.)
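A sketch of that edit, assuming the roll uses Condor's standard LOWPORT/HIGHPORT settings (the 200-job sizing below is just an illustration):

    # /opt/condor/etc/condor_config.local on every node -- either remove the
    # restriction by commenting out LOWPORT/HIGHPORT entirely, or keep a
    # restricted but much larger range, sized from the manual's
    # 5 + (5 * MAX_JOBS_RUNNING) estimate (~200 running jobs needs > 1005 ports):
    LOWPORT  = 40000
    HIGHPORT = 41100

followed by a Condor restart on each node (e.g. /etc/init.d/rocks-condor restart).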

Condor Team,
What seems to be a reasonable minimal size of the Condor Port range?

-P


On Tue, Jul 13, 2010 at 9:00 AM, Philip Papadopoulos <philip.papadopoulos@xxxxxxxxx> wrote:

    Gary,
    Can you put up a tar file and instructions for me to download the same
    thing you are doing? I will try to run it on a cluster here in
    San Diego to see if I see the same results.

    Thanks,
    Phil



    On Tue, Jul 13, 2010 at 8:56 AM, Gary Orser <garyorser@xxxxxxxxx> wrote:

        Nope, I added HOSTALLOW_READ and see the same symptom
        after running /etc/init.d/rocks-condor restart on the head node.


        for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done

        Submitting job(s).
        Logging submit event(s).
        1 job(s) submitted to cluster 106.
        .
        .
        .
        Logging submit event(s).
        1 job(s) submitted to cluster 137.

        WARNING: File
        /home/orser/tests/results/ncbi++_blastp.sub.137.0.err is not
        writable by condor.

        WARNING: File
        /home/orser/tests/results/ncbi++_blastp.sub.137.0.out is not
        writable by condor.

        Can't send RESCHEDULE command to condor scheduler
        Submitting job(s)
        ERROR: Failed to connect to local queue manager
        CEDAR:6001:Failed to connect to <153.90.184.186:40031>



        On Tue, Jul 13, 2010 at 9:46 AM, Gary Orser <garyorser@xxxxxxxxx> wrote:

            [root@bugserv1 ~]# condor_config_val -dump | grep ALL
            ALL_DEBUG =
            ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
            ALLOW_NEGOTIATOR = $(CONDOR_HOST)
            ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
            ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
            ALLOW_READ = *
            ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
            ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
            ALLOW_WRITE = $(HOSTALLOW_WRITE)
            ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
            ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
            HOSTALLOW_WRITE = bugserv1.core.montana.edu, *.local, *.local
            SMALLJOB = (TARGET.ImageSize <  (15 * 1024))

Looks like HOSTALLOW_READ is not set. Is that the same as ALLOW_READ?

            On Mon, Jul 12, 2010 at 6:16 PM, Mag Gam <magawake@xxxxxxxxx> wrote:

                can you type,

                condor_config_val -dump ?

                Looks like a security issue,

http://www.cs.wisc.edu/condor/manual/v7.4/3_6Security.html#sec:Host-Security

                What do HOSTALLOW_READ and HOSTALLOW_WRITE look like?






                On Mon, Jul 12, 2010 at 1:12 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
                > Gary,
                >
                > It may help to look in SchedLog to see what is
                happening to your
                > condor_schedd.
                >
                > --Dan
                >
                > Gary Orser wrote:
                >>
                >> Trying sending again ...
                >>
                >> On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser
                <garyorser> wrote:
                >>
                >>    Hi all,
                >>
                >>    I just upgraded my cluster from Rocks 5.1 to 5.3.
                >>    This upgraded Condor from 7.2.? to 7.4.2.
                >>
                >>    I've got everything running, but it won't stay up.
                >>    (I have had the previous configuration running
                with condor for
                >>    years, done millions of hours of compute)
                >>
                >>    I have a good repeatable test case.
                >>    (each job runs for a couple of minutes)
                >>
                >>    [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done
                >>    Submitting job(s).
                >>    Logging submit event(s).
                >>    1 job(s) submitted to cluster 24.
                >>    Submitting job(s).
                >>    Logging submit event(s).
                >>    1 job(s) submitted to cluster 25.
                >>    Submitting job(s).
                >>    .
                >>    .
                >>    .
                >>    Submitting job(s).
                >>    Logging submit event(s).
                >>    1 job(s) submitted to cluster 53.
                >>
                >>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err is not writable by condor.
                >>
                >>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out is not writable by condor.
                >>    Can't send RESCHEDULE command to condor scheduler
                >>    Submitting job(s)
                >>    ERROR: Failed to connect to local queue manager
                >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
                >>    Submitting job(s)
                >>    ERROR: Failed to connect to local queue manager
                >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
                >>    Submitting job(s)
                >>
                >>    [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
                >>    ####################################
                >>    ## run distributed blast          ##
                >>    ## Condor submit description file ##
                >>    ####################################
                >>    getenv      = True
                >>    universe    = Vanilla
                >>    initialdir  = /home/orser/tests
                >>    executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
                >>    input       = /dev/null
                >>    output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
                >>    WhenToTransferOutput = ON_EXIT_OR_EVICT
                >>    error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
                >>    log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
                >>    notification = Error
                >>
                >>    arguments   = "-db /share/data/db/nt -query /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5 -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis -outfmt 5"
                >>
                >>    queue
                >>
                >>    [root@bugserv1 etc]# condor_q
                >>
                >>    -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
                >>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
                >>
                >>
                >>    I can restart the head node with:
                >>    /etc/init.d/rocks-condor stop
                >>    rm -f /tmp/condor*/*
                >>    /etc/init.d/rocks-condor start
                >>
                >>    and the jobs that got submitted do run.
                >>
                >>    I have trawled through the archives, but haven't
                found anything
                >>    that might be useful.
                >>
                >>    I've looked at the logs, but not finding any
                clues there.
                >>    I can provide them if that might be useful.
                >>
                >>    The changes from a stock install are minor.
                >>    (I just brought the cluster up this week)
                >>
                >>    [root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
                >>    20c20
                >>    < LOCAL_DIR = /mnt/system/condor
                >>    ---
                >>    > LOCAL_DIR = /var/opt/condor
                >>    27,29c27
                >>    < PREEMPT = True
                >>    < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
                >>    <          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
                >>    ---
                >>    > PREEMPT = False
                >>
                >>    Just a bigger volume, and an 8-hour preemption quantum.
                >>
                >>    Ideas?
                >>
                >>    --
                >>    Cheers, Gary
                >>    Systems Manager, Bioinformatics
                >>    Montana State University
                >>
                >>
                >>
                >>
                >> --
                >> Cheers, Gary
                >>
                >>
                >>




            --
            Cheers, Gary





        --
        Cheers, Gary







    --
    Philip Papadopoulos, PhD
    University of California, San Diego
    858-822-3628 (Ofc)
    619-331-2990 (Fax)




--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
------------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/