
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3



Gary sent me what he was doing and I was able to reproduce it locally -- I believe I have figured out the
problem, but need some verification from the Condor folks.
New to this Condor roll is a restriction on the range of ports that Condor uses; the range defaults to 40000 - 40050.
It looks like, with many jobs started at once, various elements of Condor were running out of ports.
For example, in the ShadowLog file:
07/13 10:07:57 (5.0) (10432): Request to run on compute-0-31-0.local <10.1.255.218:40037> was ACCEPTED
07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any port within (40000 ~ 40050)
07/13 10:07:59 (5.0) (10432): RemoteResource::killStarter(): Could not send command to startd
07/13 10:07:59 (5.0) (10432): Sock::bindWithin - failed to bind any port within (40000 ~ 40050)
07/13 10:07:59 (5.0) (10432): Can't connect to queue manager: CEDAR:6001:Failed to connect to <198.202.88.76:40040>
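A quick, rough way to see whether the range really is saturated while a batch is running (just a sketch; the pattern matches ports 40000 - 40059, so trim it to the exact range in use) is to count bound sockets on the submit host:

netstat -tan | awk '$4 ~ /:400[0-5][0-9]$/' | wc -l

If that number sits at or near 50 while the errors appear, it is almost certainly port exhaustion.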

Removing the port range restriction completely on both the collector and worker nodes seems to resolve the issue.
(Gary, you can comment out the PORTHIGH and PORTLOW settings in your /opt/condor/etc/condor.config.local files and restart Condor everywhere to see if that resolves your issue.) (I also need to fix something in the Rocks roll to make this work a little more intelligently.)
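If you would rather widen the range than drop the restriction entirely, a minimal sketch of what that could look like in condor_config.local -- assuming the roll's PORTHIGH/PORTLOW simply end up as Condor's standard LOWPORT/HIGHPORT settings -- is:

# every shadow/daemon socket has to bind inside this window, so 50 ports is far too few
LOWPORT  = 40000
HIGHPORT = 41000

followed by a condor_restart everywhere so all of the daemons pick up the new range.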

Condor Team,
What would be a reasonable minimum size for the Condor port range?
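My own rough back-of-envelope, which I would welcome corrections on: with a port range configured, each running job's shadow has to bind its listening socket plus its outbound connections to the schedd and the startd/starter inside that range, so figure roughly two to three ports per concurrently running job, plus a handful for the daemons themselves. A 50-port window would then support only on the order of 15-25 simultaneous jobs, which matches how quickly this test fell over.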

-P


On Tue, Jul 13, 2010 at 9:00 AM, Philip Papadopoulos <philip.papadopoulos@xxxxxxxxx> wrote:
Gary,
Can you put up a tar file and instructions so I can download the same thing you are doing? I will try to run it on a cluster here in San Diego to see if I see the same results.

Thanks,
Phil



On Tue, Jul 13, 2010 at 8:56 AM, Gary Orser <garyorser@xxxxxxxxx> wrote:
Nope -- I added HOSTALLOW_READ, same symptom.
On the head node: /etc/init.d/rocks-condor restart


for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 106.
.
.
.
Logging submit event(s).
1 job(s) submitted to cluster 137.

WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.137.0.err is not writable by condor.

WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.137.0.out is not writable by condor.

Can't send RESCHEDULE command to condor scheduler
Submitting job(s)
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <153.90.184.186:40031>



On Tue, Jul 13, 2010 at 9:46 AM, Gary Orser <garyorser@xxxxxxxxx> wrote:
[root@bugserv1 ~]# condor_config_val -dump | grep ALL
ALL_DEBUG =
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = *
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_WRITE = $(HOSTALLOW_WRITE)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE = bugserv1.core.montana.edu, *.local, *.local
SMALLJOB = (TARGET.ImageSize <  (15 * 1024))

Looks like HOSTALLOW_READ is not set.  
Is that the same as ALLOW_READ?

On Mon, Jul 12, 2010 at 6:16 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
Can you type:

condor_config_val -dump

It looks like a security issue; see:

http://www.cs.wisc.edu/condor/manual/v7.4/3_6Security.html#sec:Host-Security

What do HOSTALLOW_READ and HOSTALLOW_WRITE look like?






On Mon, Jul 12, 2010 at 1:12 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
> Gary,
>
> It may help to look in SchedLog to see what is happening to your
> condor_schedd.
>
> --Dan
>
> Gary Orser wrote:
>>
>> Trying sending again ...
>>
>> On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser <garyorser> wrote:
>>
>>    Hi all,
>>
>>    I just upgraded my cluster from Rocks 5.1 to 5.3.
>>    This upgraded Condor from 7.2.? to 7.4.2.
>>
>>    I've got everything running, but it won't stay up.
>>    (I have had the previous configuration running with Condor for
>>    years; it has done millions of hours of compute.)
>>
>>    I have a good repeatable test case.
>>    (each job runs for a couple of minutes)
>>
>>    [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit
>>    subs/ncbi++_blastp.sub ; done
>>    Submitting job(s).
>>    Logging submit event(s).
>>    1 job(s) submitted to cluster 24.
>>    Submitting job(s).
>>    Logging submit event(s).
>>    1 job(s) submitted to cluster 25.
>>    Submitting job(s).
>>    .
>>    .
>>    .
>>    Submitting job(s).
>>    Logging submit event(s).
>>    1 job(s) submitted to cluster 53.
>>
>>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err
>>    is not writable by condor.
>>
>>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out
>>    is not writable by condor.
>>    Can't send RESCHEDULE command to condor scheduler
>>    Submitting job(s)
>>    ERROR: Failed to connect to local queue manager
>>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>    Submitting job(s)
>>    ERROR: Failed to connect to local queue manager
>>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>    Submitting job(s)
>>
>>    [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
>>    ####################################
>>    ## run distributed blast          ##
>>    ## Condor submit description file ##
>>    ####################################
>>    getenv      = True
>>    universe    = Vanilla
>>    initialdir  = /home/orser/tests
>>    executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
>>    input       = /dev/null
>>    output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
>>    WhenToTransferOutput = ON_EXIT_OR_EVICT
>>    error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
>>    log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
>>    notification = Error
>>
>>    arguments   = "-db /share/data/db/nt -query
>>    /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5
>>    -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis
>>    -outfmt 5"
>>
>>    queue
>>
>>    [root@bugserv1 etc]# condor_q
>>
>>    -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
>>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>
>>
>>    I can restart the head node with:
>>    /etc/init.d/rocks-condor stop
>>    rm -f /tmp/condor*/*
>>    /etc/init.d/rocks-condor start
>>
>>    and the jobs that got submitted do run.
>>
>>    I have trawled through the archives, but haven't found anything
>>    that might be useful.
>>
>>    I've looked at the logs, but I'm not finding any clues there.
>>    I can provide them if that might be useful.
>>
>>    The changes from a stock install are minor.
>>    (I just brought the cluster up this week)
>>
>>    [root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
>>    20c20
>>    < LOCAL_DIR = /mnt/system/condor
>>    ---
>>    > LOCAL_DIR = /var/opt/condor
>>    27,29c27
>>    < PREEMPT = True
>>    < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
>>    <          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
>>    ---
>>    > PREEMPT = False
>>
>>    Just a bigger volume and an 8-hour preemption quantum.
>>
>>    Ideas?
>>
>>    --     Cheers, Gary
>>    Systems Manager, Bioinformatics
>>    Montana State University
>>
>>
>>
>>
>> --
>> Cheers, Gary
>>
>>



--
Cheers, Gary





_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/




--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)