
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3



Can you run

  condor_config_val -dump

and post the output?
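A quick way to pull just the host-security settings out of that dump
(the grep pattern is only a suggestion):

  condor_config_val -dump | grep -i ALLOW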

This looks like a security issue:

http://www.cs.wisc.edu/condor/manual/v7.4/3_6Security.html#sec:Host-Security

What do HOSTALLOW_READ and HOSTALLOW_WRITE look like?
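condor_config_val can also show where each value comes from.  For
example (the value and file shown here are illustrative defaults, not
your actual configuration):

  $ condor_config_val -v HOSTALLOW_READ
  HOSTALLOW_READ: *
     Defined in '/opt/condor/etc/condor_config', line 116.

  $ condor_config_val -v HOSTALLOW_WRITE

If HOSTALLOW_WRITE no longer matches the submit host after the
upgrade, the daemons will reject its connections.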

On Mon, Jul 12, 2010 at 1:12 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
> Gary,
>
> It may help to look in SchedLog to see what is happening to your
> condor_schedd.
>
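> (The SchedLog location varies with the install.  One way, as an
> example, to find and tail it:
>
>    condor_config_val SCHEDD_LOG
>    tail -n 50 `condor_config_val SCHEDD_LOG`
>
> If the schedd is dying right after those submits, the end of that
> log should say why.)
>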
> --Dan
>
> Gary Orser wrote:
>>
>> Trying to send this again ...
>>
>> On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser <garyorser> wrote:
>>
>>    Hi all,
>>
>>    I just upgraded my cluster from Rocks 5.1 to 5.3.
>>    This upgraded Condor from 7.2.? to 7.4.2.
>>
>>    I've got everything running, but it won't stay up.
>>    (The previous configuration ran Condor for years and logged
>>    millions of hours of compute.)
>>
>>    I have a good repeatable test case.
>>    (each job runs for a couple of minutes)
>>
>>    [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit
>>    subs/ncbi++_blastp.sub ; done
>>    Submitting job(s).
>>    Logging submit event(s).
>>    1 job(s) submitted to cluster 24.
>>    Submitting job(s).
>>    Logging submit event(s).
>>    1 job(s) submitted to cluster 25.
>>    Submitting job(s).
>>    .
>>    .
>>    .
>>    Submitting job(s).
>>    Logging submit event(s).
>>    1 job(s) submitted to cluster 53.
>>
>>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err
>>    is not writable by condor.
>>
>>    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out
>>    is not writable by condor.
>>    Can't send RESCHEDULE command to condor scheduler
>>    Submitting job(s)
>>    ERROR: Failed to connect to local queue manager
>>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>    Submitting job(s)
>>    ERROR: Failed to connect to local queue manager
>>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>    Submitting job(s)
>>
>>    [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
>>    ####################################
>>    ## run distributed blast          ##
>>    ## Condor submit description file ##
>>    ####################################
>>    getenv      = True
>>    universe    = Vanilla
>>    initialdir  = /home/orser/tests
>>    executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
>>    input       = /dev/null
>>    output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
>>    WhenToTransferOutput = ON_EXIT_OR_EVICT
>>    error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
>>    log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
>>    notification = Error
>>
>>    arguments   = "-db /share/data/db/nt -query
>>    /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5
>>    -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis
>>    -outfmt 5"
>>
>>    queue
>>
>>    [root@bugserv1 etc]# condor_q
>>
>>    -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
>>    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>
>>
>>    I can restart the head node with:
>>    /etc/init.d/rocks-condor stop
>>    rm -f /tmp/condor*/*
>>    /etc/init.d/rocks-condor start
>>
>>    and the jobs that got submitted do run.
>>
>>    I have trawled through the archives, but haven't found anything
>>    that might be useful.
>>
>>    I've looked at the logs, but not finding any clues there.
>>    I can provide them if that might be useful.
>>
>>    The changes from a stock install are minor.
>>    (I just brought the cluster up this week)
>>
>>    [root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
>>    20c20
>>    < LOCAL_DIR = /mnt/system/condor
>>    ---
>>    > LOCAL_DIR = /var/opt/condor
>>    27,29c27
>>    < PREEMPT = True
>>    < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
>>    <          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
>>    ---
>>    > PREEMPT = False
>>
>>    Just a bigger volume and an 8-hour preemption quantum.
>>
>>    Ideas?
>>
>>    --
>>    Cheers, Gary
>>    Systems Manager, Bioinformatics
>>    Montana State University
>>
>>
>>
>>
>> --
>> Cheers, Gary
>>
>>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>