
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3



Gary,

It may help to look in SchedLog to see what is happening to your condor_schedd.
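One quick way to do that (a sketch: `condor_config_val` is standard in the 7.x tools; the fallback path is only a guess based on the LOCAL_DIR shown later in this thread):

```shell
# Find the schedd log, then scan its tail for errors while
# reproducing the failure. The fallback path is an assumption
# for this install, not a Condor default.
if command -v condor_config_val >/dev/null 2>&1; then
    SCHEDD_LOG=$(condor_config_val SCHEDD_LOG)
else
    SCHEDD_LOG=/var/opt/condor/log/SchedLog
fi
echo "schedd log: $SCHEDD_LOG"
tail -n 100 "$SCHEDD_LOG" 2>/dev/null | grep -iE 'err|fail|denied' || true
```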

--Dan

Gary Orser wrote:
Trying to send again ...

On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser <garyorser> wrote:

    Hi all,

    I just upgraded my cluster from Rocks 5.1 to 5.3.
    This upgraded Condor from 7.2.? to 7.4.2.

    I've got everything running, but it won't stay up.
    (I had the previous configuration running with Condor for
    years; it has done millions of hours of compute.)

    I have a good repeatable test case.
    (each job runs for a couple of minutes)

    [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit
    subs/ncbi++_blastp.sub ; done
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 24.
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 25.
    Submitting job(s).
    .
    .
    .
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 53.

    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err
    is not writable by condor.

    WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out
    is not writable by condor.
    Can't send RESCHEDULE command to condor scheduler
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
    Submitting job(s)
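The two WARNING lines suggest the output directory's permissions are worth checking before anything else. A minimal check runnable as the submitting user (the `results` path is from the submit file below; which account ultimately needs write access depends on the install):

```shell
# condor_submit warned it could not verify the .out/.err files are
# writable. Inspect ownership and writability of the output
# directory from the submitting account.
RESULTS=${RESULTS:-results}
mkdir -p "$RESULTS"
ls -ld "$RESULTS"
if test -w "$RESULTS"; then
    echo "writable by $(id -un)"
else
    echo "NOT writable by $(id -un)"
fi
```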

    [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
    ####################################
    ## run distributed blast          ##
    ## Condor submit description file ##
    ####################################
    getenv      = True
    universe    = Vanilla
    initialdir  = /home/orser/tests
    executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
    input       = /dev/null
    output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
    WhenToTransferOutput = ON_EXIT_OR_EVICT
    error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
    log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
    notification = Error

    arguments   = "-db /share/data/db/nt -query
    /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5
    -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis
    -outfmt 5"

    queue

    [root@bugserv1 etc]# condor_q

    -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
    CEDAR:6001:Failed to connect to <153.90.184.186:40026>
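When condor_q reports CEDAR:6001, it helps to distinguish a dead schedd from one that is up but advertising a stale address. A sketch (`condor_status -schedd` exists in 7.x, but only answers if the collector is reachable):

```shell
# Distinguish "schedd process is gone" from "schedd is up but its
# advertised address is stale".
if ps -e -o comm= | grep -q '^condor_schedd'; then
    echo "condor_schedd process is running"
else
    echo "condor_schedd process is NOT running"
fi
# If it is running, ask the collector what it advertises:
command -v condor_status >/dev/null 2>&1 && condor_status -schedd || true
```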


    I can restart the head node with:
    /etc/init.d/rocks-condor stop
    rm -f /tmp/condor*/*
    /etc/init.d/rocks-condor start

    and the jobs that got submitted do run.
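Since `rm -f /tmp/condor*/*` is what makes the restart work, the stale state is plausibly the schedd's address file, which condor_q and condor_submit use to find the daemon. A way to check it (a sketch: `SCHEDD_ADDRESS_FILE` is a real config macro; the fallback path is a guess for this layout):

```shell
# The CEDAR 6001 errors can come from a stale schedd address file.
# Locate it and see whether it still exists / when it was written.
if command -v condor_config_val >/dev/null 2>&1; then
    ADDR_FILE=$(condor_config_val SCHEDD_ADDRESS_FILE)
else
    ADDR_FILE=/var/opt/condor/log/.schedd_address  # assumption
fi
echo "address file: $ADDR_FILE"
ls -l "$ADDR_FILE" 2>/dev/null || echo "address file missing (schedd down or never started)"
```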

    I have trawled through the archives, but haven't found anything
    that might be useful.

    I've looked at the logs, but I'm not finding any clues there.
    I can provide them if that might be useful.

    The changes from a stock install are minor.
    (I just brought the cluster up this week)

    [root@bugserv1 etc]# diff condor_config.local
    condor_config.local.08Jul09
    20c20
    < LOCAL_DIR = /mnt/system/condor
    ---
    > LOCAL_DIR = /var/opt/condor
    27,29c27
    < PREEMPT = True
    < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
    <          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
    ---
    > PREEMPT = False

    Just a bigger volume, and an 8-hour preemption quantum.

    Ideas?

    --
    Cheers, Gary
    Systems Manager, Bioinformatics
    Montana State University




--
Cheers, Gary


------------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/