
Re: [Condor-users] Condor 7.4.2 not working on Rocks 5.3



Trying to send this again ...

On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser <garyorser> wrote:
Hi all,

I just upgraded my cluster from Rocks 5.1 to 5.3.
This upgraded Condor from 7.2.? to 7.4.2.

I've got everything running, but it won't stay up.
(I had the previous configuration running with Condor for years; it has done millions of hours of compute.)

I have a good repeatable test case.
(each job runs for a couple of minutes)

[orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit subs/ncbi++_blastp.sub ; done
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 24.
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 25.
Submitting job(s).
.
.
.
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 53.

WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err is not writable by condor.

WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out is not writable by condor.
Can't send RESCHEDULE command to condor scheduler
Submitting job(s)
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <153.90.184.186:40026>
Submitting job(s)
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <153.90.184.186:40026>
Submitting job(s)
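
The per-job loop is deliberate; it is what exposes the problem. For
reference, the same hundred jobs could go in as a single cluster over
one schedd connection by changing the submit file's last line from
"queue" to "queue 100" ($(Cluster).$(Process) keeps the output files
apart either way). A sketch, with ncbi++_blastp_100.sub as a made-up
name:

[orser@bugserv1 tests]$ sed 's/^queue$/queue 100/' subs/ncbi++_blastp.sub > subs/ncbi++_blastp_100.sub
[orser@bugserv1 tests]$ condor_submit subs/ncbi++_blastp_100.sub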

[orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
####################################
## run distributed blast          ##
## Condor submit description file ##
####################################
getenv      = True
universe    = Vanilla
initialdir  = /home/orser/tests
executable  = /share/bio/ncbi-blast-2.2.22+/bin/blastn
input       = /dev/null
output      = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
WhenToTransferOutput = ON_EXIT_OR_EVICT
error       = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
log         = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
notification = Error

arguments   = "-db /share/data/db/nt -query /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5 -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis -outfmt 5"

queue
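
About the "not writable by condor" warnings above: a quick sanity
check of the results/ directory the submit file points at (writetest
is just a scratch name):

[orser@bugserv1 tests]$ ls -ld /home/orser/tests/results
[root@bugserv1 tests]# su condor -s /bin/sh -c 'touch /home/orser/tests/results/writetest && rm /home/orser/tests/results/writetest'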

[root@bugserv1 etc]# condor_q

-- Failed to fetch ads from: <153.90.184.186:40026> : bugserv1.core.montana.edu
CEDAR:6001:Failed to connect to <153.90.184.186:40026>


I can restart the head node with:
/etc/init.d/rocks-condor stop
rm -f /tmp/condor*/*
/etc/init.d/rocks-condor start

and the jobs that got submitted do run.
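
Until the cause turns up, a crude watchdog along these lines would at
least keep things alive (untested sketch; the five-minute probe
interval is arbitrary):

#!/bin/sh
# if the schedd stops answering condor_q, bounce Condor as above
while sleep 300; do
    if ! condor_q >/dev/null 2>&1; then
        /etc/init.d/rocks-condor stop
        rm -f /tmp/condor*/*
        /etc/init.d/rocks-condor start
    fi
done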

I have trawled through the archives, but haven't found anything
that might be useful.

I've looked at the logs, but I'm not finding any clues there.
I can provide them if that might be useful.
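
For reference, these are the logs I've been watching, assuming the
stock file names; condor_config_val LOG resolves the log directory:

[root@bugserv1 etc]# tail -100 $(condor_config_val LOG)/SchedLog
[root@bugserv1 etc]# tail -100 $(condor_config_val LOG)/MasterLog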

The changes from a stock install are minor.
(I just brought the cluster up this week)

[root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
20c20
< LOCAL_DIR = /mnt/system/condor
---
> LOCAL_DIR = /var/opt/condor
27,29c27
< PREEMPT = True
< UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
<          RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
---
> PREEMPT = False

Just a bigger volume and an 8-hour preemption quantum.
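
For what it's worth, the effective values of the knobs in that diff
can be confirmed against the running config with condor_config_val:

[root@bugserv1 etc]# condor_config_val LOCAL_DIR
[root@bugserv1 etc]# condor_config_val PREEMPT
[root@bugserv1 etc]# condor_config_val UWCS_PREEMPTION_REQUIREMENTS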

Ideas?

--
Cheers, Gary
Systems Manager, Bioinformatics
Montana State University



