
Re: [Condor-users] Our pool appears to work inefficiently



Hello,

My 1 euro cent:

I saw the same kind of symptoms after tripling the number of nodes: the NFS server was simply not answering fast enough. In my case, the solution was to allow more NFS server instances and to mount the shared partition with the
options 'rw,hard,nointr,tcp,vers=3,rsize=32k,wsize=32k,bg'.


Now submitted jobs are picked up quickly, and on all the nodes.
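
In case it helps anyone check their own nodes, here is a minimal Python sketch that reports which options each NFS mount really ended up with. It assumes a Linux node and the usual /proc/mounts layout (device, mount point, filesystem type, options). The function names are made up for this example. As far as I know, the kernel reports options in its own spelling there (for example proto=tcp and rsize=32768, not tcp and rsize=32k), so the wanted list has to use that spelling.

```python
def nfs_mounts(mounts_text):
    """Return {mount_point: set_of_options} for NFS entries in /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in ("nfs", "nfs4"):
            result[mount_point] = set(options.split(","))
    return result


def missing_options(mounts_text, wanted):
    """Map each NFS mount point to the sorted list of wanted options it lacks."""
    return {
        mount_point: sorted(set(wanted) - options)
        for mount_point, options in nfs_mounts(mounts_text).items()
    }
```

On a node, something like missing_options(open("/proc/mounts").read(), ["hard", "vers=3", "proto=tcp"]) should return an empty list for every correctly mounted share.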

	Hoping to help

	Alain


Michael Yoder wrote:
I could be exposing my lack of knowledge of the mechanics of condor
pools, however on hand I am quite surprised that the performance of the
pool is, on the whole, quite poor. The composition of the pool is
complicated -- there are machines from different departments and/or
subnet, and so this may be a very difficult issue to analyse or for any
one to advise us on...

According to condor_status most of the machines are unclaimed, however
when I submit a batch of 100 simple jobs I find that maybe 50% of them
will run simultaneously in the pool -- the rest are rejected, and
condor_q tells me that machines do match however reject the jobs for
some unknown reason. The vast majority of the machines are running XP
with SP2.

Can anyone please advise us in this respect. For example what might be
wrong in the pool, or what analysis might we consider doing?


1216 match, match, but reject the job for unknown reasons


The trick to figuring this out would be to track down these "unknown
reasons".  Are there certain machines that are consistently able to run
jobs?  Are there certain machines that consistently fail to run jobs?
You can find successful machines by looking at the "LastRemoteHost"
attribute that condor_history <cluster.proc> -l reports.  Then see if
you can find failures by looking at the ShadowLog on the submitting
machine.  You may want to have a look at my Troubleshooting page:

http://docs.optena.com/display/CONDOR/Troubleshooting

My guess is that some of your machines are somehow mis-configured and
that jobs are going there, dying, and getting kicked off, only to start
somewhere else and succeed.
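
The LastRemoteHost check described above could be sketched in Python roughly as follows. This is only an illustration, not part of Condor: the function name is made up, and it assumes the output of condor_history -l has been saved to a file and that each job ad contains a line of the form LastRemoteHost = "host". A leading "vm1@"-style prefix, if present, is dropped so that counts are per machine.

```python
import re
from collections import Counter

# Matches a line such as: LastRemoteHost = "vm1@node07.example.org"
_LAST_REMOTE_HOST = re.compile(r'^LastRemoteHost\s*=\s*"(.*)"\s*$')


def count_last_remote_hosts(history_text):
    """Count how many job ads in `condor_history -l` output name each machine."""
    counts = Counter()
    for line in history_text.splitlines():
        match = _LAST_REMOTE_HOST.match(line.strip())
        if match:
            host = match.group(1)
            # Drop a leading slot/VM prefix such as "vm1@" so counts are per machine.
            counts[host.split("@", 1)[-1]] += 1
    return counts
```

Machines that show up in condor_status but never appear in these counts would be the first candidates to cross-check against the ShadowLog.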


Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com





--
Dr Alain EMPAIN <alain.empain@xxxxxxxxx> <alain@xxxxxxxxxx>
Bioinformatics, Molecular Genetics, Fac. Med. Vet.,
University of LIEGEe, Belgium
Bd de Colonster, B43
B-4000 LIEGEe (Sart-Tilman)
WORK: +32 4 366 4159
FAX:  +32 4 366 4122
HOME: rue des Martyrs,7 B- 4550 Nandrin +32 85 51 2341
GSM:  +32 497 70 1764

"I worry about my child and the Internet all the time, even though she's
too young to have logged on yet. Here's what I worry about. I worry that
10 or 15 years from now, she will come to me and say 'Daddy, where were
you when they took freedom of the press away from the Internet?'"
--Mike Godwin, Electronic Frontier Foundation