
RE: [Condor-users] Our pool appears to work inefficiently

>   I could be exposing my lack of knowledge of the mechanics of Condor
> pools, but I am quite surprised that the performance of our pool is,
> on the whole, quite poor. The composition of the pool is
> complicated -- there are machines from different departments and/or
> subnets, so this may be a very difficult issue to analyse or for
> anyone to advise us on...
> According to condor_status most of the machines are unclaimed, but
> when I submit a batch of 100 simple jobs I find that maybe 50% of them
> will run simultaneously in the pool -- the rest are rejected, and
> condor_q tells me that machines do match but reject the jobs for
> some unknown reason. The vast majority of the machines are running
> Windows XP with SP2.
> Can anyone please advise us in this respect? For example, what might be
> wrong in the pool, or what analysis might we consider doing?

>    1216 match, match, but reject the job for unknown reasons

The trick to figuring this out would be to track down these "unknown
reasons".  Are there certain machines that are consistently able to run
jobs?  Are there certain machines that consistently fail to run jobs?
You can find successful machines by looking at the "LastRemoteHost"
attribute that condor_history <cluster.proc> -l reports.  Then see if
you can find failures by looking at the ShadowLog on the submitting
machine.  You may want to have a look at my Troubleshooting page:


My guess is that some of your machines are somehow misconfigured:
jobs go there, die, and get kicked off, only to start somewhere else
and succeed.
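
To make the "which machines succeed?" check less tedious, here is a
minimal sketch in Python that tallies the LastRemoteHost values found in
saved "condor_history -l" output and lists pool machines that never show
up. It is an illustration, not a Condor tool: the function names, the
sample text, the host names, and the assumption that each value appears
on a line like LastRemoteHost = "vm1@host" are all mine. Check them
against what your own pool prints.

```python
import re
from collections import Counter

# Assumed line format in `condor_history -l` output, e.g.:
#   LastRemoteHost = "vm1@pc-a.example.org"
LAST_HOST = re.compile(r'^\s*LastRemoteHost\s*=\s*"([^"]*)"\s*$')


def strip_slot(host):
    """Drop a leading 'name@' prefix (e.g. 'vm1@') so counts are per machine."""
    return host.split("@", 1)[1] if "@" in host else host


def tally_hosts(history_text):
    """Count how many completed jobs last ran on each machine."""
    counts = Counter()
    for line in history_text.splitlines():
        match = LAST_HOST.match(line)
        if match:
            counts[strip_slot(match.group(1))] += 1
    return counts


def never_ran(pool_machines, counts):
    """Return pool machines that do not appear as any job's LastRemoteHost."""
    return sorted(m for m in pool_machines if m not in counts)


# Made-up sample standing in for saved `condor_history -l` output.
SAMPLE = """\
ClusterId = 12
ProcId = 0
LastRemoteHost = "vm1@pc-a.example.org"

ClusterId = 12
ProcId = 1
LastRemoteHost = "vm2@pc-a.example.org"

ClusterId = 12
ProcId = 2
LastRemoteHost = "pc-b.example.org"
"""

# Hypothetical machine list; in practice take it from condor_status.
POOL = ["pc-a.example.org", "pc-b.example.org", "pc-c.example.org"]

if __name__ == "__main__":
    counts = tally_hosts(SAMPLE)
    for host, n in counts.most_common():
        print(host, n)
    print("never finished a job:", never_ran(POOL, counts))
```

Machines that never appear in the tally are not proven broken (they may
simply never have been matched), but they are the first ones whose
entries in the ShadowLog are worth reading.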

Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134