Yeah, I forgot to mention: these are all Condor 6.6.9 installs (with the occasional 6.6.8), Vanilla universe, on the Windows platform.
Well, we switched the Condor master to a quad-processor machine with gigabit Ethernet running Windows 2000 Server, and we’re also submitting jobs from there (and only there). It doesn’t run a startd, so it can’t execute jobs itself. Unfortunately, we’re still having the same problem. None of the processors ever peaks, even during the longest queue of jobs, and the network card never registers over 25–30% of its available bandwidth.
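For reference, the submit-only setup is just the central manager’s condor_config with the startd left out of the daemon list — a sketch assuming the stock daemon names, since I haven’t pasted our actual file:

```
## Central manager + submit host, no startd (sketch; adjust to taste)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
```

With no STARTD in DAEMON_LIST, the machine never advertises execute slots, so it only matchmakes and submits.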
The jobs all have the same rank. I don’t set anything in the submit files, so it’s whatever the default is — 0, I guess. But for the machines that are Claimed+Idle, if I do a “condor_q -run” I get a whole bunch of jobs that think they’re running on [???????????????????????] as the machine name. Only the jobs that are actually on a Claimed+Busy machine have a real machine name next to them. The machines are most definitely idle: it’s after hours, the loads show up as 0 in condor_status, and if I submit only around 100 jobs, they all start. But when the queue gets to 200 or more, only a handful continue to run. It’s like clockwork: if I submit 100, bam, they all run; if I submit 100 more before they’re done, I only get a handful of Claimed+Busy machines until the queue chews back down to around 100 jobs.
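For what it’s worth, that ceiling around 200 queued jobs looks suspiciously close to the schedd’s MAX_JOBS_RUNNING limit, which (if I remember the 6.6 series right) defaults to 200 and caps how many jobs a single schedd will run at once. A sketch of what I’d try in the submit machine’s condor_config — the value here is just a guess, size it to your pool:

```
## Per-schedd cap on concurrently running jobs
## (the old default was 200; raise it if one schedd feeds the whole pool)
MAX_JOBS_RUNNING = 1000
```

After changing it, a condor_reconfig (or restart of the schedd) should pick it up.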
What about the DEACTIVATE_CLAIM_FORCIBLY messages that show up in all the logs? Does that mean the master is telling the worker node to stay idle because something is timing out?
I’m sort of stumped, since the negotiator and schedd on the master/submit machine never even register 25% on a processor in Performance Monitor, nothing else is running on that machine, and the network card never runs out of bandwidth. Still, I’ll try the PREEMPTION_REQUIREMENTS setting you mentioned.
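In case it helps anyone following the thread, here’s the change I’m planning on the central manager — a sketch assuming the suggestion was to stop the negotiator from preempting running jobs at all:

```
## Negotiator-side: never preempt a running job in favor of a new match
PREEMPTION_REQUIREMENTS = False
```

If that settles the Claimed+Idle churn, I’ll report back; if not, I’ll dig further into the NegotiatorLog and SchedLog around the DEACTIVATE_CLAIM_FORCIBLY entries.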