
Re: [HTCondor-users] Cannot get all workers in a cluster to work on jobs

Brian Pipa <brianpipa@xxxxxxxxx> wrote:
> The problem is, I can never get all 4 machines to work on jobs.

When you're doing your 100 job tests, are you only doing
submissions from one host at a time?  If all of the machines have
jobs submitted at the same time, it's entirely possible that
HTCondor will hand each schedd an execute node or two to run on
and not bother ever changing it.

- Suggestion: be sure all of the queues except one are empty.

Are your jobs really short?  HTCondor may rapidly be spinning jobs
onto the same subset of matched slots as they are quickly freed.  

- Suggestion: make sure your jobs aren't trivial.  I like
  "executable=/bin/sleep" with "arguments=300".
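
A minimal sketch of such a test submission, assuming a 100-job run
of 300-second sleeps.  The helper name sleep_submit_text and the
file names sleep_test.log and sleep_test.sub are made up for
illustration; only the executable and arguments come from the
suggestion above.

```python
def sleep_submit_text(n_jobs: int = 100, seconds: int = 300) -> str:
    """Build a submit description that queues n_jobs sleep jobs."""
    lines = [
        "executable = /bin/sleep",
        f"arguments  = {seconds}",
        # Hypothetical name for the job event log; pick any path you like.
        "log        = sleep_test.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the description; redirect it into a file to use it.
    print(sleep_submit_text(), end="")
```

Saving the output as sleep_test.sub and running
"condor_submit sleep_test.sub" on a single submit node should
queue jobs that each hold a slot for five minutes, long enough to
see which slots are actually claimed.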

Now you say that slots are getting matches, but not actually
used.  If none of the above helps, I wonder if you're running
into firewalls or other problems.  Check the SchedLog and
ShadowLog on the submit node, especially looking for complaints
about problems connecting or anything with "error" in the text.
You might also check the StartLog on an execute node that isn't
running jobs, keeping an eye open for complaints about
connections being rejected for security reasons or that a job was
rejected for policy reasons (typically START/REQUIREMENTS).  You
might also check the NegotiatorLog on your central manager
for errors, but that's more of a long shot.
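
One way to do that scan, sketched under some assumptions: the
keyword list below is a guess at what connection or policy
complaints might contain, not a list of real HTCondor messages,
and the log directory is passed in by hand
("condor_config_val LOG" prints it on each machine).

```python
import re
import sys
from pathlib import Path

# Assumed keywords; adjust to whatever your logs actually say.
PATTERN = re.compile(r"error|fail|refused|denied|rejected", re.IGNORECASE)


def scan_logs(log_dir, names):
    """Return (file name, line number, text) for each suspicious line."""
    hits = []
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # not every daemon runs on every machine
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if PATTERN.search(line):
                    hits.append((name, lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    wanted = ["SchedLog", "ShadowLog", "StartLog", "NegotiatorLog"]
    for name, lineno, line in scan_logs(log_dir, wanted):
        print(f"{name}:{lineno}: {line}")
```

Run it once per machine; only the logs for daemons that live on
that machine will be present, and the rest are skipped.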

If none of that works, my next suggestion is:

- Shut all of HTCondor down.

- Erase your logs (or move them elsewhere).

- Restart HTCondor on your nodes.

- Submit your 100 jobs on one node.

- Wait until jobs are running on some slots and it's clear that
  they're not running on others.

- Wait 5ish more minutes.

- Shut all of HTCondor down.

- Gather up and send to this list (if they're small) or
  htcondor-admin@xxxxxxxxxxx with a note that adesmet promised to
  take a look:
  	- SchedLog and ShadowLog from your submit node.
  	- StartLog and StarterLog from an execute node that should
  	  run your jobs, but doesn't.
  	- NegotiatorLog from your central manager.
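
The gathering step could be scripted roughly as follows.  This is
a sketch, not an official tool: bundle_logs and the archive name
are invented here, and the prefix match assumes per-slot starter
logs are named like StarterLog.slot1, which may differ on your
installation.

```python
import sys
import tarfile
from pathlib import Path

# Log names taken from the list above.
WANTED = ["SchedLog", "ShadowLog", "StartLog", "StarterLog", "NegotiatorLog"]


def bundle_logs(log_dir, archive_path):
    """Pack any of the wanted logs found in log_dir into a .tar.gz."""
    log_dir = Path(log_dir)
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for prefix in WANTED:
            # The prefix match also picks up rotated or per-slot files.
            for path in sorted(log_dir.glob(prefix + "*")):
                if path.is_file():
                    tar.add(str(path), arcname=path.name)
                    added.append(path.name)
    return added


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python bundle_logs.py <log directory> <archive.tar.gz>
    print(bundle_logs(sys.argv[1], sys.argv[2]))
```

Run it on the submit node, on one idle execute node, and on the
central manager, giving each archive a different name so it is
clear which machine the logs came from.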

Alan De Smet                 Center for High Throughput Computing
adesmet@xxxxxxxxxxx                       http://chtc.cs.wisc.edu