[HTCondor-users] Not running Parallel-universe jobs?



Hi all,
	I have inherited, and am trying to maintain, a condor cluster.  It was
working nicely on its own for a while.  But we recently had some power
outages that corrupted some client machines.  Ever since, we've had a
periodic problem where Parallel-universe jobs just won't run.  I have an
example right now (hostnames and job criteria omitted; let me know if
they are relevant):

"""
$ condor_q -analyze 9800
-- Submitter: <IPs, hostnames, etc>
---
9800.000:  Run analysis summary.  Of 308 machines,
    260 are rejected by your job's requirements 
      1 reject your job because of their own requirements 
     41 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
      0 match but will not currently preempt their existing job 
      0 match but are currently offline 
      6 are available to run your job
"""

	It says that 6 machines are available to run this job, but the job has
been sitting in the "Queued" state for over 20 minutes.  Other jobs have
been waiting for almost a day.

	If I submit a Vanilla-universe job, it will run right away.

	I have machines; I'd like these jobs to run on them.  What am I
missing?

	In case it's relevant, the condor server itself is somewhat older; it
self-reports as running condor 7.6.6.  Most clients are running condor
8.2.1.

Thanks,
Adam