
[Condor-users] Condor jobs not starting on a specific machine



Hi,

We've been having an odd problem with one of the machines in our small Condor pool. The machine is configured as a submit & execute node, & the required Condor daemons (master, startd, schedd & procd) are all running on it. The machine's 24 slots show as Unclaimed in the condor_status report (& are all listed in condor_status -long).
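
For completeness, this is roughly how I've been checking the daemons & slot state on the machine (the hostname below is the problem machine's):

  ps -C condor_master,condor_startd,condor_schedd,condor_procd -o pid,args
  condor_status -long starnet.st-and.ac.uk | grep -E '^(State|Activity) ='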

However, jobs fail to run on the machine & it's currently unclear to me why. When I submit a job specification that specifically targets that machine, the local scheduler log reports the following:

03/11 15:29:38 (pid:6827) Activity on stashed negotiator socket
03/11 15:29:38 (pid:6827) Negotiating for owner: pb337@xxxxxxxxxxxx
03/11 15:29:38 (pid:6827) Out of servers - 0 jobs matched, 1 jobs idle, 0 jobs rejected
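
For reference, the submit description targeting the machine looks roughly like this (the executable & file names here are just placeholders, not the real job):

  # placeholder job; the real executable & I/O files differ
  executable   = test_job
  requirements = (Machine == "starnet.st-and.ac.uk")
  log          = test_job.log
  output       = test_job.out
  error        = test_job.err
  queue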

condor_q -analyze for the problem job, specified to run on the problem machine, reports:

007.000:  Run analysis summary.  Of 75 machines,
     75 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
No successful match recorded.
Last failed match: Fri Mar 11 17:00:11 2011
Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = ((OpSys == "LINUX" && ARCH == "X86_64") && (Machine == "starnet.st-and.ac.uk")) && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && ((RequestMemory * 1024) >= ImageSize)


1 jobs; 1 idle, 0 running, 0 held

So when I specify the job to run on that machine, it is queued as idle & simply never matched with any of the machine's available slots. If I submit the same job to another machine in the pool, it runs immediately. None of the requirements listed above should conflict with the problem machine's spec, but somehow they do.
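
As a sanity check, the static part of that requirements expression can be fed straight back to the collector; I'd expect something like the following to list the 24 slots if the machine's ad really does satisfy it:

  condor_status -constraint '(OpSys == "LINUX" && Arch == "X86_64") && (Machine == "starnet.st-and.ac.uk")'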

All the submit & execute machines in the pool are running the same Condor version (7.4.4, Oct 13 2010, BuildID 279383, X86_64-LINUX_RHEL5) on the same version of Scientific Linux. The local configuration file on the problem machine is no different from that of a machine which runs jobs as expected. If I do a condor_config_val -dump on the problem machine & compare it with the same dump from a working machine, the only differences are the ones you would expect (number of cores, amount of memory, hostnames, IP addresses, PIDs & PPIDs).
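
For what it's worth, that comparison was along these lines (the file names are just what I used on the day):

  condor_config_val -dump > /tmp/problem.config    # on the problem machine
  condor_config_val -dump > /tmp/working.config    # on a working machine
  diff /tmp/problem.config /tmp/working.config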

Would anyone have any idea what might be causing this problem? The machine has 24 processing slots, so getting it working again would be a real help. A week ago we attempted to set up a checkpoint server on the machine which also serves as the pool's central manager, but later rolled that change back. Might that be causing the problem? The checkpoint server is a 32-bit machine, whereas the rest of the pool is entirely 64-bit.
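
If leftover checkpoint-server settings could matter, I was planning to double-check the problem machine & the central manager with something like the following (I'm assuming USE_CKPT_SERVER & CKPT_SERVER_HOST are the relevant settings; -v shows where each value was defined):

  condor_config_val -v USE_CKPT_SERVER
  condor_config_val -v CKPT_SERVER_HOST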

Thanks,
Paul Browne.

--
__________________________________
Mr. Paul Browne
School of Physics & Astronomy,
University of St. Andrews,
North Haugh, St. Andrews,
Fife, KY16 9SS,
Scotland, UK

t:  +44 (0)1334 46 3152
e:  pb337@xxxxxxxxxxxxxxxx
__________________________________