[Condor-users] condor 6.8.2 + RHEL 4 - jobs stay idle, never run



I've got a working 6.6.10 pool but there doesn't seem to be a 6.6.x
release for RHEL 4-amd64 so I'm trying to get 6.8.2 working on those
hosts.

I'm thinking maybe my problem is caused by a new host Requirement for
Checkpoint stuff:

6.6.10:

	Requirements = START

6.8.2:

	Requirements = (START) && (IsValidCheckpointPlatform)

We're using the vanilla universe here - or trying to, anyway.  I have
it set as the default in condor_config and the job control file says
to use it as well.  I have not, however, found whatever is adding
"&& (IsValidCheckpointPlatform)" to the host's Requirements, or a
setting that stops it.  It isn't in condor_config or condor_config.local.

When I look in the logs for what is going on I see the following on
the submit host:

11/17 09:55:02 (pid:15196) Activity on stashed negotiator socket
11/17 09:55:02 (pid:15196) Negotiating for owner: nomad@ee.(domain obscured)
11/17 09:55:02 (pid:15196) Checking consistency running and runnable jobs
11/17 09:55:02 (pid:15196) Tables are consistent
11/17 09:55:02 (pid:15196) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
11/17 09:55:02 (pid:15196) Sent ad to central manager for nomad@ee.(domain obscured)
11/17 09:55:02 (pid:15196) Sent ad to 1 collectors for nomad@ee.(domain obscured)
11/17 09:55:02 (pid:15196) Sent RELEASE_CLAIM to startd on <128.208.233.100:41261>
11/17 09:55:02 (pid:15196) Match record (<128.208.233.100:41261>, 2, 0) deleted



On the host it is negotiating with I see:

11/17 09:55:02 DaemonCore: Command received via UDP from condor from host <128.208.232.24:33683>
11/17 09:55:02 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/17 09:55:02 vm1: match_info called
11/17 09:55:02 vm1: Received match <128.208.233.100:41261>#1163786034#5
11/17 09:55:02 vm1: State change: match notification protocol successful
11/17 09:55:02 vm1: Changing state: Unclaimed -> Matched
11/17 09:55:02 DaemonCore: Command received via TCP from condor from host <128.208.232.90:37122>
11/17 09:55:02 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/17 09:55:02 vm1: Request to claim resource refused.
11/17 09:55:02 vm1: Job requirements not satisfied.
11/17 09:55:02 vm1: State change: claiming protocol failed
11/17 09:55:02 vm1: Changing state: Matched -> Owner
11/17 09:55:02 vm1: State change: IS_OWNER is false
11/17 09:55:02 vm1: Changing state: Owner -> Unclaimed
11/17 09:55:02 DaemonCore: Command received via UDP from condor from host <128.208.232.90:35833>
11/17 09:55:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/17 09:55:02 Warning: can't find resource with ClaimId (<128.208.233.100:41261>#1163786034#5)
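
To make the sequence in that log easier to follow, here is a small Python sketch that pulls out only the "Changing state" lines. It is not part of Condor; the regular expression and the function name `transitions` are my own, and the sample lines are copied from the log above.

```python
import re

# Sample lines copied from the startd log above.
LOG = """\
11/17 09:55:02 vm1: Changing state: Unclaimed -> Matched
11/17 09:55:02 vm1: Request to claim resource refused.
11/17 09:55:02 vm1: Job requirements not satisfied.
11/17 09:55:02 vm1: Changing state: Matched -> Owner
11/17 09:55:02 vm1: Changing state: Owner -> Unclaimed
"""

# Assumed line layout: "<date> <time> <vm>: Changing state: <old> -> <new>"
TRANSITION = re.compile(r"^\S+ \S+ (\w+): Changing state: (\w+) -> (\w+)$")

def transitions(text):
    """Return (vm, old_state, new_state) for every state-change line."""
    found = []
    for line in text.splitlines():
        match = TRANSITION.match(line)
        if match:
            found.append(match.groups())
    return found

for vm, old, new in transitions(LOG):
    print(f"{vm}: {old} -> {new}")
```

Read this way, vm1 goes Unclaimed -> Matched -> Owner -> Unclaimed within the same second, with the refused claim between the first and second transitions.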


I turned on D_ALL debugging levels and still don't see what is causing
the rejection.  It just says it is rejecting the job.

condor_q -analyze says:

-- Submitter: stefen.ee.washington.edu : <128.208.232.90:37109> : stefen.ee.washington.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
---
002.000:  Run analysis summary.  Of 385 machines,
    385 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Fri Nov 17 09:56:18 2006

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = ((MY.RESOURCE_GROUP == TARGET.JOB_GROUP)) && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)


1 jobs; 1 idle, 0 running, 0 held



When I check the Requirements listed here they all match.  I can't find
anything that doesn't match.
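
For checking the clauses one at a time, here is a rough Python sketch. It is not the Condor evaluator: the attribute values are made-up placeholders rather than values from my pool, and the helper names are hypothetical. It only mimics one ClassAd rule, namely that a comparison involving a missing attribute comes out UNDEFINED rather than TRUE, and Requirements has to evaluate to TRUE for a match.

```python
import operator

UNDEFINED = None  # stands in for a ClassAd attribute that is not defined

def compare(op, left, right):
    """Compare two values; the result is UNDEFINED if either side is missing."""
    if left is UNDEFINED or right is UNDEFINED:
        return UNDEFINED
    return op(left, right)  # plain Python comparison, not full ClassAd semantics

def check_clauses(job, machine):
    """Evaluate each conjunct of the job's Requirements separately.

    Assumption: MY is the job ad, TARGET is the machine ad, and the
    unqualified Arch/OpSys/Disk/Memory come from the machine ad.
    """
    mem = machine.get("Memory")
    mem_kb = mem * 1024 if mem is not UNDEFINED else UNDEFINED
    return {
        "MY.RESOURCE_GROUP == TARGET.JOB_GROUP":
            compare(operator.eq, job.get("RESOURCE_GROUP"), machine.get("JOB_GROUP")),
        'Arch == "X86_64"':
            compare(operator.eq, machine.get("Arch"), "X86_64"),
        'OpSys == "LINUX"':
            compare(operator.eq, machine.get("OpSys"), "LINUX"),
        "Disk >= DiskUsage":
            compare(operator.ge, machine.get("Disk"), job.get("DiskUsage")),
        "(Memory * 1024) >= ImageSize":
            compare(operator.ge, mem_kb, job.get("ImageSize")),
        "TARGET.FileSystemDomain == MY.FileSystemDomain":
            compare(operator.eq, machine.get("FileSystemDomain"), job.get("FileSystemDomain")),
    }

# Placeholder ads (hypothetical values); the machine ad deliberately lacks JOB_GROUP.
job = {"RESOURCE_GROUP": "ssli", "DiskUsage": 10, "ImageSize": 5000,
       "FileSystemDomain": "ee.example.edu"}
machine = {"Arch": "X86_64", "OpSys": "LINUX", "Disk": 1000000, "Memory": 2048,
           "FileSystemDomain": "ee.example.edu"}

results = check_clauses(job, machine)
for clause, value in results.items():
    label = "UNDEFINED" if value is UNDEFINED else str(value).upper()
    print(f"{label:9} {clause}")
```

With real values pasted in from the job ad and one machine ad, any clause that is not TRUE is the one to chase.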

I've tried this with both our production condor_master (6.6.10) and
a 6.8.2 master.


Can anyone offer any advice|guidance? Please?

nomad
Sr. System Admin, UWEE SSLI Lab