I just set up a one-machine cluster on a Fedora workstation using the default package (8.8.10). This is my first time setting up a Condor cluster using roles. I followed the "quick start" guide in the administration part of the manual, setting the CentralManager, Exec, and Submit roles along with password authentication, and everything looks good. It's an 18-core machine with hyperthreading, and 36 slots show up in condor_status.

I submitted "sleep.sub" from https://research.cs.wisc.edu/htcondor/manual/quickstart.html, and the job remains Idle. It looks like the negotiator is rejecting it because "36 reject your job because of their own requirements". That's new to me, and I could use some help debugging it.
$ condor_q -better-analyze 2.0
-- Schedd: clh-8842.lab.core : <172.16.8.48:9618?...
The Requirements expression for job 2.000 is
(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
Job 2.000 defines the following attributes:
DiskUsage = 1
ImageSize = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
The Requirements expression for job 2.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          36  TARGET.Arch == "X86_64"
[1]          36  TARGET.OpSys == "LINUX"
[3]          36  TARGET.Disk >= RequestDisk
[5]          36  TARGET.Memory >= RequestMemory
[7]          36  TARGET.HasFileTransfer
No successful match recorded.
Last failed match: Thu Sep 10 18:03:28 2020
Reason for last match failure: no match found
002.000: Run analysis summary ignoring user priority. Of 36 machines,
0 are rejected by your job's requirements
36 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are able to run your job
WARNING: Be advised:
Job did not match any machines's constraints
To see why, pick a machine that you think should match and add
-reverse -machine <name>
to your query.
For what it's worth, adding "-reverse -machine clh-8842.core.lab" to the query didn't return anything useful.
My guess is that the problem is the "undefined" in the RequestMemory attribute, but I'm not sure, and I don't know why it would be undefined.
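
One way to check that guess is to evaluate the RequestMemory expression by hand. Below is a minimal Python sketch of how I understand the expression to evaluate, not HTCondor code: `request_memory_mb` is a hypothetical helper, Python's `None` stands in for ClassAd `undefined`, and I'm assuming ImageSize is in KiB and that ClassAd division of two integers truncates.

```python
def request_memory_mb(image_size_kb, memory_usage_mb=None):
    """Mimic ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize + 1023) / 1024).

    None plays the role of ClassAd 'undefined' (an assumption of this sketch).
    """
    # '=!=' ("is not identical to") always yields true or false, never undefined,
    # so the test itself is safe even when MemoryUsage has not been set.
    if memory_usage_mb is not None:
        return memory_usage_mb
    # Fallback: round the image size up to whole MiB using integer division.
    return (image_size_kb + 1023) // 1024


# Values from the analysis output above: ImageSize = 1, MemoryUsage not set.
print(request_memory_mb(1))  # prints 1
```

With ImageSize = 1 and MemoryUsage unset, the fallback branch gives (1 + 1023) / 1024 = 1, so under these assumptions RequestMemory is well defined; "undefined" appears only as the value MemoryUsage is compared against. That would be consistent with step [5] above, where all 36 slots satisfy TARGET.Memory >= RequestMemory.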