
[HTCondor-users] idle job and "Request has not yet been considered by the matchmaker"



hello,

I am trying to submit a job to a specific machine in my test pool. Both the code and the submit file have been tested with other nodes, and the only difference this time is:

Requirements = (Machine == "node2.netA.netB.netC")

When I run condor_q, the job is listed as I (idle). With

condor_q -analyze <job ID>

I see

  <ID>: Request has not yet been considered by the matchmaker.
  (...)
  <ID>: Run analysis summary. Of 23 machines,
  13 are rejected by your job's requirements
  0 reject your job because of their own requirements
  0 match but are serving other users
  10 are available to run your job

and indeed there are 10 slots on node2.netA.netB.netC (the remaining 13 are on the head node and another node). The only suggestion the analysis offers is to remove the machine-specific requirement.
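The summary shows that the requirement itself filters the pool down to exactly the 10 node2 slots, so matchmaking is not rejecting the job. A toy model of that filtering step reproduces the counts. It is plain Python, not real ClassAd evaluation, and the head/node1 hostnames and the 6/7 split of the other 13 slots are made up for illustration.

```python
from collections import Counter

TARGET = "node2.netA.netB.netC"

def build_pool():
    """Return (slot_name, machine) pairs: 13 slots elsewhere, 10 on node2.

    The head/node1 hostnames and their 6/7 split are hypothetical; only the
    totals (23 slots, 10 on node2) come from the analysis output.
    """
    pool = []
    for machine, nslots in (("head.netA.netB.netC", 6),
                            ("node1.netA.netB.netC", 7),
                            (TARGET, 10)):
        for i in range(1, nslots + 1):
            pool.append((f"slot{i}@{machine}", machine))
    return pool

def job_requirements(machine):
    """Mimic Requirements = (Machine == "node2.netA.netB.netC").

    ClassAd == compares strings case-insensitively, hence lower().
    """
    return machine.lower() == TARGET.lower()

def summarize(pool):
    """Count the slots the job's Requirements accept or reject."""
    verdicts = Counter("available" if job_requirements(m) else "rejected"
                       for _, m in pool)
    return {"total": len(pool),
            "rejected": verdicts["rejected"],
            "available": verdicts["available"]}

s = summarize(build_pool())
print(f"Of {s['total']} machines, {s['rejected']} are rejected by your "
      f"job's requirements, {s['available']} are available to run your job")
```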

In /var/log/condor/NegotiatorLog, I see

"Successfully matched with slot@xxxxxxxxxxxxxxxxxxxx"

In /var/log/condor/MatchLog, I see

"Matched <ID> <user> <IP for head:53694?addrs=IP for head-53694> preempting node2<IP for node:13698?address=IP for node2-13698> slot1@xxxxxxxxxxxxxxxxxxx"

Both of these messages recur every minute or so.
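To confirm that the recurring entries are the same job being matched to the same slot cycle after cycle, one could tally them. This is a minimal sketch assuming each MatchLog line contains `Matched <job ID> ...` and ends with the slot name, as in the excerpt quoted from MatchLog; the sample lines (job 42.0, the user name, timestamps, and documentation-range addresses) are hypothetical.

```python
import re
from collections import Counter

# "Matched <job ID> ... <slot>": capture the job ID and the last token (the slot).
MATCH_RE = re.compile(r"\bMatched\s+(\d+\.\d+)\s.*\s(\S+@\S+)\s*$")

def count_matches(log_text):
    """Count how often each (job ID, slot) pair appears in 'Matched' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        m = MATCH_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# Hypothetical excerpt: job ID, user, timestamps and addresses are made up.
SAMPLE = """\
01/10/20 10:00:05 Matched 42.0 francisco@netA.netB.netC <192.0.2.1:53694> preempting <198.51.100.12:13698> slot1@node2.netA.netB.netC
01/10/20 10:01:06 Matched 42.0 francisco@netA.netB.netC <192.0.2.1:53694> preempting <198.51.100.12:13698> slot1@node2.netA.netB.netC
01/10/20 10:02:07 Matched 42.0 francisco@netA.netB.netC <192.0.2.1:53694> preempting <198.51.100.12:13698> slot1@node2.netA.netB.netC
"""

for (job, slot), n in count_matches(SAMPLE).items():
    print(f"job {job} matched to {slot} {n} times")
```

On the real pool, the same function could be fed `open("/var/log/condor/MatchLog").read()` instead of the sample text.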

On node2, only the MASTER and STARTD daemons are running, and neither of their logs shows any mention of this job (I used tail -f to follow them at the moment of submission).

/etc/condor/condor_config is identical on node1 and node2. The only difference between them is that, although both FQDNs share the same domain (netA.netB.netC), the machines are on different subnets (node2 is in ip2.ipB.ipC, whereas node1 and the head node are in ip1.ipB.ipC). On each of the three machines, /etc/hosts contains an <IP> <name> <FQDN> entry for all three. In all condor_configs, I use

FILESYSTEM_DOMAIN = netA.netB.netC
UID_DOMAIN = netA.netB.netC
DEFAULT_DOMAIN_NAME = netA.netB.netC
TRUST_UID_DOMAIN = netA.netB.netC
SOFT_UID_DOMAIN = netA.netB.netC
TRUST_UID_DOMAIN = TRUE
STARTER_ALLOW_RUNAS_OWNER = TRUE
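Incidentally, TRUST_UID_DOMAIN is assigned twice in this list (once to the domain name, once to TRUE). In HTCondor config files a later assignment overrides an earlier one, so TRUE should be the effective value, and `condor_config_val TRUST_UID_DOMAIN` on each node would show what is actually in use. A quick duplicate check makes this kind of thing easy to spot. This is a minimal sketch with a deliberately simplified parser (plain `NAME = value` lines only; no includes, conditionals or continuation lines), assuming macro names are case-insensitive.

```python
from collections import defaultdict

CONFIG = """\
FILESYSTEM_DOMAIN = netA.netB.netC
UID_DOMAIN = netA.netB.netC
DEFAULT_DOMAIN_NAME = netA.netB.netC
TRUST_UID_DOMAIN = netA.netB.netC
SOFT_UID_DOMAIN = netA.netB.netC
TRUST_UID_DOMAIN = TRUE
STARTER_ALLOW_RUNAS_OWNER = TRUE
"""

def find_duplicates(text):
    """Return {NAME: [values in file order]} for names assigned more than once."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not a simple assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        seen[name.upper()].append(value)   # treat names case-insensitively
    return {name: values for name, values in seen.items() if len(values) > 1}

for name, values in find_duplicates(CONFIG).items():
    # A later assignment overrides an earlier one, so the last value wins.
    print(f"{name} assigned {len(values)} times: {values} (effective: {values[-1]})")
```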

All nodes use home directories exported from the head node via NFS, and have matching UIDs and GIDs.
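Since the one networking difference between the nodes is the subnet, it may be worth confirming mechanically that the three /etc/hosts copies agree with each other. This is a minimal sketch: it parses hosts-file text, reports any name that maps to different addresses on different machines, and shows which network each address falls in. The hostnames, the documentation-range addresses and the /24 netmask are all assumptions for illustration.

```python
import ipaddress

def parse_hosts(text):
    """Map every name/alias in hosts-file text to its address (first column)."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip          # hostnames are case-insensitive
    return mapping

def inconsistent_names(hosts_by_machine):
    """Return {name: {machine: ip}} for names whose address differs between machines."""
    per_name = {}
    for machine, text in hosts_by_machine.items():
        for name, ip in parse_hosts(text).items():
            per_name.setdefault(name, {})[machine] = ip
    return {name: seen for name, seen in per_name.items()
            if len(set(seen.values())) > 1}

def subnet_of(ip, prefix=24):
    """Network containing ip, assuming a /24 netmask."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

# Hypothetical entries; the same text is used for all three machines here.
ENTRIES = """\
192.0.2.1     head   head.netA.netB.netC
192.0.2.11    node1  node1.netA.netB.netC
198.51.100.12 node2  node2.netA.netB.netC
"""
HOSTS = {machine: ENTRIES for machine in ("head", "node1", "node2")}

print(inconsistent_names(HOSTS))   # {} when every copy agrees
for name, ip in parse_hosts(ENTRIES).items():
    if "." in name:                    # print FQDNs only
        print(name, ip, subnet_of(ip))
```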

Has anyone encountered this situation? Given the absence of any relevant information in node2's logs, I am at a loss as to how to proceed...

thank you for any help!
Francisco