
[HTCondor-users] Matched slot refused to accept claim.



Hello, 

We have run into a strange situation. We have an execute machine with 32 slots; recently it ran a job with 3 tasks but refused to run the same job with 4 tasks. In the schedd log I found the following:

condor_negotiator[51]:       Matched 145.12 DedicatedScheduler@parallel_schedd@submit.pseven-htcondor <10.244.3.199:36291?addrs=10.244.3.199-36291&alias=submit.pseven-htcondor&noUDP&sock=schedd_823_66f0> preempting none <192.168.253.6:9618?CCBID=10.244.3.198:19618%3faddrs%3d10.244.3.198-19618%26alias%3dpseven-htcondormanager-deploy-6978c9c88f-vwkqv.pseven-htcondor%26noUDP%26sock%3dcollector#3&PrivAddr=%3c192.168.253.6:9618%3fsock%3dstartd_1112_20fe%3e&PrivNet=pseven-htcondor-remote-evil&addrs=192.168.253.6-9618&alias=evil&noUDP&sock=startd_1112_20fe> slot4@evil
condor_negotiator[51]:       Successfully matched with slot4@evil
condor_negotiator[51]:  negotiateWithGroup resources used submitterAds length 0
condor_negotiator[51]: ---------- Finished Negotiation Cycle ----------
condor_schedd[865]: Negotiation ended - 1 jobs matched
condor_schedd[865]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 1 matched, 0 rejected
condor_schedd[865]: Request was NOT accepted for claim slot4@evil <192.168.253.6:9618?CCBID=10.244.3.198:19618%3faddrs%3d10.244.3.198-19618%26alias%3dpseven-htcondormanager-deploy-6978c9c88f-vwkqv.pseven-htcondor%26noUDP%26sock%3dcollector#3&PrivAddr=%3c192.168.253.6:9618%3fsock%3dstartd_1112_20fe%3e&PrivNet=pseven-htcondor-remote-evil&addrs=192.168.253.6-9618&alias=evil&noUDP&sock=startd_1112_20fe> for DedicatedScheduler@parallel_schedd -1.-1
condor_schedd[865]: Received a superuser command
condor_schedd[865]: TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
condor_schedd[865]: TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
condor_schedd[865]: TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
condor_schedd[865]: SetAttribute modifying attribute Scheduler in nonexistent job 145.14

On the execute machine there are errors in the StartLog (only the important parts are shown below; the full log is here: https://pastebin.com/NaEdXj2U ):

1623836310 (D_ALWAYS|D_FAILURE) slot4: Job requirements not satisfied.
1623836310 (D_ALWAYS) slot4: Job ad was ============================
DiskUsage = 42
ImageSize = 0
NumJobStarts = 0
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
Requirements = ((NumJobStarts == 0) && (OpSys == "LINUX" || OpSys == "WINDOWS") && (Arch == "X86_64") && (DA__P7__RUNENV_PYTHON3 >= 13) && (DA__P7__HOST_ID == "evil")) && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
1623836310 (D_ALWAYS) slot4: Slot ad was ============================
Arch = "X86_64"
DA__P7__HOST_ID = "evil"
DA__P7__RUNENV_PYTHON3 = 13
Disk = 35019350
DiskUsage = 87
HasFileTransfer = true
Memory = 1535
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
OpSys = "WINDOWS"
ResidentSetSize = 25903268
1623836310 (D_ALWAYS) slot4: Request to claim resource refused.
1623836310 (D_ALWAYS) slot4: State change: claiming protocol failed
1623836310 (D_ALWAYS) slot4: Changing state: Unclaimed -> Owner
1623836310 (D_ALWAYS) slot4: State change: IS_OWNER is false
1623836310 (D_ALWAYS) slot4: Changing state: Owner -> Unclaimed

I guess that the task's memory or disk requirements were not met by the slot. But according to the log there is enough memory and disk space. Where else can we look to find out the problem?
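As a sanity check, here is the arithmetic from the two ads above, under the (possibly wrong) assumption that the unscoped MemoryUsage reference in RequestMemory ends up resolving from the slot ad rather than the job ad (the slot ad defines MemoryUsage, the job ad does not). The slot dictionary values below are copied from the StartLog excerpt; nothing else is from HTCondor itself:

```python
# Re-evaluating RequestMemory with the values from the StartLog above.
# Assumption (unverified): the unscoped MemoryUsage in
#   RequestMemory = ifthenelse(MemoryUsage =!= undefined, MemoryUsage, ...)
# resolves from the slot ad, where MemoryUsage = (ResidentSetSize + 1023) / 1024.

slot = {
    "Memory": 1535,            # MB, from the slot ad
    "Disk": 35019350,          # KB, from the slot ad
    "ResidentSetSize": 25903268,  # KB, from the slot ad
}
job = {
    "DiskUsage": 42,           # from the job ad
    "ImageSize": 0,            # from the job ad
}

# Slot-side MemoryUsage = (ResidentSetSize + 1023) / 1024, in MB
memory_usage = (slot["ResidentSetSize"] + 1023) // 1024   # ~25297 MB

# If MemoryUsage is defined, the ifthenelse takes the first branch:
request_memory = memory_usage

# The disk clause is fine either way:
disk_ok = slot["Disk"] >= job["DiskUsage"]        # True

# But the memory clause would fail:
memory_ok = slot["Memory"] >= request_memory      # 1535 >= 25297 -> False

print(request_memory, disk_ok, memory_ok)
```

If this assumption holds, `TARGET.Memory >= RequestMemory` evaluates to False even though the machine has plenty of free memory, which would match the "Job requirements not satisfied" message. Perhaps `condor_q -better-analyze <jobid>` against the held job would confirm or refute which clause of Requirements is failing.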

----------
Sergey Komissarov
Senior Software Developer
DATADVANCE

This message may contain confidential information
constituting a trade secret of DATADVANCE. Any distribution,
use or copying of the information contained in this
message is ineligible except under the internal
regulations of DATADVANCE and may entail liability in
accordance with the current legislation of the Russian
Federation. If you have received this message by mistake
please immediately inform me of it. Thank you!