
Re: [HTCondor-users] jobs in Idle state



Hi,

Your job requests 8 cores and 16 GB of memory.

Only 9 slots have enough memory, and none of your slots has 8 cores!

Use condor_status to check your slots, and set up partitionable slots.
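
To make the mismatch concrete, here is a minimal Python sketch of the two clauses that reject the job. It is only an illustration, not HTCondor code: the function name slot_matches is made up, the job and static-slot numbers are copied from the output quoted below, and the partitionable slot is an assumption (one slot owning the whole node, using the TotalCpus and TotalMemory values from the machine ad).

```python
# Illustrative model only: this is not HTCondor code, just the two
# comparisons from the job's Requirements expression that fail.

def slot_matches(slot, job):
    """Return True if the slot satisfies the Memory and Cpus clauses."""
    return (slot["Memory"] >= job["RequestMemory"]
            and slot["Cpus"] >= job["RequestCpus"])

# Values taken from the condor_q / condor_status output in this thread.
job = {"RequestCpus": 8, "RequestMemory": 16000}
static_slot = {"Cpus": 1, "Memory": 4719}  # slot1 on clrwn333

# Assumption: a single partitionable slot that owns the whole node,
# sized from TotalCpus / TotalMemory in the same machine ad.
partitionable_slot = {"Cpus": 41, "Memory": 193500}

print(slot_matches(static_slot, job))         # False
print(slot_matches(partitionable_slot, job))  # True
```

On the configuration side, partitionable slots are normally enabled on the execute nodes with knobs such as NUM_SLOTS_TYPE_1, SLOT_TYPE_1 and SLOT_TYPE_1_PARTITIONABLE; the exact values depend on the site, so please check them against the manual for your HTCondor version.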

Best,
Christoph


-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


----- Original Message -----
From: Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
CC: Jean-Claude CHEVALEYRE <chevaleyre@xxxxxxxxxxxxxxxxx>
Sent: Tue, 27 Oct 2020 17:58:41 +0100 (CET)
Subject: [HTCondor-users] jobs in Idle state

Hello

Some jobs on my Condor site remain in the Idle state, and I would like to understand why.

For example, job 34553 is Idle, but I have workers with many slots that are unclaimed. Below is some information provided by the standard commands.

For example, the worker clrwn333 is empty.

[root@clrarcce01 ~]# condor_status |grep 333
slot1@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+02:59:36
slot2@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot3@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot4@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot5@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot6@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot7@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot8@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot9@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot10@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot11@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot12@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot13@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot14@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot15@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot16@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot17@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot18@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot19@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot20@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot21@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot22@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot23@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot24@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot25@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot26@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot27@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot28@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot29@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot30@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot31@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot32@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot33@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot34@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot35@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot36@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot37@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot38@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot39@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot40@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04
slot41@xxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  4719  0+03:00:04


[root@clrarcce01 ~]# condor_q -better-analyse  34757

-- Schedd: clrarcce01.in2p3.fr : <134.158.121.102:3682>
The Requirements expression for job 34757.000 is

    ((NumJobStarts == 0) && ((RequestCpus == 8 || RequestCpus == 1))) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
    (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)

Job 34757.000 defines the following attributes:

    DiskUsage = 2000
    NumJobStarts = 0
    RequestCpus = 8
    RequestDisk = DiskUsage
    RequestMemory = 16000

The Requirements expression for job 34757.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[5]        1608  TARGET.Arch == "X86_64"
[7]        1608  TARGET.OpSys == "LINUX"
[9]        1608  TARGET.Disk >= RequestDisk
[11]          9  TARGET.Memory >= RequestMemory
[13]          0  TARGET.Cpus >= RequestCpus


34757.000:  Run analysis summary ignoring user priority.  Of 1608 machines,
   1608 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are able to run your job

WARNING:  Be advised:
   No machines matched the jobs's constraints



[root@clrarcce01 ~]# condor_q -long  34757
Arguments = ""
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,MachineLastMatchTime,NumCkpts,ConcurrencyLimits,NiceUser,Rank,Requirements,DiskUsage,NumJobStarts,RequestCpus,RequestDisk,RequestMemory"
AutoClusterId = 176
BufferBlockSize = 32768
BufferSize = 524288
ClusterId = 34757
Cmd = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln/condorjob.sh"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: x86_64_CentOS7 $"
CondorVersion = "$CondorVersion: 8.8.8 Mar 19 2020 BuildID: 498525 PackageID: 8.8.8-1 $"
CoreSize = -1
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 0
DiskUsage = 2000
DiskUsage_RAW = 1897
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1603808947
Environment = ""
Err = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln.comment"
ExecutableSize = 17
ExecutableSize_RAW = 16
ExitBySignal = false
ExitStatus = 0
GlobalJobId = "clrarcce01.in2p3.fr#34757.0#1603808947"
ImageSize = 17
ImageSize_RAW = 16
In = "/dev/null"
Iwd = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln"
JobCpuLimit = 2764800
JobDescription = "N8ccbe75f_38b1_"
JobLeaseDuration = 2400
JobMemoryLimit = 16384000
JobNotification = 0
JobPrio = 40
JobStatus = 1
JobTimeLimit = 345600
JobUniverse = 5
LastSuspensionTime = 0
LeaveJobInQueue = false
LocalSysCpu = 0.0
LocalUserCpu = 0.0
MaxHosts = 1
MinHosts = 1
MyType = "Job"
NiceUser = false
NordugridQueue = "IN2P3-LPC-ARC"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobStarts = 0
NumRestarts = 0
NumSystemHolds = 0
OnExitHold = false
OnExitRemove = true
Out = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln.comment"
Owner = "atlasprd"
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit
ProcId = 0
QDate = 1603808947
Rank = 0.0
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 0.0
RequestCpus = 8
RequestDisk = DiskUsage
RequestMemory = 16000
Requirements = ((NumJobStarts == 0) && ((RequestCpus == 8 || RequestCpus == 1))) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)
RootDir = "/"
ServerTime = 1603814464
ShouldTransferFiles = "YES"
StreamErr = false
StreamOut = false
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferIn = false
TransferInput = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln"
TransferInputSizeMB = 1
User = "atlasprd@xxxxxxxxxxxxxxxxxxxxx"
UserLog = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln/log"
WantCheckpoint = false
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
x509userproxy = "/var/spool/arc/sessiondir/bSdLDmM9erxnNlgJ7owi1f5nABFKDmABFKDmt53SDmABFKDmvnzXln/user.proxy"
x509UserProxyEmail = "atlas.act1@xxxxxxx"
x509UserProxyExpiration = 1604154055
x509UserProxyFirstFQAN = "/atlas/Role=production/Capability=NULL"
x509UserProxyFQAN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlact1/CN=555105/CN=Robot: ATLAS aCT 1,/atlas/Role=production/Capability=NULL,/atlas/Role=NULL/Capability=NULL,/atlas/lcg1/Role=NULL/Capability=NULL"
x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlact1/CN=555105/CN=Robot: ATLAS aCT 1"
x509UserProxyVOName = "atlas"


[root@clrarcce01 ~]#  condor_status -long -startd slot1@xxxxxxxxxxxxxxxxx

AcceptedWhileDraining = false
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"134.158.123.160\"; port=9618; n=\"Internet\"; spid=\"11771_8162_3\"; noUDP=true; ], [ p=\"IPv4\"; a=\"134.158.123.160\"; port=9618; n=\"Internet\"; spid=\"11771_8162_3\"; noUDP=true; ]}"
Arch = "X86_64"
AuthenticatedIdentity = "condor_pool@xxxxxxxxxxxxxxxxxxxxx"
AuthenticationMethod = "PASSWORD"
CanHibernate = true
CheckpointPlatform = "LINUX X86_64 3.10.0-1127.19.1.el7.x86_64 normal N/A avx ssse3 sse4_1 sse4_2"
ClockDay = 2
ClockMin = 1023
COLLECTOR_HOST_STRING = "clrhtcmgt.in2p3.fr"
CondorLoadAvg = 0.0
CondorPlatform = "$CondorPlatform: x86_64_CentOS7 $"
CondorVersion = "$CondorVersion: 8.8.8 Mar 19 2020 BuildID: 498525 PackageID: 8.8.8-1 $"
ConsoleIdle = 11208
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.5)
CpuBusyTime = 0
CpuCacheSize = 25600
CpuFamily = 6
CpuIsBusy = false
CpuModelNumber = 62
Cpus = 1
CurrentRank = 0.0
DaemonCoreDutyCycle = 0.0002250186936978427
DaemonLastReconfigTime = 1603803491
DaemonStartTime = 1603803491
DetectedCpus = 40
DetectedMemory = 128719
Disk = 42076706
EnteredCurrentActivity = 1603803527
EnteredCurrentState = 1603803499
ExpectedMachineGracefulDrainingBadput = 0
ExpectedMachineGracefulDrainingCompletion = 1603803499
ExpectedMachineQuickDrainingBadput = 0
ExpectedMachineQuickDrainingCompletion = 1603803499
FileSystemDomain = "clrwn333.in2p3.fr"
HardwareAddress = "34:17:eb:e4:65:d4"
has_avx = true
has_sse4_1 = true
has_sse4_2 = true
has_ssse3 = true
HasFileTransfer = true
HasFileTransferPluginMethods = "file,ftp,http,data,https"
HasIOProxy = true
HasJava = true
HasJICLocalConfig = true
HasJICLocalStdin = true
HasJobDeferral = true
HasMPI = true
HasPerFileEncryption = true
HasReconnect = true
HasSelfCheckpointTransfers = true
HasSingularity = true
HasTDP = true
HasTransferInputRemaps = true
HasVM = false
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S4"
IsLocalStartd = false
IsValidCheckpointPlatform = (TARGET.JobUniverse =!= 1 || ((MY.CheckpointPlatform =!= undefined) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
IsWakeAble = false
IsWakeOnLanEnabled = false
IsWakeOnLanSupported = true
JavaMFlops = 1792.321777
JavaSpecificationVersion = "1.8"
JavaVendor = "Oracle Corporation"
JavaVersion = "1.8.0_262"
JobPreemptions = 0
JobRankPreemptions = 0
JobStarts = 0
JobUserPrioPreemptions = 0
KeyboardIdle = 11112
KFlops = 1663715
LastBenchmark = 1603803527
LastFetchWorkCompleted = 0
LastFetchWorkSpawned = 0
LastHeardFrom = 1603814603
LastUpdate = 1603803527
LoadAvg = 0.0
Machine = "clrwn333.in2p3.fr"
MachineMaxVacateTime = 10 * 60
MachineResources = "Cpus Memory Disk Swap"
MaxJobRetirementTime = (3600 * 72)
Memory = 4719
Mips = 27211
MonitorSelfAge = 11049
MonitorSelfCPUUsage = 0.0208049811812545
MonitorSelfImageSize = 50448
MonitorSelfRegisteredSocketCount = 0
MonitorSelfResidentSetSize = 8764
MonitorSelfSecuritySessions = 42
MonitorSelfTime = 1603814539
MyAddress = "<134.158.123.160:9618?addrs=134.158.123.160-9618&noUDP&sock=11771_8162_3>"
MyCurrentTime = 1603814603
MyType = "Machine"
Name = "slot1@xxxxxxxxxxxxxxxxx"
NextFetchWorkDelay = -1
NODE_IS_HEALTHY = true
NODE_STATUS = "All_OK"
NumPids = 0
OpSys = "LINUX"
OpSysAndVer = "CentOS7"
OpSysLegacy = "LINUX"
OpSysLongName = "CentOS Linux release 7.8.2003 (Core)"
OpSysMajorVer = 7
OpSysName = "CentOS"
OpSysShortName = "CentOS"
OpSysVer = 708
Rank = 0.0
RecentDaemonCoreDutyCycle = 0.0002347472697799002
RecentJobPreemptions = 0
RecentJobRankPreemptions = 0
RecentJobStarts = 0
RecentJobUserPrioPreemptions = 0
Requirements = (START) && (IsValidCheckpointPlatform)
RetirementTimeRemaining = 0
SingularityVersion = "2.6.1-dist"
SlotID = 1
SlotType = "Static"
SlotTypeID = 0
SlotWeight = Cpus
Start = (NODE_IS_HEALTHY =?= true)
StartdIpAddr = "<134.158.123.160:9618?addrs=134.158.123.160-9618&noUDP&sock=11771_8162_3>"
StarterAbilityList = "HasJava,HasJICLocalStdin,HasJICLocalConfig,HasTDP,HasSingularity,HasPerFileEncryption,HasFileTransfer,HasTransferInputRemaps,HasVM,HasReconnect,HasMPI,HasFileTransferPluginMethods,HasJobDeferral,HasSelfCheckpointTransfers"
State = "Unclaimed"
SubnetMask = "255.255.248.0"
TargetType = "Job"
TimeToLive = 2147483647
TotalCondorLoadAvg = 0.0
TotalCpus = 41.0
TotalDisk = 1725145004
TotalLoadAvg = 0.0
TotalMemory = 193500
TotalSlotCpus = 1
TotalSlotDisk = 42076706.0
TotalSlotMemory = 4719
TotalSlots = 41
TotalTimeUnclaimedBenchmarking = 28
TotalTimeUnclaimedIdle = 11076
TotalVirtualMemory = 232405748
UidDomain = "lcg.clermont.in2p3.fr"
Unhibernate = MY.MachineLastMatchTime =!= undefined
UpdateSequenceNumber = 39
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 18
UpdatesTotal = 19
UtsnameMachine = "x86_64"
UtsnameNodename = "clrwn333.in2p3.fr"
UtsnameRelease = "3.10.0-1127.19.1.el7.x86_64"
UtsnameSysname = "Linux"
UtsnameVersion = "#1 SMP Tue Aug 25 17:23:54 UTC 2020"
VirtualMemory = 5668432
WakeOnLanEnabledFlags = "NONE"
WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet"
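
The machine ad above can also be read mechanically. The following is a rough Python sketch, not an HTCondor tool: parse_ad is a hypothetical helper, and AD_TEXT holds only a handful of lines copied from the ad. It checks that an even split of the node over its static slots is consistent with the per-slot Memory and Cpus values, and estimates how many 8-core / 16000 MB requests the whole node could hold if only CPUs and memory are counted (disk and policy are ignored).

```python
# Sketch: parse a few "Attr = value" lines copied from the machine ad
# and derive what the node could offer if it were not split statically.

AD_TEXT = """\
Cpus = 1
Memory = 4719
SlotType = "Static"
TotalCpus = 41.0
TotalMemory = 193500
TotalSlots = 41
"""

def parse_ad(text):
    """Parse 'Attr = value' lines into a dict (numbers become float)."""
    ad = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # skip lines that are not attribute assignments
        value = value.strip()
        try:
            ad[key.strip()] = float(value)
        except ValueError:
            ad[key.strip()] = value.strip('"')
    return ad

ad = parse_ad(AD_TEXT)

# An even split of the node across its static slots is consistent with
# the per-slot values reported in the ad.
per_slot_memory = int(ad["TotalMemory"] // ad["TotalSlots"])
per_slot_cpus = int(ad["TotalCpus"] // ad["TotalSlots"])

# How many 8-core / 16000 MB requests the whole node could hold,
# counting only CPUs and memory (disk and policy are ignored here).
jobs_that_fit = int(min(ad["TotalCpus"] // 8, ad["TotalMemory"] // 16000))

print(per_slot_cpus, per_slot_memory, jobs_that_fit)
```

With the values from this ad, the CPU count rather than the memory is the limiting resource for such requests.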

 
Why doesn't my worker, for example clrwn333.in2p3.fr, meet the requirements for job number 34553? What could be wrong with my Condor configuration?

Any help is welcome.

Best regards
Jean-Claude

------------------------------------------------------------------------
Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr > 
Laboratoire de Physique Clermont
Campus Universitaire des Cézeaux
4 Avenue Blaise Pascal
TSA 60026
CS 60026
63178 Aubière Cedex

Tel : 04 73 40 73 60

-------------------------------------------------------------------------
