[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] ShadowLog not accessible for Shadows, jobs not starting/not ending up in the right slot for their Shadow



just another update:

we have been eyeballed a node when/if its resources/slot get considered
for a job.
For ~45m the node's master/startd daemons are up after a reload [1.a,1.b].
Since then on the Negotiator/Collector not much happened (both in full
debug). The node has not appeared in the NegotiatorLog [2]; according to
the CollectorLog the collector noticed the node disappearing, when the
local master got restarted, and reporting abck again shortly afterwards
[3] - since then not much happened except for occasional Stard class ad
updates.

At the same time there are still about 3k idle jobs waiting to be
considered by the matchmaker - now a number of them being idle for ~100m.

Our current impression is, that the jobs and nodes both announce their
wishes/offers, but both are never really get matched to each other (in a
negotiation cycle) [4].
We have been trying to saturate the cluster with jobs running under the
DESY accounting group, which has a nominally small share but with
surplus allowed should be able to get quite a number of unallocated slots
  GROUP_ACCEPT_SURPLUS = TRUE
but we are not sure, if/how the nagotiator actually brokers the jobs to
the available resources [4.b]

Cheers,
  Thomas



[1.a]
[root@batch0350 /var/log/condor]# tail -f MasterLog
02/05/20 14:45:12 SharedPortEndpoint: did not successfully find
SharedPortServer address. Will retry in 60s.
02/05/20 14:45:12 DaemonCore: private command socket at
<131.169.161.229:0?sock=25772_06df>
02/05/20 14:45:12 Adding SHARED_PORT to DAEMON_LIST, because
USE_SHARED_PORT=true (to disable this, set
AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
02/05/20 14:45:12 Master restart (GRACEFUL) is watching
/usr/sbin/condor_master (mtime:1577238779)
02/05/20 14:45:12 Collector port not defined, will use default: 9618
02/05/20 14:45:12 Started DaemonCore process
"/usr/libexec/condor/condor_shared_port", pid and pgroup = 25804
02/05/20 14:45:12 Waiting for /var/lock/condor/shared_port_ad to appear.
02/05/20 14:45:13 Found /var/lock/condor/shared_port_ad.
02/05/20 14:45:13 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 25808
02/05/20 14:45:23 Setting ready state 'Ready' for STARTD

[1.b]
[root@batch0350 /var/log/condor]# tail -n 20 -f StartLog
02/05/20 14:45:20 Running: /usr/bin/docker container prune -f
--filter=label=org.htcondorproject=True
02/05/20 14:45:20 Failed to read results from '/usr/bin/docker container
prune -f --filter=label=org.htcondorproject=True': 'No such file or
directory' (2)
02/05/20 14:45:20 my_popenv: Failed to exec in child, errno=2 (No such
file or directory)
02/05/20 14:45:20 my_popenv failed
02/05/20 14:45:20 Unable to calculate keyboard/mouse idle time due to
them both being USB or not present, assuming infinite idle time for
these devices.
02/05/20 14:45:20 CronJobList: Adding job 'NODEHEALTH'
02/05/20 14:45:20 CronJob: Initializing job 'NODEHEALTH'
(/etc/condor/tests/healthcheck_wn_condor.sh)
02/05/20 14:45:20 CronJobList: Adding job 'kflops'
02/05/20 14:45:20 CronJobList: Adding job 'mips'
02/05/20 14:45:20 CronJob: Initializing job 'kflops'
(/usr/libexec/condor/condor_kflops)
02/05/20 14:45:20 CronJob: Initializing job 'mips'
(/usr/libexec/condor/condor_mips)
02/05/20 14:45:20 slot1: State change: IS_OWNER is false
02/05/20 14:45:20 slot1: Changing state: Owner -> Unclaimed
02/05/20 14:45:20 State change: RunBenchmarks is TRUE
02/05/20 14:45:20 slot1: Changing activity: Idle -> Benchmarking
02/05/20 14:45:20 BenchMgr:StartBenchmarks()
02/05/20 14:45:23 Initial update sent to collector(s)
02/05/20 14:45:23 Sending DC_SET_READY message to master
<131.169.161.229:9620?addrs=131.169.161.229-9620&noUDP&sock=25772_06df>
02/05/20 14:45:47 State change: benchmarks completed
02/05/20 14:45:47 slot1: Changing activity: Benchmarking -> Idle

[2]
[root@condor01 condor]# condor_status -negotiator
Name             LastCycleEnd (Sec)   Slots Submitrs Schedds    Jobs
Matches Rejections

condor01.desy.de   2/5  15:40     2    2984        1       1       3
   5          1

[root@condor01 condor]# condor_status -collector
Name                                                 Machine
                              RunningJobs IdleJobs HostsTotal

DESY Condor Pool - condor01.desy.de@xxxxxxxxxxxxxxxx condor01.desy.de
                                     2613     3690       2994

[root@condor01 condor]# grep 131.169.161.229 NegotiatorLog | grep
"02/05/20" | tail
[root@condor01 condor]# echo $?
0

[3]
[root@condor01 condor]# grep 131.169.161.229 CollectorLog | grep
"02/05/20" | tail -n 30
02/05/20 13:39:59 		**** Removed(1) ad(s): "< slot1_8@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 13:39:59 		**** Removed(1) ad(s): "< slot1_8@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 13:39:59 		**** Removed(1) ad(s): "< slot1_2@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 13:39:59 		**** Removed(1) ad(s): "< slot1_2@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 13:39:59 		**** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx ,
131.169.161.229 >"
02/05/20 13:39:59 		**** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx ,
131.169.161.229 >"
02/05/20 13:39:59 DaemonCore: Can't receive command request from
131.169.161.229 (perhaps a timeout?)
02/05/20 13:41:59 DaemonCore: Can't receive command request from
131.169.161.229 (perhaps a timeout?)
02/05/20 13:42:11 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 13:42:11 StartdPvtAd  : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 14:45:09 		**** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx ,
131.169.161.229 >"
02/05/20 14:45:09 		**** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxx ,
131.169.161.229 >"
02/05/20 14:45:09 DaemonCore: Can't receive command request from
131.169.161.229 (perhaps a timeout?)
02/05/20 14:45:09 DaemonCore: Can't receive command request from
131.169.161.229 (perhaps a timeout?)
02/05/20 14:45:23 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 14:45:23 StartdPvtAd  : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:00:23 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:00:23 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:05:23 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:05:23 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:10:24 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:10:24 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:15:23 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:15:23 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:20:23 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:20:23 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:25:23 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:25:23 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:30:23 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"
02/05/20 15:30:23 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxx
, 131.169.161.229 >"

[4]
Negotiator and Collector Logs as compressed tarball under
https://desycloud.desy.de/index.php/s/ja7q2g5rcyqn3GC/download

[4.b]
02/05/20 15:47:56 Group group_DESY - BEGIN NEGOTIATION
02/05/20 15:46:57 NEGOTIATOR_STRICT_ENFORCE_QUOTA is true, current
proposed allocation for group_DESY is 3656
...
02/05/20 15:47:56 Round 1 totals: allocated= 6169  usage= 6168
...
02/05/20 15:47:56 group quotas: allocate-surplus (4): group group_DESY
allocated surplus= 2184.66  allocated= 3655  requested= 0
...
02/05/20 15:47:56 group quotas: groups= 9  requesting= 5  served= 5
unserved= 0  slots= 7270  requested= 6168  allocated= 6168  surplus=
10888  maxdelta= 0
AcceptedWhileDraining = false
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"131.169.161.229\"; port=9620; n=\"Internet\"; spid=\"25772_06df_3\"; noUDP=true; ], [ p=\"IPv4\"; a=\"131.169.161.229\"; port=9620; n=\"Internet\"; spid=\"25772_06df_3\"; noUDP=true; ]}"
Arch = "X86_64"
AuthenticatedIdentity = "unauthenticated@unmapped"
AuthenticationMethod = undefined
CanHibernate = true
CheckpointPlatform = "LINUX X86_64 3.10.0-1062.7.1.el7.x86_64 normal N/A avx ssse3 sse4_1 sse4_2"
ChildAccountingGroup = {  }
ChildActivity = {  }
ChildCpus = {  }
ChildCurrentRank = {  }
ChildDisk = {  }
ChildEnteredCurrentState = {  }
ChildMemory = {  }
ChildName = {  }
ChildRemoteOwner = {  }
ChildRemoteUser = {  }
ChildRetirementTimeRemaining = {  }
ChildState = {  }
ClockDay = 3
ClockMin = 930
COLLECTOR_HOST_STRING = "condor01.desy.de:9618"
CondorLoadAvg = 0.0
CondorPlatform = "$CondorPlatform: x86_64_RedHat7 $"
CondorVersion = "$CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $"
ConsoleIdle = 5332576
ConsumptionCpus = quantize(target.RequestCpus,{ 1 })
ConsumptionDisk = quantize(target.RequestDisk,{ 1024 })
ConsumptionMemory = quantize(target.RequestMemory,{ 128 })
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.5)
CpuBusyTime = 0
CpuCacheSize = 2048
CpuFamily = 21
CpuIsBusy = false
CpuModelNumber = 2
Cpus = 64
CurrentRank = 0.0
DaemonCoreDutyCycle = 0.0001180435969121429
DaemonLastReconfigTime = 1580910313
DaemonStartTime = 1580910313
DetectedCpus = 64
DetectedMemory = 257757
DEV_RESOURCE = false
Disk = 958636676
EnteredCurrentActivity = 1580910347
EnteredCurrentState = 1580910320
ExpectedMachineGracefulDrainingBadput = 0
ExpectedMachineGracefulDrainingCompletion = 1580910320
ExpectedMachineQuickDrainingBadput = 0
ExpectedMachineQuickDrainingCompletion = 1580910320
FileSystemDomain = "desy.de"
GRID_RESOURCE = true
HardwareAddress = "00:8c:fa:56:d7:64"
has_avx = true
has_sse4_1 = true
has_sse4_2 = true
has_ssse3 = true
HasFileTransfer = true
HasFileTransferPluginMethods = "file,ftp,http,data,https"
HasIOProxy = true
HasJava = true
HasJICLocalConfig = true
HasJICLocalStdin = true
HasJobDeferral = true
HasMPI = true
HasPerFileEncryption = true
HasReconnect = true
HasSelfCheckpointTransfers = true
HasSingularity = true
HasTDP = true
HasTransferInputRemaps = true
HasVM = false
HibernationLevel = 0
HibernationState = "NONE"
HibernationSupportedStates = "S3,S4"
IsLocalStartd = false
IsValidCheckpointPlatform = (TARGET.JobUniverse =!= 1 || ((MY.CheckpointPlatform =!= undefined) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
IsWakeAble = true
IsWakeOnLanEnabled = true
IsWakeOnLanSupported = true
JavaMFlops = 881.762756
JavaSpecificationVersion = "1.8"
JavaVendor = "Oracle Corporation"
JavaVersion = "1.8.0_242"
JobPreemptions = 0
JobRankPreemptions = 0
JobStarts = 0
JobUserPrioPreemptions = 0
KeyboardIdle = 44
KFlops = 998883
LastBenchmark = 1580910347
LastFetchWorkCompleted = 0
LastFetchWorkSpawned = 0
LastHeardFrom = 1580913023
LastUpdate = 1580910347
LoadAvg = 0.3
Machine = "batch0350.desy.de"
MachineMaxVacateTime = 1800
MachineResources = "Cpus Memory Disk Swap"
MaxJobRetirementTime = 259200
Memory = 257757
Mips = 17744
MonitorSelfAge = 2648
MonitorSelfCPUUsage = 0.0
MonitorSelfImageSize = 67400
MonitorSelfRegisteredSocketCount = 0
MonitorSelfResidentSetSize = 7312
MonitorSelfSecuritySessions = 67
MonitorSelfTime = 1580912960
MyAddress = "<131.169.161.229:9620?addrs=131.169.161.229-9620&noUDP&sock=25772_06df_3>"
MyCurrentTime = 1580913023
MyType = "Machine"
NAF_RESOURCE = false
Name = "slot1@xxxxxxxxxxxxxxxxx"
NextFetchWorkDelay = -1
NODE_IS_HEALTHY = true
NODE_STATUS = "All_OK"
NumClaims = 64
NumDynamicSlots = 0
NumPids = 0
OLD_RESOURCE = false
OnlyMulticore = false
OpSys = "LINUX"
OpSysAndVer = "CentOS7"
OpSysLegacy = "LINUX"
OpSysLongName = "CentOS Linux release 7.7.1908 (Core)"
OpSysMajorVer = 7
OpSysName = "CentOS"
OpSysShortName = "CentOS"
OpSysVer = 707
PartitionableSlot = true
PslotRollupInformation = true
Rank = 0.0
RecentDaemonCoreDutyCycle = 0.0001122386815403109
RecentJobPreemptions = 0
RecentJobRankPreemptions = 0
RecentJobStarts = 0
RecentJobUserPrioPreemptions = 0
Requirements = (START) && (IsValidCheckpointPlatform) && (WithinResourceLimits)
RetirementTimeRemaining = 0
SingularityVersion = "singularity version 3.5.2-1.1.el7"
SlotID = 1
SlotType = "Partitionable"
SlotTypeID = 1
SlotWeight = Cpus
Start = (NODE_IS_HEALTHY =?= true) && (StartJobs =?= true)
StartdIpAddr = "<131.169.161.229:9620?addrs=131.169.161.229-9620&noUDP&sock=25772_06df_3>"
StarterAbilityList = "HasJava,HasJICLocalStdin,HasJICLocalConfig,HasTDP,HasSingularity,HasPerFileEncryption,HasFileTransfer,HasTransferInputRemaps,HasVM,HasReconnect,HasMPI,HasFileTransferPluginMethods,HasJobDeferral,HasSelfCheckpointTransfers"
StartJobs = true
State = "Unclaimed"
SubnetMask = "255.255.248.0"
TargetType = "Job"
TEST_RESOURCE = true
TimeToLive = 2147483647
TotalCondorLoadAvg = 0.0
TotalCpus = 64.0
TotalDisk = 958636676
TotalLoadAvg = 0.3
TotalMemory = 257757
TotalSlotCpus = 64
TotalSlotDisk = 958636676.0
TotalSlotMemory = 257757
TotalSlots = 1
TotalTimeUnclaimedBenchmarking = 27
TotalTimeUnclaimedIdle = 2676
TotalVirtualMemory = 272064580
UidDomain = "desy.de"
Unhibernate = MY.MachineLastMatchTime =!= undefined
UpdateSequenceNumber = 11
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 54
UpdatesTotal = 57
UtsnameMachine = "x86_64"
UtsnameNodename = "batch0350.desy.de"
UtsnameRelease = "3.10.0-1062.7.1.el7.x86_64"
UtsnameSysname = "Linux"
UtsnameVersion = "#1 SMP Mon Dec 2 17:33:29 UTC 2019"
VirtualMemory = 272064580
WakeOnLanEnabledFlags = "Magic Packet"
WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet"
WithinResourceLimits = (ifThenElse(TARGET._cp_orig_RequestCpus =!= undefined,TARGET.RequestCpus <= MY.Cpus,MY.ConsumptionCpus <= MY.Cpus) && ifThenElse(TARGET._cp_orig_RequestDisk =!= undefined,TARGET.RequestDisk <= MY.Disk,MY.ConsumptionDisk <= MY.Disk) && ifThenElse(TARGET._cp_orig_RequestMemory =!= undefined,TARGET.RequestMemory <= MY.Memory,MY.ConsumptionMemory <= MY.Memory))

Attachment: binNjRtg1gseF.bin
Description: Binary data

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature