
[HTCondor-users] [Condor-users] Evicted jobs in Idle due to RequestDisk Update



Hello,

We are using HTCondor 8.4.4 and are running into a problem when jobs are evicted, either because the execute machine is shut down or because the schedd loses contact with the execute machine and cannot reconnect before the job is rescheduled. In these cases the jobs go idle and stay idle, even though there are execute machines available to take them. We use partitionable slots on our execute machines.

From investigating, the matching appears to fail on the requirement TARGET.Disk >= RequestDisk. The submit file for this job does not specify a RequestDisk; it only specifies the required number of CPUs and the memory. The jobs run on EC2 instances that each have a 1 TB local disk and share an 8 EB EFS volume, and the jobs write to both locations.

My main confusion is that the post-eviction job ad now has DiskUsage = 42500000 and RequestDisk = 42500096, while the slot ad advertises Disk = 42498100 (both ads were taken from a worker that normally runs the job correctly). This is clearly where the requirements are failing, but I have checked the machine and almost the entire 1 TB of its local drive is free, so I don't understand why the slot's Disk is being capped just slightly below what the job needs. I have provided output below:
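One thing I noticed while staring at the numbers: the job ad's RequestDisk is exactly DiskUsage rounded up to the next 1024-KiB boundary. A quick sketch of that arithmetic (this assumes the schedd defaults RequestDisk to DiskUsage and that a quantize-to-1024 step like the default MODIFY_REQUEST_EXPR_REQUESTDISK is applied — I have not verified either against the 8.4.4 source):

```python
# Assumption being illustrated: RequestDisk defaults to DiskUsage (KiB)
# and is then rounded up to a whole number of 1024-KiB blocks, which
# would explain 42500000 -> 42500096 in the post-eviction job ad.

def quantize(value_kib: int, block_kib: int = 1024) -> int:
    """Round value_kib up to the next multiple of block_kib."""
    return -(-value_kib // block_kib) * block_kib

disk_usage = 42_500_000       # DiskUsage from the post-eviction job ad (KiB)
request = quantize(disk_usage)
print(request)                # 42500096 -- matches the job ad's RequestDisk

slot_disk = 42_498_100        # Disk advertised by the dynamic slot (KiB)
print(request - slot_disk)    # 1996 KiB short, so TARGET.Disk >= RequestDisk fails
```

So the rounded-up post-eviction request is roughly 2 MB larger than what the slot advertises, which matches the failed condition in the analysis output below.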

 

condor_q -better-analyze:

 

-- Schedd: htcondorscheduler1.localdomain : <10.122.225.127:38108?...

User priority for condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx is not available, attempting to analyze without it.

---

5521.000:  Run analysis summary.  Of 19 machines,

     16 are rejected by your job's requirements

      0 reject your job because of their own requirements

      0 match and are already running your jobs

      0 match but are serving other users

      0 are available to run your job

                Last successful match: Wed Apr 19 14:15:23 2017

                Last failed match: Wed Apr 19 14:34:43 2017

 

                Reason for last match failure: no match found

 

The Requirements expression for your job is:

 

    ( HAS_DOCKER && HAS_RCP_DFS && target.machine isnt MachineAttrMachine1 &&

      target.machine isnt MachineAttrMachine2 ) &&

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&

    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&

    ( ( TARGET.HasFileTransfer ) ||

      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

 

Your job defines the following attributes:

 

    DiskUsage = 42500000

    FileSystemDomain = "htcondorscheduler1.localdomain"

    RequestDisk = 42500000

    RequestMemory = 20480

 

The Requirements expression for your job reduces to these conditions:

 

         Slots

Step    Matched  Condition

-----  --------  ---------

[0]          18  HAS_DOCKER

[1]          18  HAS_RCP_DFS

[9]          18  TARGET.OpSys == "LINUX"

[11]          8  TARGET.Disk >= RequestDisk

[13]          3  TARGET.Memory >= RequestMemory

[15]         19  TARGET.HasFileTransfer

 

Suggestions:

 

    Condition                         Machines Matched    Suggestion

    ---------                         ----------------    ----------

1   HAS_DOCKER                        0                   REMOVE

2   HAS_RCP_DFS                       0                   REMOVE

3   ( TARGET.Memory >= 20480 )        3                   

4   ( TARGET.Disk >= 42500000 )       8                   

5   ( TARGET.OpSys == "LINUX" )       18                  

6   target.machine isnt MachineAttrMachine1  19

7   target.machine isnt MachineAttrMachine2  19

8   ( TARGET.Arch == "X86_64" )       19                  

9   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "htcondorscheduler1.localdomain" ) )

 

 

Job_ad (From the worker machine):

Arguments = "426465790066.dkr.ecr.us-east-1.amazonaws.com/ai-terrestrial_pipeline_all:4.1.17 terrestrial-stage_1400 --workflow=workflow.json"

AutoClusterAttrs = "ConcurrencyLimits,NiceUser,Rank,Requirements,_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,JobUniverse,LastCheckpointPlatform,NumCkpts,RequestCpus,RequestDisk,RequestMemory,MachineLastMatchTime,DiskUsage,FileSystemDomain"

AutoClusterId = 13

BufferBlockSize = 32768

BufferSize = 524288

BytesRecvd = 9386.0

BytesSent = 0.0

ClusterId = 5521

Cmd = "/usr/local/bin/condor-docker"

CommittedSlotTime = 0

CommittedSuspensionTime = 0

CommittedTime = 0

CompletionDate = 0

CondorPlatform = "$CondorPlatform: x86_64_Ubuntu14 $"

CondorVersion = "$CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $"

CoreSize = 0

CpusProvisioned = 36

CumulativeSlotTime = 492.0

CumulativeSuspensionTime = 0

CurrentHosts = 0

DAGManJobId = 5519

DAGManNodesLog = "/disk-root/condor/execute/HT028_1407138585/260/./HT028_1407138585.dagman.nodes.log"

DAGManNodesMask = "0,1,2,4,5,7,9,10,11,12,13,16,17,24,27"

DAGNodeName = "stage_1400"

DAGParentNodeNames = "stage_1100"

DiskProvisioned = 1073184440

DiskUsage = 42500000

DiskUsage_RAW = 40584772

EncryptExecuteDirectory = false

EnteredCurrentStatus = 1492611815

Environment = ""

Err = "job.stderr.5521"

ExecutableSize = 0

ExecutableSize_RAW = 0

ExitBySignal = false

ExitStatus = 0

FileSystemDomain = "htcondorscheduler1.localdomain"

GlobalJobId = "htcondorscheduler1.localdomain#5521.0#1492611317"

ImageSize = 750000

ImageSize_RAW = 623124

In = "/dev/null"

Iwd = "/disk-root/condor/execute/HT028_1407138585/260/stage_1400"

JobCurrentStartDate = 1492611323

JobCurrentStartExecutingDate = 1492611323

JobLeaseDuration = 600

JobMachineAttrs = "Machine"

JobMachineAttrsHistoryLength = 5

JobNotification = 0

JobPrio = 1400

JobRunCount = 1

JobStartDate = 1492611323

JobStatus = 1

JobUniverse = 5

KeepClaimIdle = 20

LastJobLeaseRenewal = 1492611814

LastJobStatus = 2

LastMatchTime = 1492611323

LastPublicClaimId = "<10.122.226.188:58203>#1492611105#1#..."

LastRejMatchReason = "no match found "

LastRejMatchTime = 1492612483

LastRemoteHost = "slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

LastSuspensionTime = 0

LastVacateTime = 1492611815

LeaveJobInQueue = false

LocalSysCpu = 0.0

LocalUserCpu = 0.0

MachineAttrCpus0 = 1

MachineAttrMachine0 = "ip-10-122-226-188.localdomain"

MachineAttrSlotWeight0 = 1

MaxHosts = 1

MemoryProvisioned = 60387

MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )

MinHosts = 1

MyType = "Job"

NiceUser = false

NumCkpts = 0

NumCkpts_RAW = 0

NumJobMatches = 1

NumJobStarts = 1

NumRestarts = 0

NumShadowStarts = 1

NumSystemHolds = 0


OrigMaxHosts = 1

Out = "job.stdout.5521"

Owner = "condor"

PeriodicHold = false

PeriodicRelease = false

PeriodicRemove = ( ( JobStatus == 5 ) && ( CurrentTime - EnteredCurrentStatus ) > 300 )

ProcId = 0

ProvisionedResources = "Cpus Memory Disk Swap"

QDate = 1492611317

Rank = 0.0

RemoteAutoregroup = false

RemoteNegotiatingGroup = "<none>"

RemoteSysCpu = 0.0

RemoteUserCpu = 0.0

RemoteWallClockTime = 492.0

RequestCpus = 1

RequestDisk = 42500096

RequestMemory = 20480

Requirements = ( HAS_DOCKER && HAS_RCP_DFS && target.machine =!= MachineAttrMachine1 && target.machine =!= MachineAttrMachine2 ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

ResidentSetSize = 12500

ResidentSetSize_RAW = 12192

RootDir = "/"

ServerTime = 1492612963

ShouldTransferFiles = "IF_NEEDED"

StartdPrincipal = "execute-side@matchsession/10.122.226.188"

StartdSendsAlives = true

StreamErr = false

StreamOut = false

SubmitEventNotes = "DAG Node: stage_1400"

TargetType = "Machine"

TotalSuspensions = 0

TransferExecutable = false

TransferIn = false

TransferInput = "../workflow.json"

TransferInputSizeMB = 0

TransferOutput = "OUT"

User = "condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

UserLog = "/disk-root/condor/execute/HT028_1407138585/260/stage_1400/job.log"

WantCheckpoint = false

WantRemoteIO = true

WantRemoteSyscalls = false

WhenToTransferOutput = "ON_EXIT"

_condor_SEND_LEFTOVERS = false

_condor_SEND_PAIRED_SLOT = true

_condor_StartdHandlesAlives = true

 

Slot_ad (From the worker machine):

Activity = "Idle"

AddressV1 = "{[ p=\"primary\"; a=\"10.122.225.241\"; port=37559; n=\"Internet\"; ], [ p=\"IPv4\"; a=\"10.122.225.241\"; port=37559; n=\"Internet\"; ]}"

Arch = "X86_64"

CLAIM_WORKLIFE = 1200

COLLECTOR_HOST_STRING = "10.122.225.105"

CONTINUE = true

CanHibernate = true

CheckpointPlatform = "LINUX X86_64 3.13.0-91-generic normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"

ClockDay = 3

ClockMin = 882

CondorLoadAvg = 0.0

CondorPlatform = "$CondorPlatform: x86_64_Ubuntu14 $"

CondorVersion = "$CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $"

ConsoleIdle = 359

CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.5 )

CpuBusyTime = 0

CpuIsBusy = false

Cpus = 1

CurrentRank = 0.0

DaemonCoreDutyCycle = -0.1805678648829709

DetectedCpus = 36

DetectedMemory = 60387

Disk = 42498100

DynamicSlot = true

EnteredCurrentActivity = 1492612964

EnteredCurrentState = 1492612964

ExpectedMachineGracefulDrainingBadput = 0

ExpectedMachineGracefulDrainingCompletion = 1492612605

ExpectedMachineQuickDrainingBadput = 0

ExpectedMachineQuickDrainingCompletion = 1492612605

FileSystemDomain = "ip-10-122-225-241.localdomain"

HAS_AWS = true

HAS_DOCKER = true

HAS_RCP_DFS = true

HardwareAddress = "12:fc:4f:64:cc:26"

HasCheckpointing = true

HasEncryptExecuteDirectory = true

HasFileTransfer = true

HasFileTransferPluginMethods = "file,ftp,http,data"

HasIOProxy = true

HasJICLocalConfig = true

HasJICLocalStdin = true

HasJobDeferral = true

HasMPI = true

HasPerFileEncryption = true

HasReconnect = true

HasRemoteSyscalls = true

HasTDP = true

HasVM = false

HibernationLevel = 0

HibernationState = "NONE"

HibernationSupportedStates = "S3,S4,S5"

IsLocalStartd = false

IsOwner = ( START =?= false )

IsValidCheckpointPlatform = ( TARGET.JobUniverse =!= 1 || ( ( MY.CheckpointPlatform =!= undefined ) && ( ( TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )

IsWakeAble = false

IsWakeOnLanEnabled = false

IsWakeOnLanSupported = false

JobPreemptions = 0

JobRankPreemptions = 0

JobStarts = 0

JobUserPrioPreemptions = 0

KFlops = 1750755

KILL = false

KeyboardIdle = 359

LastBenchmark = 1492612631

LastFetchWorkCompleted = 0

LastFetchWorkSpawned = 0

LastUpdate = 1492612631

LoadAvg = 0.0

Machine = "ip-10-122-225-241.localdomain"

MachineMaxVacateTime = 10 * 60

MachineResources = "Cpus Memory Disk Swap"

MaxJobRetirementTime = 0

Memory = 20480

Mips = 24337

MonitorSelfAge = 241

MonitorSelfCPUUsage = 0.008310156277141429

MonitorSelfImageSize = 45312

MonitorSelfRegisteredSocketCount = 1

MonitorSelfResidentSetSize = 6212

MonitorSelfSecuritySessions = 3

MonitorSelfTime = 1492612845

MyAddress = "<10.122.225.241:37559?addrs=10.122.225.241-37559>"

MyCurrentTime = 1492612964

MyType = "Machine"

Name = "slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

NextFetchWorkDelay = -1

NumPids = 0

OpSys = "LINUX"

OpSysAndVer = "Ubuntu14"

OpSysLegacy = "LINUX"

OpSysLongName = "Ubuntu 14.04.4 LTS"

OpSysMajorVer = 14

OpSysName = "Ubuntu"

OpSysShortName = "Ubuntu"

OpSysVer = 1404

PERIODIC_CHECKPOINT = ( ( time() - LastPeriodicCheckpoint ) / 60.0 ) > ( 180.0 + -7 )

PREEMPT = ( false ) || ( TotalDisk < 1000000 )

ParentSlotId = 1

PrivateNetworkName = "ip-10-122-225-241.localdomain"

PslotRollupInformation = true

Rank = 0.0
