
Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds



Hi Stefano,

 

After a week or so of head scratching, I'm still none the wiser. However, I can confirm one thing: all the affected jobs have JobStatus = 4 (Completed).

 

I attach the detailed output of one of these jobs for your perusal. Of interest is the PeriodicRemove ClassAd attribute:

 

PeriodicRemove = ( ( RemoteUserCpu + RemoteSysCpu > JobCpuLimit ) ?: false ) || ( ( RemoteWallClockTime > JobTimeLimit ) ?: false )
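For readers following along, here is a rough Python sketch of how the expression above evaluates for the attached job. It assumes that the ClassAd `?:` operator returns its left side when that is defined and the right side otherwise, and it models an undefined value as `None`. The helper names are made up for illustration; the numbers are copied from the attached job ad.

```python
from typing import Optional

def add(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Sum that stays undefined (None) if either operand is undefined."""
    if a is None or b is None:
        return None
    return a + b

def gt(lhs: Optional[float], rhs: Optional[float]) -> Optional[bool]:
    """Comparison that stays undefined (None) if either operand is undefined."""
    if lhs is None or rhs is None:
        return None
    return lhs > rhs

def elvis(value: Optional[bool], default: bool) -> bool:
    """Assumed model of ClassAd `value ?: default`."""
    return default if value is None else value

def periodic_remove(ad: dict) -> bool:
    """Mirror of the PeriodicRemove expression from the job ad."""
    cpu_over = gt(add(ad.get("RemoteUserCpu"), ad.get("RemoteSysCpu")),
                  ad.get("JobCpuLimit"))
    wall_over = gt(ad.get("RemoteWallClockTime"), ad.get("JobTimeLimit"))
    return elvis(cpu_over, False) or elvis(wall_over, False)

# Values taken from the attached job ad (1667156.0).
job = {
    "RemoteUserCpu": 0.0,
    "RemoteSysCpu": 0.0,
    "JobCpuLimit": 345600,
    "RemoteWallClockTime": 44794.0,
    "JobTimeLimit": 345600,
}

print(periodic_remove(job))
```

Under these assumptions the expression is false for this job with its final attribute values, so the limits themselves do not appear to have been exceeded.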

 

Your help so far has been invaluable, and any further suggestions will be gratefully received. If there are any further debugging techniques I can try, please do let me know as well! To say this has me stumped would be an understatement!

 

Many thanks,

 

Tom Birkett

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Date: Monday, 6 September 2021 at 17:27
To: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

Hi Stefano,

 

Thank you for all your help. I will continue my investigation and report back with what I find!

 

Many thanks,

 

Tom Birkett

 

From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Organisation: INFN-CNAF
Date: Friday, 3 September 2021 at 16:22
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>, "Birkett, Thomas (STFC,RAL,SC)" <thomas.birkett@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

 

Hi Thomas,
well, that means your accounting data are kept in the CumulativeRemoteSysCpu and CumulativeRemoteUserCpu
job ClassAd attributes, and for some reason (unknown to me) the ones your accounting considers
(RemoteSysCpu, RemoteUserCpu) are sometimes zero.

If your accounting tool is APEL, it should consider this set of job ClassAd attributes:
GlobalJobId Owner RemoteWallClockTime RemoteUserCpu RemoteSysCpu JobStartDate EnteredCurrentStatus ResidentSetSize_RAW ImageSize_RAW RequestCpus

We keep CumulativeRemoteSysCpu and CumulativeRemoteUserCpu for CPU accounting instead,
so we would not observe your problem, but even then, I cannot find jobs having
RemoteUserCpu =!= CumulativeRemoteUserCpu here.
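As an illustration of the difference between the two attribute sets, here is a small Python sketch that computes CPU efficiency for one job. The fallback rule (use the Cumulative* values when the Remote* values sum to zero) is only an assumption for the example, not what APEL or any accounting tool is known to do, and the function names are hypothetical. The numbers come from job 1667156.0, whose full ad appears later in this thread.

```python
def cpu_seconds(ad: dict) -> float:
    """CPU seconds for a job, falling back to the Cumulative* attributes
    when RemoteUserCpu + RemoteSysCpu is zero (assumed policy)."""
    remote = ad.get("RemoteUserCpu", 0.0) + ad.get("RemoteSysCpu", 0.0)
    cumulative = (ad.get("CumulativeRemoteUserCpu", 0.0)
                  + ad.get("CumulativeRemoteSysCpu", 0.0))
    return remote if remote > 0 else cumulative

def cpu_efficiency(ad: dict) -> float:
    """CPU time divided by wall-clock time times requested cores."""
    wall = ad.get("RemoteWallClockTime", 0.0)
    cores = ad.get("RequestCpus", 1)
    if wall <= 0 or cores <= 0:
        return 0.0
    return cpu_seconds(ad) / (wall * cores)

# Values copied from the job ad for 1667156.0.
job = {
    "RemoteUserCpu": 0.0,
    "RemoteSysCpu": 0.0,
    "CumulativeRemoteUserCpu": 41520.0,
    "CumulativeRemoteSysCpu": 1980.0,
    "RemoteWallClockTime": 44794.0,
    "RequestCpus": 1,
}

print(cpu_seconds(job))               # 43500.0
print(round(cpu_efficiency(job), 3))  # 0.971
```

With only RemoteUserCpu and RemoteSysCpu, the same job would be accounted at zero efficiency.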

Somehow these jobs have RemoteUserCpu reset (as if they were about to restart somewhere else?).
Try adding JobStatus to the previous condor_history query to see whether it is always 4, always 3, or a mix.
Also try inspecting the full job ClassAd for a few of these jobs:

condor_history -lim 1 -l 1470220.0

and look for holdreason, lastholdreason, *remove* or similar attributes: maybe you will catch a hint as to why this happens.
Also check whether SYSTEM_PERIODIC_HOLD or SYSTEM_PERIODIC_REMOVE might be involved.
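The inspection suggested above can be scripted. The following Python sketch parses text in the `Name = Value` layout printed by `condor_history -l` and keeps only attributes whose names mention "hold" or "remove". The function names are hypothetical, and the sample lines are copied from the job ad posted later in this thread; in practice the text would be the saved output of the command.

```python
import re

NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def parse_long_ad(text: str) -> dict:
    """Parse 'Name = Value' lines into a dict of raw strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        name = name.strip()
        # Skip blank lines and anything that is not an attribute assignment.
        if sep and NAME_RE.fullmatch(name):
            ad[name] = value.strip()
    return ad

def hold_remove_attrs(ad: dict) -> dict:
    """Keep attributes whose name contains 'hold' or 'remove' (any case)."""
    pattern = re.compile(r"hold|remove", re.IGNORECASE)
    return {name: value for name, value in ad.items() if pattern.search(name)}

# A few lines copied from the job ad in this thread.
SAMPLE = """\
JobStatus = 4
PeriodicHold = false
OnExitRemove = true
NumSystemHolds = 0
RemoteUserCpu = 0.0
"""

for name, value in hold_remove_attrs(parse_long_ad(SAMPLE)).items():
    print(name, "=", value)
```

If attributes such as HoldReason or LastHoldReason were present in a job ad, they would show up in this filtered view.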

Good luck :)

Stefano

On 03/09/21 16:14, Thomas Birkett - STFC UKRI wrote:

Hi Stefano,

 

Thank you for the rapid response. I do indeed get results. Running this against one of our CEs, we get the following:

 

JobId      Owner      RemoteSysCpu  CumulativeRemoteSysCpu  RemoteUserCpu  CumulativeRemoteUserCpu
1470220.0  hyperk046  0.0           25.0                    0.0            681.0
1467782.0  tatls002   0.0           362.0                   0.0            19160.0
1465443.0  alicesgm   0.0           173.0                   0.0            16853.0
1470193.0  hyperk046  0.0           28.0                    0.0            843.0
1467760.0  tatls002   0.0           379.0                   0.0            9846.0
1470156.0  hyperk046  0.0           27.0                    0.0            823.0
1467678.0  tatls002   0.0           212.0                   0.0            10323.0
1466269.0  patls036   0.0           1293.0                  0.0            49790.0
1470209.0  hyperk046  0.0           28.0                    0.0            840.0
1428889.0  tlhcb005   0.0           6286.0                  0.0            172552.0
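To make the pattern in these rows explicit, here is a short Python sketch that parses them and flags jobs whose Remote* counters are zero while the Cumulative* counters are not. The column order is assumed to follow the `-af:j` arguments of the query quoted further down (job id, Owner, RemoteSysCpu, CumulativeRemoteSysCpu, RemoteUserCpu, CumulativeRemoteUserCpu); the function names are illustrative only.

```python
ROWS = """\
1470220.0 hyperk046 0.0 25.0 0.0 681.0
1467782.0 tatls002 0.0 362.0 0.0 19160.0
1465443.0 alicesgm 0.0 173.0 0.0 16853.0
1470193.0 hyperk046 0.0 28.0 0.0 843.0
1467760.0 tatls002 0.0 379.0 0.0 9846.0
1470156.0 hyperk046 0.0 27.0 0.0 823.0
1467678.0 tatls002 0.0 212.0 0.0 10323.0
1466269.0 patls036 0.0 1293.0 0.0 49790.0
1470209.0 hyperk046 0.0 28.0 0.0 840.0
1428889.0 tlhcb005 0.0 6286.0 0.0 172552.0
"""

FIELDS = ("RemoteSysCpu", "CumulativeRemoteSysCpu",
          "RemoteUserCpu", "CumulativeRemoteUserCpu")

def parse_rows(text: str) -> list:
    """Turn whitespace-separated rows into dicts (assumed column order)."""
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # ignore blank or malformed lines
        job = {"JobId": parts[0], "Owner": parts[1]}
        job.update({name: float(v) for name, v in zip(FIELDS, parts[2:])})
        jobs.append(job)
    return jobs

def lost_cpu(job: dict) -> bool:
    """True when Remote* are both zero but some Cumulative* time exists."""
    remote = job["RemoteSysCpu"] + job["RemoteUserCpu"]
    cumulative = job["CumulativeRemoteSysCpu"] + job["CumulativeRemoteUserCpu"]
    return remote == 0.0 and cumulative > 0.0

jobs = parse_rows(ROWS)
flagged = [j["JobId"] for j in jobs if lost_cpu(j)]
print(len(flagged), "of", len(jobs), "jobs have zero Remote* CPU")
```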

 

Many thanks,

 

Tom Birkett

 

From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Organisation: INFN-CNAF
Reply to: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Friday, 3 September 2021 at 15:08
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>, "Birkett, Thomas (STFC,RAL,SC)" <thomas.birkett@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

 

CORRIGE:

condor_history -lim 10 -cons 'jobstatus == 4 && ((RemoteSysCpu =!= CumulativeRemoteSysCpu) || (RemoteUserCpu =!= CumulativeRemoteUserCpu)) ' -af:j Owner RemoteSysCpu CumulativeRemoteSysCpu RemoteUserCpu CumulativeRemoteUserCpu

On 03/09/21 16:04, Stefano Dal Pra wrote:

Hello,
Do you find any result with a search like the following?

condor_history -lim -cons 'jobstatus == 4 && ((RemoteSysCpu =!= CumulativeRemoteSysCpu) || (RemoteUserCpu =!= CumulativeRemoteUserCpu)) ' -af:j Owner RemoteSysCpu CumulativeRemoteSysCpu RemoteUserCpu CumulativeRemoteUserCpu

Stefano


On 03/09/21 11:56, Thomas Birkett - STFC UKRI wrote:

Dear HTCondor-users,

 

I hope you are all keeping well. At RAL we appear to have an issue with our Condor jobs reporting incorrect RemoteUserCpu and RemoteSysCpu values. We are currently seeing jobs complete with a value of zero for both of these ClassAd attributes. The issue appeared after we upgraded our workernodes to Condor 8.8.12 from 8.6.13. We changed no other configuration during the upgrade process.

 

Currently this issue appears to affect about 70% of jobs each month, according to the accounting DB on our NorduGrid ARC-CEs, and it causes an incorrect monthly efficiency value to be calculated.

 

From a Condor perspective, what could be causing this after the Condor version change? I attach a dump of the condor_config_val output from one of our workernodes for your perusal. Any help will be gratefully received.

 

Versions:

  1. Condor Central Managers: 8.8.15
  2. NorduGrid ARC-CE’s: 8.6.13
  3. Workernodes: 8.8.15

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 


 

 

This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. 



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
 
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/





 

condor_history -lim 1 -l 1667156.0
BytesSent = 1947766.0
JobFinishedHookDone = 1631627329
MATCH_EXP_MachineRalScaling = "3.474179687500000E+00"
MATCH_EXP_MachineContainerImageName = ""
MATCH_EXP_MachineMesosTaskId = ""
DiskUsage_RAW = 148299
MATCH_EXP_MachineMesosContainerName = ""
Requirements = ( TARGET.HasDocker ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer ) && ( x509UserProxyVOName =?= "atlas" && NumJobStarts == 0 || x509UserProxyVOName =!= "atlas" )
SpooledOutputFiles = "docker_stderror,l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn.diag,.envBefore_asetup.sh-wMtu_n_6382,user-payload.proxy,.envBefore_asetup.sh-ieOWl6_2,user.upatel.364112.Sherpa.DAOD_PHYS.e5271_s3126_r10201_p4355.13-09-21.log.26670774.000002.log"
JobRunCount = 1
LastMatchTime = 1631582535
NumJobMatches = 1
LastJobLeaseRenewal = 1631627329
MachineAttrSlotWeight0 = 1
RemoteWallClockTime = 44794.0
ExitBySignal = false
x509UserProxyVOName = "atlas"
JobStatus = 4
GlobalJobId = "arc-ce01.gridpp.rl.ac.uk#1667156.0#1631582528"
x509UserProxyEmail = "atlas.act1@xxxxxxx"
EnteredCurrentStatus = 1631627329
OrigMaxHosts = 1
x509UserProxyFQAN = "***REDACTED***"
OriginalTransferInput = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn"
x509userproxysubject = "***REDACTED***"
MATCH_EXP_MachineVMID = "-1"
x509UserProxyExpiration = 1631916963
LastJobStatus = 2
LastSuspensionTime = 0
DockerImage = "stfc/grid-workernode-c7:2019-07-02.1"
NumShadowStarts = 1
PeriodicRemove = ( ( RemoteUserCpu + RemoteSysCpu > JobCpuLimit ) ?: false ) || ( ( RemoteWallClockTime > JobTimeLimit ) ?: false )
StartdPrincipal = "condor_pool@xxxxxxxxxxxxxxx/130.246.221.177"
CumulativeRemoteSysCpu = 1980.0
RemoteUserCpu = 0.0
ResidentSetSize_RAW = 1616236
WantDocker = true
RemoteSysCpu = 0.0
JobCurrentStartExecutingDate = 1631582537
LastRemoteHost = "slot1_96@xxxxxxxxxxxxxxxxxxxxxxx"
CumulativeRemoteUserCpu = 41520.0
NumJobCompletions = 1
DiskUsage = 150000
ImageSize_RAW = 1578
MemoryUsage = 1578
TerminationPending = true
TransferInput = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn,/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn/condorjob.sh"
NetworkInputMb = 0.0
NetworkOutputMb = 0.0
NumJobStarts = 1
ResidentSetSize = 1750000
CommittedSlotTime = 44794.0
CommittedTime = 44794
ExitCode = 0
LastPublicClaimId = "<130.246.221.177:9618?addrs=130.246.221.177-9618&noUDP&sock=2799_75ac_3>#1629104689#30064#..."
CompletionDate = 1631627329
RequestMemory = 2000
MachineAttrCpus0 = 1
RequestDisk = DiskUsage
MinHosts = 1
JobUniverse = 5
RootDir = "/"
JobPrio = 0
ProcId = 0
WantEchoXrootd = ifThenElse(x509UserProxyVOName =?= "atlas" || x509UserProxyVOName =?= "cms" || x509UserProxyVOName =?= "lhcb",true,false)
CondorPlatform = "$CondorPlatform: x86_64_RedHat7 $"
NordugridQueue = "EL7"
CurrentHosts = 0
ScheddHostName = "arc-ce01.gridpp.rl.ac.uk"
RalAcctSubGroup = ifThenElse(regexp("ttcms048",Owner),"sum_tests",ifThenElse(regexp("pcms060",Owner),"sum_tests",ifThenElse(regexp("cmssgm",Owner),"sum_tests",ifThenElse(regexp("tatls016",Owner),"sum_tests",ifThenElse(isPreemptable =?= true,( ifThenElse(x509UserProxyVOName =?= "alice","alice",ifThenElse(x509UserProxyVOName =?= "atlas","atlas",ifThenElse(x509UserProxyVOName =?= "cms","cms",ifThenElse(x509UserProxyVOName =?= "lhcb","lhcb","other")))) ),ifThenElse(regexp("pcms",Owner) && RequestCpus > 1,"prodcms_multicore",ifThenElse(regexp("pcms",Owner),"prodcms",ifThenElse(regexp("ttcms",Owner) && RequestCpus > 1,"cms_pilot_multicore",ifThenElse(regexp("ttcms",Owner),"cms_pilot",ifThenElse(regexp("cms",Owner),"cms",ifThenElse(regexp("patl",Owner) && RequestCpus > 1,"prodatls_multicore",ifThenElse(regexp("patl",Owner),"prodatls",ifThenElse(regexp("tatl",Owner) && RequestCpus > 1,"atlas_pilot_multicore",ifThenElse(regexp("tatl",Owner),"atlas_pilot",ifThenElse(regexp("atl",Owner),"atlas",ifThenElse(regexp("tlhcb",Owner),"lhcb_pilot",ifThenElse(regexp("plhcb",Owner),"prodlhcb",ifThenElse(regexp("lhcb",Owner),"lhcb",ifThenElse(regexp("talce",Owner),"alice_pilot",ifThenElse(regexp("alice",Owner),"alice",ifThenElse(regexp("dteam",Owner),"dteam",ifThenElse(regexp("ops",Owner),"ops",ifThenElse(regexp("nagios",Owner),"nagios",ifThenElse(regexp("hone",Owner),"h1",ifThenElse(regexp("ilc",Owner),"ilc",ifThenElse(regexp("mice",Owner),"mice",ifThenElse(regexp("na62",Owner),"spsna62",ifThenElse(regexp("pheno",Owner),"pheno",ifThenElse(regexp("suprb",Owner),"Super_B",ifThenElse(regexp("superbpm",Owner),"superbpm",ifThenElse(regexp("t2k",Owner),"t2k",ifThenElse(regexp("snopluspm",Owner),"snopluspm",ifThenElse(regexp("snoplus",Owner),"snoplus",ifThenElse(regexp("bio",Owner),"bio",ifThenElse(regexp("fusn",Owner),"fusn",ifThenElse(regexp("epic",Owner),"epic",ifThenElse(regexp("geant",Owner),"geant",ifThenElse(regexp("glast",Owner),"glast",ifThenElse(regexp("hyperk",Owner),"hyperk",ifThenElse(regexp("enmr",Owner),"enmr",ifThenElse(regexp("lsst",Owner),"lsst",ifThenElse(regexp("ligo",Owner),"ligo",ifThenElse(regexp("gpp",Owner),"gridpp",ifThenElse(regexp("ska",Owner),"ska","none"))))))))))))))))))))))))))))))))))))))))))))
JobDescription = "user_upatel_364"
Iwd = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn"
LocalSysCpu = 0.0
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
ClusterId = 1667156
RalAcctGroup = ifThenElse(regexp("ttcms048",Owner),"group_HIGHPRIO",ifThenElse(regexp("pcms060",Owner),"group_HIGHPRIO",ifThenElse(regexp("cmssgm",Owner),"group_HIGHPRIO",ifThenElse(regexp("plhcb024",Owner),"group_HIGHPRIO",ifThenElse(regexp("tatls016",Owner),"group_HIGHPRIO",ifThenElse(isPreemptable =?= true,"group_PREEMPTABLE",ifThenElse(x509UserProxyVOName =?= "alice","group_ALICE",ifThenElse(x509UserProxyVOName =?= "atlas","group_ATLAS",ifThenElse(x509UserProxyVOName =?= "cms","group_CMS",ifThenElse(x509UserProxyVOName =?= "lhcb","group_LHCB",ifThenElse(x509UserProxyVOName =?= "dteam","group_DTEAM_OPS",ifThenElse(x509UserProxyVOName =?= "ops","group_DTEAM_OPS",ifThenElse(regexp("nagios",Owner),"group_OTHER","group_NONLHC")))))))))))))
NumCkpts = 0
NumCkpts_RAW = 0
JobStartDate = 1631582535
MachineVMID = "$$([ifThenElse(isUndefined(VMID), -1, VMID)])"
MachineRalScaling = "$$([ifThenElse(isUndefined(RalScaling), ifThenElse(isUndefined(ScalingFactor), 1.00, ScalingFactor), RalScaling)])"
TransferIn = false
MachineContainerImageName = "$$([ifThenElse(isUndefined(CONTAINER_IMAGE_NAME), \"\", CONTAINER_IMAGE_NAME)])"
MachineMesosTaskId = "$$([ifThenElse(isUndefined(MESOS_TASK_ID), \"\", MESOS_TASK_ID)])"
RequestCpus = 1
JobLeaseDuration = 2400
QDate = 1631582528
ConcurrencyLimits = strcat(RalAcctGroup,",",RalAcctSubGroup,",",Owner)
CommittedSuspensionTime = 0
BytesRecvd = 1952489.0
EncryptExecuteDirectory = false
CumulativeSlotTime = 44794.0
MachineMesosContainerName = "$$([ifThenElse(isUndefined(MESOS_CONTAINER_NAME), \"\", MESOS_CONTAINER_NAME)])"
ExecutableSize_RAW = 17
CumulativeSuspensionTime = 0
MyType = "Job"
Rank = 0.0
NumSystemHolds = 0
NumRestarts = 0
TransferInputSizeMB = 1
NiceUser = false
AccountingGroup = strcat(RalAcctGroup,".",RalAcctSubGroup,".",Owner)
LocalUserCpu = 0.0
ExecutableSize = 17
Owner = "tatls002"
JobNotification = 0
User = "tatls002@xxxxxxxxxxxxxxx"
Out = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn.comment"
BufferSize = 524288
LeaveJobInQueue = false
Arguments = ""
WantCheckpoint = false
JobCurrentStartDate = 1631582535
JobCpuLimit = 345600
TargetType = "Machine"
OnExitHold = false
CoreSize = -1
PeriodicHold = false
OnExitRemove = true
BufferBlockSize = 32768
Err = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn.comment"
JobMemoryLimit = 2048000
UserLog = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn/log"
StreamErr = false
JobTimeLimit = 345600
CondorVersion = "$CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $"
x509userproxy = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn/user.proxy"
PeriodicRelease = false
In = "/dev/null"
StreamOut = false
WantRemoteIO = true
Environment = ""
x509UserProxyFirstFQAN = "/atlas/Role=pilot/Capability=NULL"
ImageSize = 1750
WantRemoteSyscalls = false
Cmd = "/var/spool/arc/grid10/l2UMDm1JCkznCIXDjqiBL5XqABFKDmABFKDmTXASDmABFKDm42USFn/condorjob.sh"
TotalSuspensions = 0
ShouldTransferFiles = "YES"
ExitStatus = 0
JobCurrentStartTransferOutputDate = 1631627327
MaxHosts = 1
TotalSubmitProcs = 1
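
One way to probe the "restarted somewhere else" hypothesis against the ad above is to check its run counters and timestamps. The Python sketch below uses values copied from that ad; the helper names and the one-second tolerance are assumptions made for the example.

```python
# Values copied from the job ad for 1667156.0 above.
ad = {
    "JobStartDate": 1631582535,
    "CompletionDate": 1631627329,
    "RemoteWallClockTime": 44794.0,
    "NumJobStarts": 1,
    "NumShadowStarts": 1,
    "JobRunCount": 1,
    "NumRestarts": 0,
    "NumSystemHolds": 0,
}

def ran_exactly_once(ad: dict) -> bool:
    """True when the counters show a single start and no restarts."""
    return (ad["NumJobStarts"] == 1
            and ad["NumShadowStarts"] == 1
            and ad["JobRunCount"] == 1
            and ad["NumRestarts"] == 0)

def wall_clock_matches_timestamps(ad: dict, tolerance: float = 1.0) -> bool:
    """True when CompletionDate - JobStartDate agrees with RemoteWallClockTime."""
    elapsed = ad["CompletionDate"] - ad["JobStartDate"]
    return abs(elapsed - ad["RemoteWallClockTime"]) <= tolerance

print("single run:", ran_exactly_once(ad))
print("wall clock consistent:", wall_clock_matches_timestamps(ad))
```

For this job both checks pass, and NumSystemHolds is 0, which suggests the job ran once from start to completion and that the zeroed RemoteUserCpu and RemoteSysCpu are not explained by a restart or a hold.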