
Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAds



Hi all,

 

Hopefully a small development with our ongoing CPU ClassAd issue. I’ve noticed a pattern with jobs that don’t display any accounting data. Looking inside the file named by the Out ClassAd attribute, I see the following errors:

 

Detecting resource accounting method available for the job.

Looking for /usr/bin/time tool for accounting measurements

GNU time found and will be used for job accounting.

mv: cannot stat ‘/pool/condor/dir_13082/xSnKDmU3gsznCIXDjqiBL5XqABFKDmABFKDm9ZgQDmABFKDmh4zr7m/log.26878966._000008.job.log.1’: No such file or directory

mv: cannot stat ‘/pool/condor/dir_13082/xSnKDmU3gsznCIXDjqiBL5XqABFKDmABFKDm9ZgQDmABFKDmh4zr7m/gmlog’: No such file or directory

 

Does anyone know what might produce these errors? I’ve not seen them before. This may be a red herring, but what is the gmlog file used for?

 

Many thanks,

 

Tom Birkett

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Date: Tuesday, 14 September 2021 at 15:13
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>, Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

Hi Stefano,

 

After a week or so of head scratching, I’m still none the wiser. However, I can confirm one thing: all the jobs have JobStatus = 4 (Completed).

 

I attach the detailed output of one of these jobs for your perusal. Of interest is this ClassAd attribute:

 

PeriodicRemove = ( ( RemoteUserCpu + RemoteSysCpu > JobCpuLimit ) ?: false ) || ( ( RemoteWallClockTime > JobTimeLimit ) ?: false )
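To illustrate how this expression behaves when the CPU counters read zero, here is a minimal Python sketch. It is a simplified model of ClassAd evaluation, not HTCondor's evaluator: `None` stands in for UNDEFINED, the helpers `add`, `greater` and `elvis` are hypothetical, and the limit values in the examples are made up.

```python
# Simplified model of ClassAd semantics: None plays the role of UNDEFINED.

def add(a, b):
    """Arithmetic involving UNDEFINED yields UNDEFINED."""
    return None if a is None or b is None else a + b

def greater(a, b):
    """A comparison involving UNDEFINED yields UNDEFINED."""
    return None if a is None or b is None else a > b

def elvis(value, default):
    """Model of `value ?: default`: fall back only when value is UNDEFINED."""
    return default if value is None else value

def periodic_remove(ad):
    """Mirror the PeriodicRemove expression quoted above for a job ad (dict)."""
    cpu_over = greater(add(ad.get("RemoteUserCpu"), ad.get("RemoteSysCpu")),
                       ad.get("JobCpuLimit"))
    wall_over = greater(ad.get("RemoteWallClockTime"), ad.get("JobTimeLimit"))
    return elvis(cpu_over, False) or elvis(wall_over, False)

# Hypothetical job ads; every number here is invented for illustration.
zeroed = {"RemoteUserCpu": 0.0, "RemoteSysCpu": 0.0, "JobCpuLimit": 1000,
          "RemoteWallClockTime": 500.0, "JobTimeLimit": 2000}
counted = dict(zeroed, RemoteUserCpu=1500.0)
no_limit = {"RemoteUserCpu": 1500.0, "RemoteSysCpu": 0.0,
            "RemoteWallClockTime": 500.0, "JobTimeLimit": 2000}

print(periodic_remove(zeroed))    # CPU clause cannot fire with zero counters
print(periodic_remove(counted))   # CPU clause fires once the counters exceed the limit
print(periodic_remove(no_limit))  # undefined JobCpuLimit falls back to false
```

Under this model the CPU-limit clause only has an effect if the counters are non-zero while the job is still running. The thread does not establish whether they are, so this only shows what the expression would do.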

 

Your help so far has been invaluable, and any further suggestions will be gratefully received. If there are any further debugging techniques I can try, please do let me know. To say this has me stumped would be an understatement!

 

Many thanks,

 

Tom Birkett

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Date: Monday, 6 September 2021 at 17:27
To: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

Hi Stefano,

 

Thank you for all your help. I will continue my investigation and report back with what I find!

 

Many thanks,

 

Tom Birkett

 

From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Organisation: INFN-CNAF
Date: Friday, 3 September 2021 at 16:22
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>, "Birkett, Thomas (STFC,RAL,SC)" <thomas.birkett@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

 

Hi Thomas,
Well, that means that your accounting data are kept in the CumulativeRemoteSysCpu and CumulativeRemoteUserCpu
job ClassAd attributes, and for some reason (unknown to me) the ones your accounting considers (RemoteSysCpu, RemoteUserCpu)
are sometimes zero.

If your accounting tool is APEL, it should consider this set of job ClassAd attributes:
GlobalJobId Owner RemoteWallClockTime RemoteUserCpu RemoteSysCpu JobStartDate EnteredCurrentStatus ResidentSetSize_RAW ImageSize_RAW RequestCpus
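As a rough sketch, the attribute list above can be turned into a condor_history command line in Python. The helper name `history_command` and its defaults are illustrative assumptions. The constraint mirrors the `jobstatus == 4` filter used elsewhere in this thread, and the sketch only builds and prints the command rather than running it.

```python
import shlex

# Attribute names copied from the list above.
APEL_ATTRIBUTES = [
    "GlobalJobId", "Owner", "RemoteWallClockTime", "RemoteUserCpu",
    "RemoteSysCpu", "JobStartDate", "EnteredCurrentStatus",
    "ResidentSetSize_RAW", "ImageSize_RAW", "RequestCpus",
]

def history_command(attributes, constraint="JobStatus == 4", limit=10):
    """Build (but do not run) a condor_history call printing the given attributes."""
    return ["condor_history", "-limit", str(limit),
            "-constraint", constraint, "-af", *attributes]

command = history_command(APEL_ATTRIBUTES)
print(shlex.join(command))
```

On a host with HTCondor installed, the list could be passed to `subprocess.run(command, capture_output=True, text=True)` to collect the values the accounting would see.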

We keep CumulativeRemoteSysCpu and CumulativeRemoteUserCpu for CPU accounting instead,
so we would not observe your problem, but even then, I cannot find jobs having
RemoteUserCpu =!= CumulativeRemoteUserCpu here.

Somehow these jobs have RemoteUserCpu reset (as if they were about to restart somewhere else?).
Try adding JobStatus to the previous condor_history query, to see whether it is always 4 (Completed), always 3 (Removed), or a mix.
Then inspect the full job ClassAd for a few of these jobs:

condor_history -lim 1 -l 1470220.0

and look for HoldReason, LastHoldReason, *remove* or the like: maybe you will catch a hint as to why this happens.
Also check whether SYSTEM_PERIODIC_HOLD or SYSTEM_PERIODIC_REMOVE might be involved.
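The inspection suggested above can be sketched in Python as a filter over the `Attribute = value` lines printed by `condor_history -l`. The sample ad below is hypothetical (only the owner and the CPU values come from this thread, and the hold reason text is made up), and `suspicious_attributes` is an illustrative helper name.

```python
import re

# Hypothetical excerpt of a long-format job ad; values are for illustration only.
SAMPLE_AD = """\
JobStatus = 4
Owner = "hyperk046"
RemoteUserCpu = 0.0
CumulativeRemoteUserCpu = 681.0
LastHoldReason = "made-up example reason"
PeriodicRemove = false
"""

# Match attribute names containing "hold" or "remove", ignoring case.
PATTERN = re.compile(r"hold|remove", re.IGNORECASE)

def suspicious_attributes(ad_text):
    """Return the attributes whose names mention hold or remove."""
    found = {}
    for line in ad_text.splitlines():
        name, separator, value = line.partition(" = ")
        if separator and PATTERN.search(name):
            found[name.strip()] = value.strip()
    return found

for name, value in suspicious_attributes(SAMPLE_AD).items():
    print(f"{name} = {value}")
```

Only the attribute name is matched, so names such as RemoteUserCpu are not picked up by accident.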

Good luck :)

Stefano

On 03/09/21 16:14, Thomas Birkett - STFC UKRI wrote:

Hi Stefano,

 

Thank you for the rapid response. I do indeed get results. Running this against one of our CEs, we get the following:

 

1470220.0 hyperk046 0.0 25.0 0.0 681.0

1467782.0 tatls002 0.0 362.0 0.0 19160.0

1465443.0 alicesgm 0.0 173.0 0.0 16853.0

1470193.0 hyperk046 0.0 28.0 0.0 843.0

1467760.0 tatls002 0.0 379.0 0.0 9846.0

1470156.0 hyperk046 0.0 27.0 0.0 823.0

1467678.0 tatls002 0.0 212.0 0.0 10323.0

1466269.0 patls036 0.0 1293.0 0.0 49790.0

1470209.0 hyperk046 0.0 28.0 0.0 840.0

1428889.0 tlhcb005 0.0 6286.0 0.0 172552.0
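To make the rows above easier to read, here is a minimal Python sketch. It labels each column in the order requested by the `-af:j` query in this thread (job id, Owner, RemoteSysCpu, CumulativeRemoteSysCpu, RemoteUserCpu, CumulativeRemoteUserCpu) and reports how much CPU time appears only in the cumulative counters. The class and function names are illustrative, not part of HTCondor.

```python
from typing import List, NamedTuple

# Rows copied from the condor_history output above.
OUTPUT = """\
1470220.0 hyperk046 0.0 25.0 0.0 681.0
1467782.0 tatls002 0.0 362.0 0.0 19160.0
1465443.0 alicesgm 0.0 173.0 0.0 16853.0
1470193.0 hyperk046 0.0 28.0 0.0 843.0
1467760.0 tatls002 0.0 379.0 0.0 9846.0
1470156.0 hyperk046 0.0 27.0 0.0 823.0
1467678.0 tatls002 0.0 212.0 0.0 10323.0
1466269.0 patls036 0.0 1293.0 0.0 49790.0
1470209.0 hyperk046 0.0 28.0 0.0 840.0
1428889.0 tlhcb005 0.0 6286.0 0.0 172552.0
"""

class JobCpu(NamedTuple):
    job_id: str
    owner: str
    remote_sys_cpu: float
    cumulative_remote_sys_cpu: float
    remote_user_cpu: float
    cumulative_remote_user_cpu: float

def parse_rows(text: str) -> List[JobCpu]:
    """Parse whitespace-separated rows, skipping blank or malformed lines."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        job_id, owner, *numbers = fields
        jobs.append(JobCpu(job_id, owner, *map(float, numbers)))
    return jobs

def unaccounted_cpu_seconds(job: JobCpu) -> float:
    """CPU seconds present in the cumulative counters but not the plain ones."""
    return ((job.cumulative_remote_sys_cpu - job.remote_sys_cpu)
            + (job.cumulative_remote_user_cpu - job.remote_user_cpu))

jobs = parse_rows(OUTPUT)
for job in jobs:
    print(f"{job.job_id} {job.owner}: {unaccounted_cpu_seconds(job):.0f} s unaccounted")
```

Every row in this sample has both plain counters at zero, so accounting based on RemoteSysCpu and RemoteUserCpu would record no CPU time for these jobs.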

 

Many thanks,

 

Tom Birkett

 

From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Organisation: INFN-CNAF
Reply to: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Friday, 3 September 2021 at 15:08
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>, "Birkett, Thomas (STFC,RAL,SC)" <thomas.birkett@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor RemoteUserCpu and RemoteSysCpu ClassAdds

 

CORRECTION:

condor_history -lim 10 -cons 'jobstatus == 4 && ((RemoteSysCpu =!= CumulativeRemoteSysCpu) || (RemoteUserCpu =!= CumulativeRemoteUserCpu)) ' -af:j Owner RemoteSysCpu CumulativeRemoteSysCpu RemoteUserCpu CumulativeRemoteUserCpu

On 03/09/21 16:04, Stefano Dal Pra wrote:

Hello,
Do you find any result with a search like the following?

condor_history -lim -cons 'jobstatus == 4 && ((RemoteSysCpu =!= CumulativeRemoteSysCpu) || (RemoteUserCpu =!= CumulativeRemoteUserCpu)) ' -af:j Owner RemoteSysCpu CumulativeRemoteSysCpu RemoteUserCpu CumulativeRemoteUserCpu

Stefano


On 03/09/21 11:56, Thomas Birkett - STFC UKRI wrote:

Dear HTCondor-users,

 

I hope you are all keeping well. At RAL we appear to have an issue with our Condor jobs reporting incorrect RemoteUserCpu and RemoteSysCpu values. We are currently seeing jobs complete with a value of zero for both of these ClassAd attributes. The issue appeared after we upgraded our worker nodes to Condor 8.8.12 from 8.6.13. We changed no other configuration during the upgrade.

 

Currently this issue appears to affect 70% of jobs per month, according to the accounting DB on our NorduGrid ARC-CEs, and it causes an incorrect monthly efficiency value to be calculated.

 

From a Condor perspective, what could be causing this after the version change? I attach a dump of the condor_config_val output from one of our worker nodes for your perusal. Any help will be gratefully received.

 

Versions:

  1. Condor Central Managers: 8.8.15
  2. NorduGrid ARC-CEs: 8.6.13
  3. Worker nodes: 8.8.15

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 


 

 


 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
 
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


