Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps

Date: Fri, 08 Jul 2022 17:19:32 -0400
From: James Alexander Clark <jaclark@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps

Thanks, Todd

On 7/8/22 15:34, Todd L Miller wrote:

I have an application which is configured to checkpoint and self-exit every 60 minutes.
I am confused by the output of condor_userlog (see below):
Nothing in the rest of HTCondor is (presently) expected to know anything at all about what self-checkpointing jobs are doing. We'll probably fix that at some point. As far as I know, however, the CPU usage for each individual execution of the job should be correct, so a reading of 0 seems like a problem.


FWIW, the job I happen to be looking at is still running with:

$ condor_q 58968166 -af RemoteUserCpu -af CumulativeRemoteUserCpu

107.0 31916.0

So I guess if the usage comes from the end-of-job summary in the userlog, that's not too surprising. I'll try finding a completed one to compare with.
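
For a sense of scale, the two values printed above can be converted to hours. This is only a minimal sketch, assuming both attributes are reported in seconds; the function name `summarize_cpu` is hypothetical and the input string is just the condor_q output quoted above.

```python
def summarize_cpu(af_output: str) -> dict:
    """Parse the two whitespace-separated numbers printed by the
    condor_q -af command above (RemoteUserCpu, CumulativeRemoteUserCpu).

    Assumption: both values are CPU times in seconds.
    """
    remote_s, cumulative_s = (float(x) for x in af_output.split())
    return {
        "remote_user_cpu_s": remote_s,
        "cumulative_user_cpu_s": cumulative_s,
        # Convert to hours for easier comparison with wall times.
        "remote_user_cpu_h": remote_s / 3600.0,
        "cumulative_user_cpu_h": cumulative_s / 3600.0,
    }


summary = summarize_cpu("107.0 31916.0")
print(f"RemoteUserCpu:           {summary['remote_user_cpu_h']:.2f} h")
print(f"CumulativeRemoteUserCpu: {summary['cumulative_user_cpu_h']:.2f} h")
```

Under that assumption, the cumulative figure is several hours of user CPU while the current value is under two minutes, which fits a job that has only recently resumed.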

- The wall times all being less than 2 hours seems suspicious to me: I'm guessing > 1 hour corresponds to cases where the job resumes on the same host after a checkpoint?
The job will always _try_ to resume on the same host after a checkpoint. It should only switch hosts if it gets preempted (which I think you have turned off) or evicted for running too long or whatever, or if the execute node breaks in some way.

Right, and now I see the 4 hour continuous run in the Host/Job section so that makes sense.

Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?
I'd have to take a look at the actual job log; I suspect condor_userlog is misunderstanding something. If the only change you made to the application is how often it checkpoints -- and it's running on the same machines, etc -- you shouldn't see any changes to how long it runs before being evicted.

I *think* a new issue might have manifested since I saw the 8 hour runs: we've got a few people with those socket disconnect messages every 50 mins to 1.5 hours. I did track down some stuff in the shadowlog that looked a lot like:


https://lists.cs.wisc.edu/archive/htcondor-users/2011-June/msg00096.shtml

But that's totally unrelated to the userlog query and I'll start a separate thread/issue when I've got a bit more data.


Thanks


- ToddM


--
James Alexander Clark
LIGO Laboratory
California Institute of Technology
email:  james.clark@xxxxxxxx
Tel. (cell):  413-230-1412

References:
- [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
  - From: James Alexander Clark
- Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps
  - From: Todd L Miller

