
Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps



Thanks, Todd

On 7/8/22 15:34, Todd L Miller wrote:
I have an application which is configured to checkpoint and self-exit every 60 minutes.

I am confused by the output of condor_userlog (see below):

Nothing in the rest of HTCondor is (presently) expected to know anything at all about what self-checkpointing jobs are doing. We'll probably fix that at some point. As far as I know, however, the CPU usage for each individual execution of the job should be correct, so a reading of 0 seems like a problem.

FWIW, the job I happen to be looking at is still running with:

$ condor_q 58968166 -af RemoteUserCpu -af CumulativeRemoteUserCpu

107.0 31916.0

So I guess if the usage comes from the end-of-job summary in the userlog, that's not too surprising. I'll try finding a completed one to compare with.
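
For scale, the two numbers above can be converted to hours. This is only a minimal sketch: it assumes the output line is exactly the two requested attributes separated by whitespace, in the order given on the command line, and that both are in seconds. The helper names (parse_cpu_times, seconds_to_hours) are hypothetical and not part of HTCondor.

```python
def parse_cpu_times(line):
    """Split a 'RemoteUserCpu CumulativeRemoteUserCpu' output line into two floats.

    Assumes exactly two whitespace-separated numeric fields, in that order.
    """
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
    remote_user_cpu, cumulative_remote_user_cpu = (float(f) for f in fields)
    return remote_user_cpu, cumulative_remote_user_cpu


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


if __name__ == "__main__":
    # The line printed by the condor_q command above.
    current, cumulative = parse_cpu_times("107.0 31916.0")
    print(f"RemoteUserCpu:           {seconds_to_hours(current):.2f} h")
    print(f"CumulativeRemoteUserCpu: {seconds_to_hours(cumulative):.2f} h")
```

On this reading, the cumulative figure is nearly 9 hours of user CPU while the first value is under 2 minutes, which fits a job that resumed from a checkpoint shortly before the query.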



- The wall times all being less than 2 hours seems suspicious to me: I'm guessing > 1 hour corresponds to cases where the job resumes on the same host after a checkpoint?

The job will always _try_ to resume on the same host after a checkpoint. It should only switch hosts if it gets preempted (which I think you have turned off) or evicted for running too long or whatever, or if the execute node breaks in some way.

Right, and now I see the 4 hour continuous run in the Host/Job section so that makes sense.


Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?

I'd have to take a look at the actual job log; I suspect condor_userlog is misunderstanding something. If the only change you made to the application is how often it checkpoints (and it's running on the same machines, etc.) you shouldn't see any changes to how long it runs before being evicted.

I *think* a new issue might have manifested since I saw the 8 hour runs: we've got a few people seeing those socket disconnect messages every 50 minutes to 1.5 hours. I did track down some messages in the shadow log that looked a lot like:

https://lists.cs.wisc.edu/archive/htcondor-users/2011-June/msg00096.shtml

But that's totally unrelated to the userlog query and I'll start a separate thread/issue when I've got a bit more data.

Thanks


- ToddM

--
James Alexander Clark
LIGO Laboratory
California Institute of Technology
email:  james.clark@xxxxxxxx
Tel. (cell):  413-230-1412