
Re: [HTCondor-users] Understanding condor_userlog output for self-checkpointing apps



I have an application which is configured to checkpoint and self-exit every 60 minutes.

I am confused by the output of condor_userlog (see below):

Nothing in the rest of HTCondor is (presently) expected to know anything at all about what self-checkpointing jobs are doing. We'll probably fix that at some point. As far as I know, however, the CPU usage for each individual execution of the job should be correct, so a reading of 0 seems like a problem.

- The wall times all being less than 2 hours seems suspicious to me: I'm guessing > 1 hour corresponds to cases where the job resumes on the same host after a checkpoint?

The job will always _try_ to resume on the same host after a checkpoint. It should only switch hosts if it gets preempted (which I think you have turned off), gets evicted for some other reason such as running too long, or if the execute node breaks in some way.

Before we reconfigured to a 1 hour interval, we were running with the default 8 hours and saw wall times of that same order. Are we really just getting unlucky here and getting evicted a few minutes after each resume?

I'd have to take a look at the actual job log; I suspect condor_userlog is misunderstanding something. If the only change you made to the application is how often it checkpoints -- and it's running on the same machines, etc -- you shouldn't see any changes to how long it runs before being evicted.
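
For anyone who wants to cross-check condor_userlog against the raw job log, here is a rough Python sketch that measures the wall time of each execution as the span from an execute event (001) to the next evicted (004) or terminated (005) event for the same job. The function name, the sample log text, and the regular expression are all made up for illustration. It assumes the newer "YYYY-MM-DD HH:MM:SS" timestamps in the event header (older logs use a different date format and would need a different pattern), and it makes no assumption about how a self-checkpoint exit itself shows up in the log.

```python
import re
from datetime import datetime

# Assumed event header layout (hypothetical sample, ISO-style timestamps):
#   001 (1234.000.000) 2023-05-01 10:00:00 Job executing on host: <...>
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
)

EXECUTE = "001"
END_OF_RUN = {"004", "005"}  # evicted, terminated


def execution_intervals(log_text):
    """Return {(cluster, proc): [seconds, ...]}, one entry per execution."""
    started = {}  # job -> time of the most recent execute event
    runs = {}     # job -> list of wall times in seconds
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # event body lines and "..." separators are skipped
        code, cluster, proc, stamp = m.groups()
        job = (int(cluster), int(proc))
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if code == EXECUTE:
            started[job] = when
        elif code in END_OF_RUN and job in started:
            delta = when - started.pop(job)
            runs.setdefault(job, []).append(int(delta.total_seconds()))
    return runs


# Invented log excerpt, only to show the shape of the input.
SAMPLE = """\
000 (1234.000.000) 2023-05-01 09:59:00 Job submitted from host: <10.0.0.1:9618>
...
001 (1234.000.000) 2023-05-01 10:00:00 Job executing on host: <10.0.0.2:9618>
...
004 (1234.000.000) 2023-05-01 11:05:00 Job was evicted.
...
001 (1234.000.000) 2023-05-01 11:10:00 Job executing on host: <10.0.0.2:9618>
...
005 (1234.000.000) 2023-05-01 12:12:30 Job terminated.
...
"""

if __name__ == "__main__":
    for job, seconds in execution_intervals(SAMPLE).items():
        print(job, seconds)
```

If the per-execution spans computed this way disagree with what condor_userlog reports, that would point at condor_userlog's accounting rather than at the job actually being evicted early.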

- ToddM