
Re: [Condor-users] classad for wall-time

On Wednesday, April 27, 2011 at 5:27 PM, Santanu Das wrote:
Thanks Ian! I think I see things a bit differently here. First of all, jobs never get preempted here, so they always run on a single machine.
Yes, but that doesn't mean they've only run once, or even at all. A user could have set an on_exit_remove expression that caused a job to run more than once. Or a user may have removed jobs before they were ever run.

(EnteredCurrentStatus - JobCurrentStartDate) is always higher than 'RemoteWallClockTime' here. Say for this month, this is what I see:

[root@serv07 ~]# condor_history -c 'formatTime(EnteredCurrentStatus, "%m") == "04" && \
AccountingGroup =!= UNDEFINED' -format "%s " AccountingGroup -format "%d\n" \
'EnteredCurrentStatus - JobCurrentStartDate' | sed 's/\(.*\)\..* \(.*\)/\1 \2/' | \
awk '{sums[$1] = $2 + ($1 in sums ? sums[$1] : 0)} END {for (x in sums) print x,sums[x]}'
group_alice 608721
group_euindia 2092416
group_calice 150833
group_monitor 452541
group_atlas 278858255
group_lhcb 85749166
group_cms 2825308

[root@serv07 ~]# condor_history -c 'formatTime(EnteredCurrentStatus, "%m") == "04" && AccountingGroup =!= UNDEFINED' \
-format "%s " AccountingGroup -format "%d\n" 'RemoteWallClockTime' | sed 's/\(.*\)\..* \(.*\)/\1 \2/' | \
awk '{sums[$1] = $2 + ($1 in sums ? sums[$1] : 0)} END {for (x in sums) print x,sums[x]}'
group_alice 49846
group_euindia 2111122
group_calice 150833
group_monitor 104893
group_atlas 125423594
group_lhcb 81987039
group_cms 2599357

According to your explanation, I assumed that RemoteWallClockTime would be either higher than or equal to (EnteredCurrentStatus - JobCurrentStartDate), but that's not the case here. Am I seeing the correct thing?
You're assuming every job in the history file ran and completed successfully after only one run attempt. But that isn't necessarily true. Jobs in the history file may not have run at all; they may have gone from idle to completed (i.e. the user removed them) and in that case JobCurrentStartDate can't be relied upon. At best it's going to be zero. At worst, well...who knows. That means the expression (EnteredCurrentStatus - JobCurrentStartDate) could end up being (EnteredCurrentStatus - 0) = EnteredCurrentStatus. Ouch.
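
To make that concrete, here is a rough sketch in Python using made-up job records rather than real condor_history output. The attribute names match the ones used in this thread, but every value is invented, the helper sum_by_group is hypothetical, and the never-started job uses the "at best zero" case for JobCurrentStartDate. The job that ran twice assumes RemoteWallClockTime accumulates across run attempts while JobCurrentStartDate only records the latest start.

```python
from collections import defaultdict

# Hypothetical job records; all values are invented for illustration.
jobs = [
    # Ran exactly once: both measures agree (600 s).
    {"AccountingGroup": "group_a", "JobCurrentStartDate": 1000,
     "EnteredCurrentStatus": 1600, "RemoteWallClockTime": 600},
    # Removed while idle: never started, so the start date is 0 (assumed)
    # and the timestamp difference collapses to EnteredCurrentStatus itself.
    {"AccountingGroup": "group_a", "JobCurrentStartDate": 0,
     "EnteredCurrentStatus": 2000, "RemoteWallClockTime": 0},
    # Ran twice: the start date reflects only the last run (300 s),
    # while the wall clock time covers both runs (700 s, assumed).
    {"AccountingGroup": "group_b", "JobCurrentStartDate": 5000,
     "EnteredCurrentStatus": 5300, "RemoteWallClockTime": 700},
]


def sum_by_group(records, value_of):
    """Sum value_of(job) per accounting group, like the awk step does."""
    totals = defaultdict(int)
    for job in records:
        totals[job["AccountingGroup"]] += value_of(job)
    return dict(totals)


by_timestamps = sum_by_group(
    jobs, lambda j: j["EnteredCurrentStatus"] - j["JobCurrentStartDate"])
by_wall_clock = sum_by_group(jobs, lambda j: j["RemoteWallClockTime"])

print(by_timestamps)  # {'group_a': 2600, 'group_b': 300}
print(by_wall_clock)  # {'group_a': 600, 'group_b': 700}
```

In this toy data the single never-run job inflates group_a's timestamp-based total well past its wall clock total, while the job that ran twice pushes group_b the other way. Either effect can show up in per-group sums.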

If you're trying to account for time spent by users' jobs on machines and build reports, definitely go with RemoteWallClockTime, especially if you don't want to have to write logic to deal with jobs running more or less than exactly one time (which can happen even if you're not using preemption in the system). The number should always be sane, whether the job ran no times, one time or 40 times.

It also doesn't suffer from clock drift issues. Time stamp values aren't always being set from the same machine. Some might be set by the shadow on your scheduler, some might be set by the startd. And if you've got clock drift between the machines, you'll end up with inconsistencies.
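
The drift arithmetic is simple enough to sketch. This is only an illustration: the function stamped_duration and the 45-second offset are hypothetical, and it makes no claim about which daemon actually sets which attribute.

```python
def stamped_duration(true_start, true_end, start_offset, end_offset):
    """Duration as computed from two timestamps taken on different hosts.

    Each offset is how far that host's clock is ahead of true time, in
    seconds (negative if it runs behind). All inputs are hypothetical.
    """
    start_stamp = true_start + start_offset  # stamped by one machine
    end_stamp = true_end + end_offset        # stamped by another machine
    return end_stamp - start_stamp


# A job that truly ran for 600 s, where the host that stamps the start
# runs 45 s fast and the host that stamps the end is accurate.
print(stamped_duration(10_000, 10_600, 45, 0))  # 555

# With no drift between the two hosts the difference is exact.
print(stamped_duration(10_000, 10_600, 0, 0))   # 600
```

The error equals the difference between the two clock offsets, so it is the same size for a ten-second job as for a ten-hour one, and it accumulates when you sum over many jobs.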

Hope that helps.

- Ian

Ian Chesal