
Re: [HTCondor-users] Completed job with no history file



Sorry, cut-and-paste error. It should be this:

 

leave_in_queue = JobStatus == 4 && (CompletionDate =?= UNDEFINED || CompletionDate == 0 || ((time() - CompletionDate) < 172800))
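For readers less used to ClassAd syntax, here is a minimal Python sketch of what the expression above decides. It is only a model of the logic, not HTCondor code: the function name `leave_in_queue`, its parameters, and the use of `None` to stand in for UNDEFINED are assumptions made for illustration.

```python
import time

TWO_DAYS = 2 * 24 * 60 * 60  # 172800 seconds, the threshold used above


def leave_in_queue(job_status, completion_date, now=None, keep_seconds=TWO_DAYS):
    """Model of the leave_in_queue expression; None stands in for UNDEFINED."""
    if now is None:
        now = int(time.time())
    # JobStatus == 4 means the job is in the completed state
    if job_status != 4:
        return False
    # CompletionDate =?= UNDEFINED || CompletionDate == 0
    if completion_date is None or completion_date == 0:
        return True
    # (time() - CompletionDate) < 172800
    return (now - completion_date) < keep_seconds
```

With these assumptions, a completed job is kept while fewer than 172800 seconds have passed since CompletionDate, and it becomes eligible to leave the queue once that age is reached.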

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: Monday, October 26, 2020 10:03 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Completed job with no history file

 

Jobs submitted with -spool will get a default for LeaveJobInQueue only if the submit file did not specify one,

so the way to override this is to add this line to your submit file

 

leave_in_queue = JobStatus == 4 && (CompletionDate =?= UNDEFINED || CompletionDate == 0 || ((time() - %s) < 172800))

 

If you don't have control over the submit file, then you can do this with a job transform.  -spool will submit the job on hold

with a special hold reason code, so we can use that to make the transform apply only to jobs submitted with -spool.

 

JOB_TRANSFORM_InQueueTwoDays @=end

  REQUIREMENTS HoldReasonCode == 16

   SET LeaveJobInQueue JobStatus == 4 && (CompletionDate is undefined || CompletionDate == 0 || ((time() - CompletionDate) < 172800))
@end

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Stefano Dal Pra
Sent: Thursday, October 22, 2020 2:51 PM
To: Todd Tannenbaum <tannenba@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Completed job with no history file

 


Thank you Todd, you were right.

It turns out that these jobs are submitted using -spool, which implies:
<<For this case, the default expression causes the job to be kept in the queue for 10 days after completion.>>

and in fact:
LeaveJobInQueue is set to

   JobStatus == 4 && (CompletionDate is undefined || CompletionDate == 0 || ((time() - CompletionDate) < 864000))

Apparently, the "10 days quarantine" cannot be altered, as the 864000 seems to be a hardcoded value (is it?),
but I wanted to shorten it to two days, so I wrote the following Job Transform rule:

JOB_TRANSFORM_InQueueTwoDays @=end
   REQUIREMENTS True
   if RegExp(" < 864000",unparse(LeaveJobInQueue))
      SET LeaveJobInQueue "JobStatus == 4 && (CompletionDate is undefined || CompletionDate == 0 || ((time() - CompletionDate) < 172800))"
   endif
@end

This seems to work; however, I don't much like the
if RegExp(" < 864000",unparse(LeaveJobInQueue))
part. Maybe I'm just missing a simpler check?

Thanks again
Stefano

On 20/10/20 22:14, Todd Tannenbaum wrote:

On 10/20/2020 9:47 AM, Stefano Dal Pra wrote:

Hello, this is about HTCondor 8.8.9.

I noticed recently that there are completed jobs which seem to disappear from the point of view of condor_history,
also leaving no history log file under PER_JOB_HISTORY_DIR.


Jobs do not enter into the history file(s) when they are completed, they enter the history file(s) when they leave the schedd database.

If you can see the job with condor_q, you will not see it with condor_history.  And vice versa.

By default jobs are removed from the schedd whenever they enter the completed state (JobStatus==4) or removed state (JobStatus==3). 

However, this can be customized via the "leave_in_queue" statement in the job submit file.  See the condor_submit man page for details.

Looks like at your site something is setting leave_in_queue as follows, which means the job will stay in the schedd for 10 days in completed state,
and then after 10 days it will be written into the history file(s):

  LeaveJobInQueue = JobStatus == 4 && (CompletionDate =?= undefined || CompletionDate == 0 || ((time() - CompletionDate) < 864000))
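As a worked example of the arithmetic, the short Python sketch below converts the 864000-second threshold to days and applies it to the CompletionDate that condor_q reports for job 9865068.0 in the quoted message further down. The variable names are hypothetical, and the sketch only shows when the expression stops being true; exactly when the schedd then moves the job to the history file is not modeled here.

```python
from datetime import datetime, timedelta

KEEP_SECONDS = 864000  # threshold in the LeaveJobInQueue expression above

# CompletionDate as printed by condor_q for job 9865068.0 in this thread
completed = datetime.fromisoformat("2020-10-17T09:12:43+02:00")

# First moment at which (time() - CompletionDate) < 864000 is no longer true
eligible = completed + timedelta(seconds=KEEP_SECONDS)

print(KEEP_SECONDS / 86400)   # days the job is held in the queue
print(eligible.isoformat())   # kept at the same fixed +02:00 offset as the input
```

The threshold works out to 10 days, so for this job the expression would stop holding at 2020-10-27T09:12:43 (expressed at the same +02:00 offset), and the job should not be expected in condor_history before then.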

Hope the above helps,
Todd






One example. This job completed apparently with no errors after running for ~ 26K seconds:

[root@sn-01 ~]# condor_q -name sn-01 9865068.0 -af:jln LastJobStatus JobStatus AcctGroup LastRemoteHost CpusProvisioned CumulativeRemoteUserCpu RemoteWallClockTime ExitBySignal ExitCode ExitStatus 'abstime(JobStartDate)' 'abstime(JobCurrentStartTransferOutputDate)' NumJobStarts NumJobCompletions ResidentSetSize_RAW 'abstime(x509UserProxyExpiration)' 'abstime(CompletionDate)'
ID = 9865068.0
 LastJobStatus = 2
 JobStatus = 4
 AcctGroup = virgo
 LastRemoteHost = slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 CpusProvisioned = 2
 CumulativeRemoteUserCpu = 9302.0
 RemoteWallClockTime = 26025.0
 ExitBySignal = false
 ExitCode = 0
 ExitStatus = 0
 abstime(JobStartDate) = absTime("2020-10-17T01:58:58+02:00")
 abstime(JobCurrentStartTransferOutputDate) = absTime("2020-10-17T09:12:42+02:00")
 NumJobStarts = 1
 NumJobCompletions = 1
 ResidentSetSize_RAW = 4461780
 abstime(x509UserProxyExpiration) = absTime("2020-10-17T12:11:11+02:00")
 abstime(CompletionDate) = absTime("2020-10-17T09:12:43+02:00")


However:
[root@sn-01 ~]# condor_history -lim 1 -name sn-01 9865068.0
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED CMD

Finally, I would expect a per-job history file to exist under $(PER_JOB_HISTORY_DIR).
Several files are there, but there is none for this job (nor for others like it).

Any idea?
Thanks
Stefano





-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685