
Re: [HTCondor-users] Completed job with no history file



On 10/20/2020 9:47 AM, Stefano Dal Pra wrote:
Hello, condor 8.8.9 speaking

I recently noticed that some completed jobs seem to disappear from the point of view of condor_history,
and they also leave no per-job history file under PER_JOB_HISTORY_DIR.

Jobs do not enter the history file(s) when they complete; they enter the history file(s) when they leave the schedd's job queue.

If you can see the job with condor_q, you will not see it with condor_history.  And vice versa.

By default jobs are removed from the schedd whenever they enter the completed state (JobStatus==4) or removed state (JobStatus==3). 

However, this can be customized via the "leave_in_queue" statement in the job submit file.  See the condor_submit man page for details.

It looks like something at your site is setting leave_in_queue as follows, which means a completed job will stay in the schedd for 10 days (864000 seconds) after its CompletionDate,
and only after those 10 days will it be written into the history file(s):

  LeaveJobInQueue = JobStatus == 4 && (CompletionDate =?= undefined || CompletionDate == 0 || ((time() - CompletionDate) < 864000))
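
To make the arithmetic concrete, here is a rough Python rendering of that expression, applied to the CompletionDate of the example job quoted below. This is only a sketch: the function name and the use of None to stand in for a ClassAd "undefined" value are my own assumptions for illustration, and it is not how the schedd evaluates ClassAds internally.

```python
from datetime import datetime, timedelta, timezone

TEN_DAYS = 864000  # seconds; the constant used in the expression above


def leave_job_in_queue(job_status, completion_date, now):
    """Illustrative stand-in for the LeaveJobInQueue expression.

    completion_date is epoch seconds, or None to model ClassAd 'undefined'.
    Returns True while the job should remain in the schedd queue.
    """
    return job_status == 4 and (
        completion_date is None
        or completion_date == 0
        or (now - completion_date) < TEN_DAYS
    )


# CompletionDate of job 9865068.0, taken from the condor_q output (+02:00).
tz = timezone(timedelta(hours=2))
completed = int(datetime(2020, 10, 17, 9, 12, 43, tzinfo=tz).timestamp())

# Earliest instant at which the expression turns false, i.e. the job
# may leave the queue and be written to the history file(s).
eligible = datetime.fromtimestamp(completed + TEN_DAYS, tz=tz)

if __name__ == "__main__":
    three_days_later = completed + 3 * 86400
    print(leave_job_in_queue(4, completed, three_days_later))  # still in queue
    print(eligible.isoformat())
```

With these inputs the job is still held in the queue three days after completion, and the expression first becomes false at 2020-10-27T09:12:43+02:00, which is why condor_history shows nothing for it yet.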

Hope the above helps,
Todd






One example. This job completed apparently with no errors after running for ~ 26K seconds:

[root@sn-01 ~]# condor_q -name sn-01 9865068.0 -af:jln LastJobStatus JobStatus AcctGroup LastRemoteHost CpusProvisioned CumulativeRemoteUserCpu RemoteWallClockTime ExitBySignal ExitCode ExitStatus 'abstime(JobStartDate)' 'abstime(JobCurrentStartTransferOutputDate)' NumJobStarts NumJobCompletions ResidentSetSize_RAW 'abstime(x509UserProxyExpiration)' 'abstime(CompletionDate)'
ID = 9865068.0
 LastJobStatus = 2
 JobStatus = 4
 AcctGroup = virgo
 LastRemoteHost = slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 CpusProvisioned = 2
 CumulativeRemoteUserCpu = 9302.0
 RemoteWallClockTime = 26025.0
 ExitBySignal = false
 ExitCode = 0
 ExitStatus = 0
 abstime(JobStartDate) = absTime("2020-10-17T01:58:58+02:00")
 abstime(JobCurrentStartTransferOutputDate) = absTime("2020-10-17T09:12:42+02:00")
 NumJobStarts = 1
 NumJobCompletions = 1
 ResidentSetSize_RAW = 4461780
 abstime(x509UserProxyExpiration) = absTime("2020-10-17T12:11:11+02:00")
 abstime(CompletionDate) = absTime("2020-10-17T09:12:43+02:00")


However:
[root@sn-01 ~]# condor_history -lim 1 -name sn-01 9865068.0
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED CMD

Finally,
I would expect a per-job history file to exist under $(PER_JOB_HISTORY_DIR).
Several files are there, but there is none for this job (nor for others like it).

Any idea?
Thanks
Stefano



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685