
Re: [Condor-users] lost jobs

On Apr 8, 2011, at 7:54 PM, Santanu Das wrote:

I see a number of jobs that were submitted to the queue (and eventually failed) and are logged in job_queue.log, but condor_history knows nothing about them. Here is one example, ClusterId = 604510:
[root@serv07 spool]# cat job_queue.log | sed -n -e '/ClusterId 604510/{x;p;g;$N;N;N;N;N;p;D}'

103 0604510.-1 ClusterId 604510
103 0604510.-1 QDate 1302172359
103 0604510.-1 CompletionDate 0
103 0604510.-1 User "pltlhc15@xxxxxxxxxxxxxxxxxxxxxxxx"
103 0604510.-1 Owner "pltlhc15"
[root@serv07 spool]# condor_history 604510 && date
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD            
Thu Apr  7 15:23:26 BST 2011

Does anyone know why I'm seeing this?
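The sed one-liner above prints only a fixed window of lines around the ClusterId record. Below is a minimal sketch of a more targeted filter. It assumes the "103 <key> <attribute> <value>" SetAttribute layout visible in the excerpt, with keys like "0604510.-1" (the cluster ad) and, presumably, "0604510.0" for proc ads. The function name show_cluster_attrs is hypothetical. Because it prints every attribute record for the cluster, it also shows whether any proc-level records were ever written.

```bash
#!/usr/bin/env bash
# Print every SetAttribute (op code 103) record in a job_queue.log whose
# key belongs to the given cluster. Assumes the layout seen in the excerpt:
#   103 0604510.-1 ClusterId 604510
show_cluster_attrs() {
  local log="$1" cluster="$2"
  awk -v c="$cluster" '
    $1 == "103" {
      key = $2
      sub(/^0+/, "", key)   # drop leading zero(s), e.g. "0604510.-1" -> "604510.-1"
      split(key, id, ".")   # id[1] = cluster, id[2] = proc (-1 = cluster ad)
      if (id[1] == c) print
    }' "$log"
}
```

Run from the spool directory, for example: show_cluster_attrs job_queue.log 604510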

Do you have job history enabled? If 'condor_config_val HISTORY' responds with 'Not defined: HISTORY', then Condor doesn't keep a history of old jobs.
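Once HISTORY is defined, one way to rule out a condor_history quirk is to search the history file(s) directly. This is a rough sketch under two assumptions: each job ad in the history file is written with one attribute per line in the form "ClusterId = <n>", and rotated copies sit next to the live file with a suffix. The function name history_files_with_cluster is hypothetical.

```bash
#!/usr/bin/env bash
# List which of the given history files contain an ad with this ClusterId.
# Assumes one attribute per line, written as "ClusterId = <n>".
history_files_with_cluster() {
  local cluster="$1"; shift
  # -F literal string, -x whole-line match, -l print matching file names only
  grep -F -x -l -- "ClusterId = $cluster" "$@"
}
```

For example: history_files_with_cluster 604510 "$(condor_config_val HISTORY)"* (the trailing glob is meant to pick up rotated copies, if your naming matches). No output and a non-zero exit status mean no file contains that cluster.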

Yes, I do have job history enabled.

When Condor is keeping a history of old jobs, jobs will be dropped from the history when enough later jobs leave the queue. If you want the history to go back further in the past, you can adjust ENABLE_HISTORY_ROTATION, MAX_HISTORY_LOG, and MAX_HISTORY_ROTATIONS.
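To reason about how far back the history can reach, here is a back-of-the-envelope sketch. It assumes rotation keeps the live history file plus MAX_HISTORY_ROTATIONS rotated copies, each capped at roughly MAX_HISTORY_LOG bytes. The average ad size is a hypothetical input you would measure on your own history file. The function names and the sample numbers in the usage line are illustrative only, not Condor defaults.

```bash
#!/usr/bin/env bash
# Upper bound on bytes retained: live file + MAX_HISTORY_ROTATIONS rotated
# copies, each up to MAX_HISTORY_LOG bytes (assumed rotation behaviour).
history_capacity_bytes() {
  local max_log="$1" rotations="$2"
  echo $(( max_log * (rotations + 1) ))
}

# Approximate number of job ads that fit, given an average ad size in bytes
# (a measured or assumed figure, not something Condor reports).
history_capacity_jobs() {
  local max_log="$1" rotations="$2" avg_ad_bytes="$3"
  echo $(( $(history_capacity_bytes "$max_log" "$rotations") / avg_ad_bytes ))
}
```

With illustrative values of a 20000000-byte log, 2 rotations and 4000-byte ads, history_capacity_jobs 20000000 2 4000 gives 15000 jobs. Compare that with how many jobs leave your queue per day. Real values can be read with condor_config_val MAX_HISTORY_LOG and condor_config_val MAX_HISTORY_ROTATIONS.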

If you look at the QDate in job_queue.log, the job was submitted on the same day that I queried it with condor_history:
[testac1@serv07 ~]$ date -d@1302172359
Thu Apr  7 11:32:39 BST 2011
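The gap can be made explicit with a small helper that formats the difference between two Unix timestamps as H:MM:SS. This is a minimal sketch, and elapsed_hms is a hypothetical name. The second timestamp, 1302186206, is 15:23:26 BST (14:23:26 UTC) on 7 Apr 2011, the time printed right after the condor_history run. So the query came about 3 hours 51 minutes after submission.

```bash
#!/usr/bin/env bash
# Format the gap between two Unix timestamps (start, end) as H:MM:SS.
elapsed_hms() {
  local secs=$(( $2 - $1 ))
  printf '%d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}
```

For example, elapsed_hms 1302172359 1302186206 prints 3:50:47. Using elapsed_hms 1302172359 "$(date +%s)" measures against the current time instead.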

I keep 30 [MAX_]HISTORY_LOG, so that is enough for a day's worth of info to stay in the history log.

Did these jobs ever appear in a run of condor_q? Is there any sign the jobs were ever executed? Did the schedd have to restart at any point?

Thanks and regards,
Jaime Frey
UW-Madison Condor Team