
Re: [Condor-users] duplicate jobIDs in the condor_history



On Wed, Nov 24, 2010 at 4:34 PM, Santanu Das <santanu@xxxxxxxxxxxxxxxxx> wrote:
Hi all,

I see lots and lots of jobs running with duplicate jobIDs. At the time of writing, there are almost 700 of them:

[root@serv07 ~]# condor_history | awk '{ print $1 }' | sort | uniq -d | wc -l
684
and the number is growing every hour, which is causing us great trouble debugging some of the issues we have here.
Is it a bug?

Not really. Condor doesn't guarantee that cluster IDs will be unique for a scheduler for all time. If you delete the $(SPOOL) directory, or even just the job_queue.log file, for a scheduler, your cluster IDs will be reset.

So the first question is:

Did you delete the $(SPOOL) directory for the scheduler, the contents of that directory, or the job_queue.log file? If so, you reset the cluster ID counter and that's why you've got duplicates.
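
If it helps to see which IDs are repeated and how often, rather than only how many, something like the following might do. It is only a rough sketch: dup_job_ids is a made-up helper name, and it assumes the job ID is the first column of the condor_history output and that the header row starts with "ID".

```shell
# List job IDs that appear more than once, with their occurrence count.
# Reads condor_history-style text on stdin. Assumes the job ID is the first
# whitespace-separated column and that the header row starts with "ID".
dup_job_ids() {
    awk '$1 != "ID" && NF > 0 { seen[$1]++ }
         END { for (id in seen) if (seen[id] > 1) print id, seen[id] }' | sort
}

# Usage on the submit host:
#   condor_history | dup_job_ids
```

Each output line is a job ID followed by the number of times it shows up in the history.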

If you're certain you haven't wiped the job_queue.log file for the scheduler, is it possible you have multiple schedulers writing to the same history file? If so: that's bad. Each scheduler should have its own history file.
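
One way to tell the two cases apart might be to look at the GlobalJobId attribute of the duplicated jobs. The sketch below assumes GlobalJobId has the form "schedd_name#cluster.proc#timestamp" and that your condor_history accepts -format; ids_from_multiple_schedds is a made-up helper name. It prints the job IDs that were recorded under more than one scheduler name, which would point at a shared history file rather than a reset counter.

```shell
# Reads lines of the form "<cluster.proc> <GlobalJobId>" on stdin and prints
# each job ID that was recorded under more than one scheduler name.
# Assumption: GlobalJobId looks like "schedd_name#cluster.proc#timestamp".
ids_from_multiple_schedds() {
    awk '{
             split($2, part, "#")              # part[1] is the scheduler name
             key = $1 SUBSEP part[1]
             if (!(key in seen)) { seen[key] = 1; nsched[$1]++ }
         }
         END { for (id in nsched) if (nsched[id] > 1) print id }' | sort
}

# Usage sketch (the -format flags emit one "ID GlobalJobId" line per job):
#   condor_history -format '%d.' ClusterId -format '%d ' ProcId \
#                  -format '%s\n' GlobalJobId | ids_from_multiple_schedds
```

If this prints nothing while duplicates still exist, the repeated IDs all came from the same scheduler, which is more consistent with the cluster ID counter having been reset.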

- Ian