
Re: [Condor-users] Duplicate ClusterIds in condor_history



On 5/3/06, Preston Smith <psmith@xxxxxxxxxx> wrote:
Dan,

Yes, it was restarted right when it started reusing ClusterIds.

I see the file...  I stopped condor, edited job_queue.log, and fixed a
line like

103 0.0 NextClusterNum 4833

replacing 4833 with what the ClusterId *should* be... That seems to
do it: I submitted a test job and numbering is back where it ought to
be.
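
For illustration, the manual edit described above could be scripted
roughly as follows. This is only a sketch: it assumes the NextClusterNum
record always looks exactly like the line quoted above, the function
name is made up, and it works on the file contents as a string, so
reading and writing job_queue.log (with condor stopped) is left to the
caller.

```python
import re

# Assumed record shape, copied from the line quoted in the message:
#   103 0.0 NextClusterNum <number>
_NEXT_CLUSTER = re.compile(
    r"^(103 0\.0 NextClusterNum )(\d+)[ \t]*$", re.MULTILINE
)


def set_next_cluster_num(log_text, new_value):
    """Return log_text with every NextClusterNum record set to new_value.

    Hypothetical helper. Raises ValueError if no record is found, or if
    new_value is lower than a value already present (lowering the
    counter is what invites reuse of cluster ids).
    """
    found = _NEXT_CLUSTER.findall(log_text)
    if not found:
        raise ValueError("no NextClusterNum record found")
    current = max(int(number) for _prefix, number in found)
    if new_value < current:
        raise ValueError("refusing to lower NextClusterNum")
    # Keep the record prefix and swap in the new number.
    return _NEXT_CLUSTER.sub(
        lambda m: m.group(1) + str(new_value), log_text
    )
```

Working on a copy of the file first, and keeping the original around,
seems prudent whichever way the edit is done.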

It is inevitable that cluster ids on a schedd will not be absolutely
unique - assuming they will be is always going to cause you grief (at
least with the current condor versions).

They are simply guaranteed to be unique while they are in the queue
(so they won't get reused - which would obviously break things).

If you want a true globally unique id for a cluster (or indeed a job
within one) you have to add it yourself. I auto add new guids to all
my jobs but I have the luxury of automated submit script writing.
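
As a rough sketch of what adding your own id can look like in an
automated submit script writer: the snippet below generates a guid and
puts it into the submit description as a custom job attribute, using the
"+Name = value" submit file syntax. The attribute name JobGuid and the
helper name are assumptions made up for this example, not anything
condor defines.

```python
import uuid


def add_guid(submit_text, attr_name="JobGuid"):
    """Insert '+<attr_name> = "<guid>"' before the first queue statement.

    Hypothetical helper; JobGuid is an illustrative attribute name.
    Returns a (new_submit_text, guid) tuple.
    """
    guid = str(uuid.uuid4())
    attr_line = '+{} = "{}"'.format(attr_name, guid)
    lines = submit_text.splitlines()
    for index, line in enumerate(lines):
        # A queue statement is a line whose first word is "queue".
        if line.strip().lower().split()[:1] == ["queue"]:
            lines.insert(index, attr_line)
            break
    else:
        raise ValueError("no queue statement found")
    return "\n".join(lines) + "\n", guid
```

The guid then travels with the job ad, so something like
condor_q -constraint 'JobGuid == "<guid>"' should find the job no matter
what cluster id it was given.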

Technically, if you ensure that the job log is never altered or lost,
then a combination of schedd name and cluster/job id is likely to remain
unique for some time. An obvious way this will break is if your
submit machine dies and you replace it with a new machine which is
named to look just like the old one, but you have lost your job queue.
Unless you take manual action to hack the cluster id in the queue file
back up, it will reset to zero.
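
If you do rely on the schedd name plus cluster/job id combination, it is
cheap to at least notice when it stops being unique. A minimal sketch,
with a made-up class name and an in-memory set standing in for whatever
persistent store you would really use:

```python
class JobRegistry:
    """Remembers every (schedd, cluster, proc) key it has been shown."""

    def __init__(self):
        self._seen = set()

    def register(self, schedd_name, cluster_id, proc_id):
        """Return True if the key is new, False if it was seen before."""
        key = (schedd_name, int(cluster_id), int(proc_id))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A False return after a machine rebuild is the signal that the queue has
restarted its numbering and old records can no longer be matched by
cluster id alone.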

The safest way to use cluster ids is to *not* assume they are unique
across time, simply that they are unique for the lifetime of the
cluster itself.* Anything else you need regarding uniqueness of
identification you will have to add yourself.

Matt

* Note that this means that, if you submit the job in such a way that
you know its cluster id, then any programmatic use of that cluster id
on that schedd while you know that the job is still alive will be
fine. This of course needs some graceful handling of a terminal failure
in which your schedd gets hosed but comes back up starting from zero
again. Since this is very unlikely without some user input, you
aren't taking too big a risk.

But if you are programmatically submitting and controlling jobs you
should just bite the bullet and put your own unique id in.