Re: [Condor-users] Problem with schedd ad ?



Thanks for the answers,

On Friday, 6 January 2006 at 19:02 -0600, Derek Wright wrote:
> On Fri, 06 Jan 2006 11:04:07 +0100  Jean-Christophe BACCON wrote:
> 
> > With condor 6.7.10, I have the following error message in my negotiator
> > logs:
> ...
> > 1/5 18:36:40 Phase 4.1:  Negotiating with schedds ...
> > 1/5 18:36:40   Error!  Could not get Name and ScheddIpAddr from ad
> > 1/5 18:36:40 ---------- Finished Negotiation Cycle ----------
> > 
> > This message is repeated all the time and no more jobs go into the RUN
> > state (but previously running jobs continue normally).
> 
> weird.  we saw the same bug.  i fixed the negotiator in 6.7.14 (sorry
> it's not in the version history... it's also fixed in the forthcoming
> 6.6.11 release, and we still don't have a good system for documenting
> bug fixes that happen in multiple releases) so that when this happens,
> it doesn't abort the entire negotiation cycle, it just ignores the
> badly formed schedd classad and tries to negotiate with other schedds.
> so, if you upgrade your central manager to 6.7.14, when you have this
> problem, at least it won't prevent other schedds from being able to
> negotiate and run jobs.
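
The difference between the two behaviours described above can be sketched
roughly as follows. This is only an illustration, not the negotiator's actual
code: the function name negotiate_cycle, the skip_malformed flag, and the use
of plain dicts in place of real ClassAds are all assumptions made for the
example. What the old negotiator did for schedds already handled before the
abort is also an assumption here.

```python
# Hypothetical sketch: ads are plain dicts standing in for schedd ClassAds.
# The two attribute names come from the error message in the negotiator log.
REQUIRED = ("Name", "ScheddIpAddr")


def negotiate_cycle(schedd_ads, skip_malformed=True):
    """Return the names of the schedds that were negotiated with."""
    negotiated = []
    for ad in schedd_ads:
        malformed = any(not ad.get(key) for key in REQUIRED)
        if malformed:
            if skip_malformed:
                # Behaviour described for 6.7.14: ignore only this ad.
                continue
            # Behaviour described for older versions: abort the whole cycle.
            # (Assumed: schedds handled earlier in the cycle stay handled.)
            return negotiated
        negotiated.append(ad["Name"])
    return negotiated


ads = [
    {"Name": "a", "ScheddIpAddr": "<10.0.0.1:9615>"},
    {"Name": "b"},  # badly formed: no ScheddIpAddr
    {"Name": "c", "ScheddIpAddr": "<10.0.0.3:9615>"},
]
print(negotiate_cycle(ads))                        # skips "b" only
print(negotiate_cycle(ads, skip_malformed=False))  # stops at "b"
```

In this sketch, a pool whose only schedd ad is malformed yields an empty
result either way, since skipping the bad ad leaves nothing to negotiate with.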


So, if a schedd has a badly formed schedd classad, there is no negotiation
for any user of that schedd, even with 6.7.14? I ask because I have only one
schedd in my pool :(


> 
> however, we were never able to reproduce the problem that was causing
> the schedd ads to show up like this in the first place.  i have some
> suspicions, because the code in the schedd responsible for generating
> these classads is a mess and it needs to be re-written (this has been
> on our development to-do list for quite some time).  so, instead of
> trying to really analyze what's causing this bug, we decided to just
> fix the negotiator so it's not such a catastrophic failure when it
> happens, re-write the schedd's code that's generating the ads, and
> hope that the problem goes away once we clean everything up.


What can I do to give you more information on this bug? I haven't
restarted any daemons; I only reconfigured the schedd. And this is not the
first time I have seen this bug in my pool.


> 
> > But I have an "unexpanded" job in my queue :
> ...
> > 104 jobs; 5 idle, 98 running, 0 held, 1 unexpanded
> > 
> > What does this mean ?
> 
> long ago, we distinguished between jobs that have never run
> (unexpanded) and jobs that tried to run at least once but are
> currently not running (idle).  so, when you first submitted jobs to
> condor, they used to show up in the queue with status "U"
> (unexpanded), and only would be "I" (idle) once they had started
> running somewhere and were then evicted for some reason.  however, we
> haven't used this "U" state in ages, so i don't know why condor_q is
> telling you that one of your jobs is unexpanded... that's pretty
> weird.

I saw this in the documentation of old versions of Condor. I think this
problem is related to the badly formed classad bug, because it also happened
the last time I saw that bug.
On that occasion (though not every time the bug has appeared), I restarted
the schedd and it segfaulted while it was stopping. When the schedd
restarted, half of the jobs were missing from the queue.

>  
> 
> > What is the problem ?
> 
> unfortunately, i don't know.  i know the solution will be a newer
> version of the condor_schedd, but i can't say exactly when we're going
> to have a chance to fix this stuff.  certainly before 6.8.0, but i
> don't know exactly what 6.7.x release it'll show up in.
> 
> sorry i can't be more help,

No problem. Let me know if I can help.

> -derek