
Re: [Condor-users] condor_rooster failing to crow



Hi Dan,

Thanks for the quick reply.  I think something is falling through the cracks somewhere.
In the negotiator log I see

01/12 15:48:11 Phase 3:  Sorting submitter ads by priority ...
01/12 15:48:11 Phase 4.1:  Negotiating with schedds ...
01/12 15:48:11     numSlots = 1
01/12 15:48:11     slotWeightTotal = 1.000000
01/12 15:48:11     pieLeft = 1.000
01/12 15:48:11     NormalFactor = 1.000000
01/12 15:48:11     MaxPrioValue = 0.500000
01/12 15:48:11     NumSubmitterAds = 1
01/12 15:48:11   Negotiating with smithic@xxxxxxxxx at <138.253.100.178:58887>
01/12 15:48:11 0 seconds so far
01/12 15:48:11   Calculating submitter limit with the following parameters
01/12 15:48:11     SubmitterPrio       = 0.500000
01/12 15:48:11     SubmitterPrioFactor = 1.000000
01/12 15:48:11     submitterShare      = 1.000000
01/12 15:48:11     submitterAbsShare   = 1.000000
01/12 15:48:11     submitterLimit    = 1.000000
01/12 15:48:11     submitterUsage    = 0.000000
01/12 15:48:11 Socket to smithic@xxxxxxxxx (<138.253.100.178:58887>) already in cache, reusing
01/12 15:48:11     Sending SEND_JOB_INFO/eom
01/12 15:48:11     Getting reply from schedd ...
01/12 15:48:11     Got JOB_INFO command; getting classad/eom
01/12 15:48:11     Request 00020.00000:
01/12 15:48:11 matchmakingAlgorithm: limit 1.000000 used 0.000000 pieLeft 1.000000
01/12 15:48:11       Rejected 20.0 smithic@xxxxxxxxx <138.253.100.178:58887>: no match found
01/12 15:48:11     Sending SEND_JOB_INFO/eom
01/12 15:48:11     Getting reply from schedd ...
01/12 15:48:11     Got NO_MORE_JOBS;  done negotiating
01/12 15:48:11   Submitter smithic@xxxxxxxxx got all it wants; removing it.

which seems to imply no match, but when I use condor_q -ana it gives:

1 match but are currently offline

If I bring the machine online then the job does indeed run.
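
A minimal sketch for checking a negotiator log for the message Dan describes in his reply quoted below. It assumes the "Registering attempt to match offline machine MACHINE by USER." wording he quotes and the "Rejected ..." line format from the excerpt above; the function name and regular expressions are illustrative, not part of Condor.

```python
import re

# Message text as quoted in Dan's reply; MACHINE and USER are placeholders there.
OFFLINE_MATCH = re.compile(
    r"Registering attempt to match offline machine (?P<machine>\S+) by (?P<user>\S+)"
)
# Format taken from the "Rejected ..." line in the log excerpt above.
REJECTED = re.compile(r"Rejected (?P<job>\d+\.\d+) .*: (?P<reason>.+)$")


def scan_negotiator_log(lines):
    """Return (offline_matches, rejections) found in negotiator log lines."""
    offline_matches = []
    rejections = []
    for line in lines:
        m = OFFLINE_MATCH.search(line)
        if m:
            # Drop the sentence-ending period that follows the user name.
            user = m.group("user").rstrip(".")
            offline_matches.append((m.group("machine"), user))
            continue
        m = REJECTED.search(line)
        if m:
            rejections.append((m.group("job"), m.group("reason").strip()))
    return offline_matches, rejections


# Two lines copied from the log excerpt above.
sample = [
    "01/12 15:48:11     Request 00020.00000:",
    "01/12 15:48:11       Rejected 20.0 smithic@xxxxxxxxx "
    "<138.253.100.178:58887>: no match found",
]
offline_matches, rejections = scan_negotiator_log(sample)
print(offline_matches)  # [] : no offline-match attempt was registered
print(rejections)       # [('20.0', 'no match found')]
```

Run against the full log, an empty first list alongside a "no match found" rejection would suggest the negotiator never registered the offline match, which is consistent with what I am seeing.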

Any ideas?

regards,

-ian.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]
> On Behalf Of Dan Bradley
> Sent: 11 January 2010 17:02
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_rooster failing to crow
> 
> Ian,
> 
> Sorry to hear you are having difficulties.  If it is caused by a bug,
> I'll have to eat crow.  Here are some things to help see where it might
> be going wrong.
> 
> The setting of MachineLastMatchTime is initiated by the negotiator.
> With D_FULLDEBUG turned on, you should see a line like the following in
> your NegotiatorLog:
> 
> Registering attempt to match offline machine MACHINE by USER.
> 
> This results in a MERGE_STARTD_AD command being sent to the collector.
> If you have D_COMMAND turned on in the collector, you should see that
> command being received in CollectorLog.
> 
> After that command has been received, the machine ad should contain
> MachineLastMatchTime.  You should be able to see that with condor_status
> -long.
> 
> If something overwrites the offline machine ad, then
> MachineLastMatchTime will go away until the next time the negotiator
> sets it (i.e. the next negotiation cycle where a job matches the offline
> machine).
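
A rough sketch of the check described above: given text in the "Attribute = value" layout that condor_status -long prints (ads separated by blank lines), list the offline machines whose ad lacks MachineLastMatchTime. The sample ads and host names are made up for illustration, and the parser is deliberately simple rather than a real ClassAd parser.

```python
def parse_long_ads(text):
    """Split `condor_status -long` style text into a list of attribute dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current ad.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition("=")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def offline_without_match_time(ads):
    """Names of offline machine ads that do not carry MachineLastMatchTime."""
    missing = []
    for ad in ads:
        is_offline = ad.get("Offline", "").lower() == "true"
        if is_offline and "MachineLastMatchTime" not in ad:
            missing.append(ad.get("Name", "?").strip('"'))
    return missing


# Hypothetical ads: the first has been matched while offline, the second has not.
SAMPLE = """\
Name = "pc1.example.org"
Offline = true
MachineLastMatchTime = 1263311291

Name = "pc2.example.org"
Offline = true
"""

ads = parse_long_ads(SAMPLE)
print(offline_without_match_time(ads))  # ['pc2.example.org']
```

If a machine moves into this list between negotiation cycles, that would point to something overwriting its offline ad.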
> 
> --Dan
> 
> Smith, Ian wrote:
> > Dear All,
> >
> > I'm trying to use condor_rooster in Condor 7.4 to work with our Windows XP pool
> > but with only limited success. To keep compatibility with our current power saving
> > set up I'm trying to avoid using the Condor power saving and instead I'm publishing
> > the ClassAds of offline machines via a cron job so that condor_rooster can wake up
> > the relevant ones.
> >
> > The crux of the matter seems to be in the UNHIBERNATE expression. The
> > documentation (p 216) states that the default value is
> > MachineLastMatchTime =!= UNDEFINED, although I find that it is actually
> > MY.MachineLastMatchTime =!= UNDEFINED. I've tried both, and neither seems
> > to work, as neither MachineLastMatchTime nor MY.MachineLastMatchTime
> > seems to be set. The manual says that
> >
> > "the special attribute MachineLastMatchTime is updated in the ClassAds of offline
> > machines when the job would have been matched to the machine if it had been online"
> >
> > but this doesn't seem to be happening. Using condor_q -ana reveals
> >
> > 019.009:  Run analysis summary.  Of 1 machines,
> >       0 are rejected by your job's requirements
> >       0 reject your job because of their own requirements
> >       0 match but are serving users with a better priority in the pool
> >       0 match but reject the job for unknown reasons
> >       0 match but will not currently preempt their existing job
> >       1 match but are currently offline
> >       0 are available to run your job
> >
> > so the matchmaking is definitely working - it just seems that the machine ClassAd
> > isn't updated. If I set MachineLastMatchTime to some arbitrary value myself then
> >
> > ROOSTER_UNHIBERNATE=Offline && Unhibernate
> >
> > seems to evaluate to TRUE and the wake up kicks in.
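
A toy model of the two expressions mentioned above, only to show why the wake-up depends on MachineLastMatchTime being present. It assumes UNHIBERNATE is MY.MachineLastMatchTime =!= UNDEFINED and ROOSTER_UNHIBERNATE is Offline && Unhibernate, as stated in this message; it is plain Python, not real ClassAd evaluation, and treats a missing Offline attribute as false, which is a simplification.

```python
def unhibernate(ad):
    # Models MY.MachineLastMatchTime =!= UNDEFINED:
    # true only when the attribute is present in the machine ad.
    return "MachineLastMatchTime" in ad


def rooster_unhibernate(ad):
    # Models Offline && Unhibernate.
    # Simplification: a missing Offline attribute counts as false here.
    return ad.get("Offline") is True and unhibernate(ad)


# Hypothetical offline ad as published by the cron job, before and after
# MachineLastMatchTime is set (the timestamp value is arbitrary).
published = {"Name": "pc1.example.org", "Offline": True}
matched = dict(published, MachineLastMatchTime=1263311291)

print(rooster_unhibernate(published))  # False
print(rooster_unhibernate(matched))    # True
```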
> >
> > I've tried D_FULLDEBUG but I still can't track down where the problem is.
> >
> > Any ideas ?
> >
> > regards,
> >
> > -ian.
> >
> >
> > --------------------------------------------
> > Dr Ian C. Smith,
> > e-Science Team,
> > The University of Liverpool,
> > Computing Services Department
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
> >