
Re: [Condor-users] condor_rooster failing to crow



This program will only keep your brain occupied.

Believe me!


----- Original Message ----
From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Sent: Wed, February 3, 2010 6:12:58 PM
Subject: Re: [Condor-users] condor_rooster failing to crow

Hi Dan,

Thanks for the reply. In the interests of expediency I've not
really debugged this any further. As I'm not planning to use
the power saving feature of Condor on the execute hosts
the offline classad bit is a bit of a moot point. HOWEVER
I have managed to publish my own classads based on those
of machines in the pool with a fairly minimal set of attributes
and the matchmaking (and condor_rooster wake up) 
now seems fine. I just need to try it out - tentatively
at first - on the production service. 

One largish pitfall I've come across is in regard to the timestamp
attributes (ClockMin and ClockDay) in the startd ads. I would
imagine if an execute host is offline for any appreciable time
then these values will be stale. This could cause problems
where the Start expression depends on ClockMin and ClockDay
(in order to enforce availability policy e.g. office hours / out-of-hours).
My guess is that the central manager would detect a match and
try to start the job but it would then fail as the startd would enforce
the actual Start policy based on its own timestamp.

A possible way around this I'm thinking of is to update the
ClockMin and ClockDay attributes periodically via a cron.
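
A minimal sketch of that idea in Python, meant to be run from cron against a
saved copy of the offline ad. It assumes ClockMin is minutes since local
midnight and ClockDay is the day of the week with 0 = Sunday, as the manual
describes them. The function names clock_attributes and refresh_ad are made up
for illustration, and republishing the refreshed ad (for example with
condor_advertise) is left out.

```python
from datetime import datetime


def clock_attributes(now: datetime) -> dict:
    """Return fresh ClockMin/ClockDay values for the given local time."""
    # ClockMin: minutes elapsed since local midnight.
    clock_min = now.hour * 60 + now.minute
    # ClockDay: 0 = Sunday ... 6 = Saturday.
    # isoweekday() gives Monday = 1 ... Sunday = 7, so reduce modulo 7.
    clock_day = now.isoweekday() % 7
    return {"ClockMin": clock_min, "ClockDay": clock_day}


def refresh_ad(ad_text: str, now: datetime) -> str:
    """Rewrite the ClockMin/ClockDay lines of a long-form ad.

    All other lines are passed through unchanged. An attribute that is
    missing from the ad is appended at the end.
    """
    fresh = clock_attributes(now)
    seen = set()
    out = []
    for line in ad_text.splitlines():
        # The attribute name is everything before the first '='.
        name = line.split("=", 1)[0].strip()
        if name in fresh:
            out.append(f"{name} = {fresh[name]}")
            seen.add(name)
        else:
            out.append(line)
    for name, value in fresh.items():
        if name not in seen:
            out.append(f"{name} = {value}")
    return "\n".join(out)
```

For 15:20 on Monday 11 January 2010 this gives ClockMin = 920 and ClockDay = 1,
which agrees with the ClockMin, ClockDay and MyCurrentTime values in the ad
quoted further down.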

thanks for the advice,

will report on the full story when (if ?!) I get it all working,

regards,

-ian.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: 02 February 2010 23:02
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_rooster failing to crow
> 
> Hi Ian,
> 
> Sorry for the late response.
> 
> It is expected that the 7.2.1 condor startd would advertise an
> unmatchable ClassAd when going offline (Requirements=False).
> 
> It is not expected that the 7.4.0 startd offline ad would be
> unmatchable.  I have tested this functionality under unix and I have not
> found anything wrong with the offline ad.  It is possible that something
> is different in a windows environment.  Do you happen to have a copy of
> the bad offline ad, before you edited it?
> 
> When an offline machine is matched, there should be a message such as
> the following in the negotiator log with D_FULLDEBUG:
> 
> 02/02 16:55:58 Registering attempt to match offline machine
> slot2@xxxxxxxxxxxxxxxxxx by dan.
> 
> If nothing but offline slots are available for a job, it will still be
> reported as "rejected" by the matchmaker with "no match found".
> However, the offline ad in the collector should have a new attribute
> MachineLastMatchTime indicating that the negotiator would have matched a
> job to the machine.
> 
> Thanks,
> --Dan
> 
> Smith, Ian wrote:
> > In the interests of completeness I thought I'd better follow this up to say
> > that I think I may have found where the problem lies.
> >
> > I've tried running Condor 7.4.0 on the central manager (Solaris 10) and
> > Condor 7.2.1 on the execute host (Win XP). When the PC goes into
> > hibernation a ClassAd is sent to the manager and recorded in the offline.log.
> > When I look at the negotiator log I see that there is no match with
> > queued jobs but - and this is a big BUT - it looks like Requirements
> > is set to FALSE (using condor_status -l). If I merge in another ClassAd
> > (with MERGE_STARTD_AD) to set Requirements to a more sensible value
> > then the matchmaking works. I found that I also had to set the Unhibernate
> > expression explicitly in the offline ClassAd
> > (Unhibernate = LastMatchTime =!= UNDEFINED)
> > and then bingo condor_rooster does in fact wake up the PC.
> >
> > Moving to Condor 7.4.0 on the execute host threw up some more surprises.
> > I found that although the machine did not power down, a ClassAd was still
> > sent out at the point where it should have, and this was recorded in the offline.log.
> > I stopped Condor on the execute host and advertised the offline ClassAd
> > myself. Again the matchmaking failed and it was only after I removed a
> > lot of the attributes that it worked. I've listed a working ClassAd below.
> >
> > regards,
> >
> > -ian.
> >
> > Offline = TRUE
> > MyType = "Machine"
> > ClassAdLifetime = 2000000
> > TargetType = "Job"
> > Name = "ADMN10-84463C.livad.liv.ac.uk"
> > Rank = 0.000000
> > CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> > Unhibernate = MachineLastMatchTime =!= UNDEFINED
> > MyCurrentTime = 1263223234
> > Machine = "ADMN10-84463C.livad.liv.ac.uk"
> > PublicNetworkIpAddr = "<138.253.103.228:3578>"
> > CSD_CONDOR_POOL = "TEST"
> > CSD_REVISION = "23_JULY_2008"
> > CSD_HAS_SSL = TRUE
> > COLLECTOR_HOST_STRING = "ulgp3.liv.ac.uk"
> > CondorVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
> > CondorPlatform = "$CondorPlatform: INTEL-WINNT50 $"
> > SlotID = 1
> > VirtualMachineID = 1
> > VirtualMemory = 3151356
> > TotalDisk = 6430264
> > Disk = 6430264
> > CondorLoadAvg = 0.000000
> > LoadAvg = 0.060000
> > KeyboardIdle = 304
> > ConsoleIdle = 304
> > Memory = 2045
> > Cpus = 1
> > StartdIpAddr = "<138.253.103.228:3578>"
> > Arch = "INTEL"
> > OpSys = "WINNT51"
> > UidDomain = "liv.ac.uk"
> > FileSystemDomain = "ADMN10-84463C.livad.liv.ac.uk"
> > HasIOProxy = TRUE
> > CheckpointPlatform = "WINNT51 INTEL Unknown normal"
> > WindowsMajorVersion = 5
> > WindowsMinorVersion = 1
> > WindowsBuildNumber = 2600
> > WindowsServicePackMajorVersion = 2
> > WindowsServicePackMinorVersion = 0
> > WindowsProductType = 1
> > TotalVirtualMemory = 3151356
> > TotalCpus = 1
> > TotalMemory = 2045
> > KFlops = 769806
> > Mips = 4037
> > LastBenchmark = 1263222930
> > TotalLoadAvg = 0.060000
> > TotalCondorLoadAvg = 0.000000
> > ClockMin = 920
> > ClockDay = 1
> > TotalSlots = 1
> > TotalVirtualMachines = 1
> > HasFileTransfer = TRUE
> > HasPerFileEncryption = TRUE
> > HasReconnect = TRUE
> > HasMPI = TRUE
> > HasTDP = TRUE
> > HasJobDeferral = TRUE
> > HasJICLocalConfig = TRUE
> > HasJICLocalStdin = TRUE
> > HasWindowsRunAsOwner = TRUE
> > HasVM = FALSE
> > HibernationLevel = 0
> > HibernationState = "NONE"
> > CanHibernate = TRUE
> > HardwareAddress = "00:1A:A0:BE:27:01"
> > IsWakeOnLanSupported = TRUE
> > IsWakeOnLanEnabled = TRUE
> > IsWakeAble = TRUE
> > WakeOnLanSupportedFlags = "Magic Packet"
> > WakeOnLanEnabledFlags = "Magic Packet"
> > CpuBusyTime = 0
> > CpuIsBusy = FALSE
> > TimeToLive = 2147483647
> > State = "Unclaimed"
> > EnteredCurrentState = 1263222973
> > Activity = "Idle"
> > EnteredCurrentActivity = 1263222973
> > TotalTimeOwnerIdle = 10
> > TotalTimeUnclaimedIdle = 43
> > TotalTimeClaimedBusy = 261
> > Start = (Owner == "smithic")
> > Requirements = TRUE
> > IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
> > MaxJobRetirementTime = 0
> > LastFetchWorkSpawned = 0
> > LastFetchWorkCompleted = 0
> > NextFetchWorkDelay = -1
> > CurrentRank = 0.000000
> > MonitorSelfTime = 1263223170
> > MonitorSelfCPUUsage = 0.214469
> > MonitorSelfImageSize = 46532.000000
> > MonitorSelfResidentSetSize = 9976
> > MonitorSelfAge = 250
> > MonitorSelfRegisteredSocketCount = 2
> > MyAddress = "<138.253.103.228:3578>"
> > LastHeardFrom = 1263223234
> > StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasVM,HasWindowsRunAsOwner"
> >
> >> -----Original Message-----
> >> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> >> bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> >> Sent: 11 January 2010 17:02
> >> To: Condor-Users Mail List
> >> Subject: Re: [Condor-users] condor_rooster failing to crow
> >>
> >> Ian,
> >>
> >> Sorry to hear you are having difficulties.  If it is caused by a bug,
> >> I'll have to eat crow.  Here are some things to help see where it might
> >> be going wrong.
> >>
> >> The setting of MachineLastMatchTime is initiated by the negotiator.
> >> With D_FULLDEBUG turned on, you should see a line like the following in
> >> your NegotiatorLog:
> >>
> >> Registering attempt to match offline machine MACHINE by USER.
> >>
> >> This results in a MERGE_STARTD_AD command being sent to the collector.
> >> If you have D_COMMAND turned on in the collector, you should see that
> >> command being received in CollectorLog.
> >>
> >> After that command has been received, the machine ad should contain
> >> MachineLastMatchTime.  You should be able to see that with condor_status
> >> -long.
> >>
> >> If something overwrites the offline machine ad, then
> >> MachineLastMatchTime will go away until the next time the negotiator
> >> sets it (i.e. the next negotiation cycle where a job matches the offline
> >> machine).
> >>
> >> --Dan
> >>
> >> Smith, Ian wrote:
> >>
> >>> Dear All,
> >>>
> >>> I'm trying to use condor_rooster in Condor 7.4 to work with our Windows XP pool
> >>> but with only limited success. To keep compatibility with our current power saving
> >>> set up I'm trying to avoid using the Condor power saving and instead I'm publishing
> >>> the ClassAds of offline machines via a cron so that condor_rooster can wake up
> >>> the relevant ones.
> >>>
> >>> The crux of the matter seems to be in the UNHIBERNATE expression. In the documentation
> >>> (p 216) it states that the default value is MachineLastMatchTime =!= UNDEFINED although
> >>> I find that it is actually MY.MachineLastMatchTime =!= UNDEFINED. I've tried both and
> >>> neither seems to work as neither MachineLastMatchTime nor MY.MachineLastMatchTime
> >>> seems to be set. The manual says that
> >>>
> >>> "the special attribute MachineLastMatchTime is updated in the ClassAds of offline machines
> >>> when the job would have been matched to the machine if it had been online"
> >>>
> >>> but this doesn't seem to be happening. Using condor_q -ana reveals
> >>>
> >>> 019.009:  Run analysis summary.  Of 1 machines,
> >>>       0 are rejected by your job's requirements
> >>>       0 reject your job because of their own requirements
> >>>       0 match but are serving users with a better priority in the pool
> >>>       0 match but reject the job for unknown reasons
> >>>       0 match but will not currently preempt their existing job
> >>>       1 match but are currently offline
> >>>       0 are available to run your job
> >>>
> >>> so the matchmaking is definitely working - it just seems that the machine ClassAd isn't
> >>> updated. If I set MachineLastMatchTime to some arbitrary value myself then
> >>>
> >>> ROOSTER_UNHIBERNATE=Offline && Unhibernate
> >>>
> >>> seems to evaluate to TRUE and the wake up kicks in.
> >>>
> >>> I've tried D_FULLDEBUG but I still can't track down where the problem is.
> >>>
> >>> Any ideas ?
> >>>
> >>> regards,
> >>>
> >>> -ian.
> >>>
> >>>
> >>> --------------------------------------------
> >>> Dr Ian C. Smith,
> >>> e-Science Team,
> >>> The University of Liverpool,
> >>> Computing Services Department
> >>>
> >>> _______________________________________________
> >>> Condor-users mailing list
> >>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> >>> subject: Unsubscribe
> >>> You can also unsubscribe by visiting
> >>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>>
> >>> The archives can be found at:
> >>> https://lists.cs.wisc.edu/archive/condor-users/
> >>>
> >>>