
Re: [Condor-users] condor_rooster failing to crow



In the interests of completeness I thought I'd better follow this up to say
that I think I may have found where the problem lies.

I've tried running Condor 7.4.0 on the central manager (Solaris 10) and
Condor 7.2.1 on the execute host (Win XP). When the PC goes into
hibernation a ClassAd is sent to the manager and recorded in the offline.log.
When I look at the negotiator log I see that there is no match with 
queued jobs but - and this is a big BUT - it looks like Requirements
is set to FALSE (using condor_status -l). If I merge in another ClassAd
(with MERGE_STARTD_AD) to set Requirements to a more sensible value 
then the matchmaking works. I found that I also had to set the Unhibernate
expression explicitly in the offline ClassAd (Unhibernate =  LastMatchTime =!= UNDEFINED)
and then bingo condor_rooster does in fact wake up the PC.
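
To illustrate why the explicit Unhibernate expression matters, here is a rough
Python model of how ROOSTER_UNHIBERNATE = Offline && Unhibernate might evaluate
against an offline ad. It is only a sketch of ClassAd-style three-valued logic,
not Condor code: the UNDEFINED sentinel and the helper names (classad_and,
is_not_identical, rooster_unhibernate, default_unhibernate) are made up for the
example, and real ClassAd evaluation handles more cases (ERROR values, type
checks) than this does.

```python
# Toy model of ClassAd-style evaluation; all names here are illustrative.
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def is_not_identical(a, b):
    """Model of =!= : always yields True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return a != b


def classad_and(a, b):
    """Model of && over True / False / UNDEFINED operands."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def default_unhibernate(ad):
    # Unhibernate = MachineLastMatchTime =!= UNDEFINED
    return is_not_identical(ad.get("MachineLastMatchTime", UNDEFINED), UNDEFINED)


def rooster_unhibernate(ad):
    # ROOSTER_UNHIBERNATE = Offline && Unhibernate
    offline = ad.get("Offline", UNDEFINED)
    expr = ad.get("Unhibernate")  # expression stored in the ad, if present
    unhibernate = expr(ad) if expr is not None else UNDEFINED
    return classad_and(offline, unhibernate)


# Offline ad with no Unhibernate attribute at all.
no_expr = {"Offline": True}
# Offline ad with Unhibernate set, but not yet matched by the negotiator.
unmatched = {"Offline": True, "Unhibernate": default_unhibernate}
# Same ad after MachineLastMatchTime has been merged in.
matched = dict(unmatched, MachineLastMatchTime=1263223234)

for label, ad in (("no_expr", no_expr), ("unmatched", unmatched), ("matched", matched)):
    result = rooster_unhibernate(ad)
    print(label, "UNDEFINED" if result is UNDEFINED else result)
```

In this model only the third ad evaluates to TRUE: an ad without Unhibernate
gives UNDEFINED, and an ad without MachineLastMatchTime gives FALSE, which is
consistent with the machine staying asleep in both of those cases.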

Moving to Condor 7.4.0 on the execute host threw up some more surprises.
I found that, although the machine did not power down, a ClassAd was sent
out at the point when it should have done so, and this was recorded in the offline.log.
I stopped Condor on the execute host and advertised the offline ClassAd 
myself. Again the matchmaking failed and it was only after I removed a
lot of the attributes that it worked. I've listed a working ClassAd below.

regards,

-ian.

Offline = TRUE
MyType = "Machine"
ClassAdLifetime = 2000000
TargetType = "Job"
Name = "ADMN10-84463C.livad.liv.ac.uk"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
Unhibernate = MachineLastMatchTime =!= UNDEFINED
MyCurrentTime = 1263223234
Machine = "ADMN10-84463C.livad.liv.ac.uk"
PublicNetworkIpAddr = "<138.253.103.228:3578>"
CSD_CONDOR_POOL = "TEST"
CSD_REVISION = "23_JULY_2008"
CSD_HAS_SSL = TRUE
COLLECTOR_HOST_STRING = "ulgp3.liv.ac.uk"
CondorVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT50 $"
SlotID = 1
VirtualMachineID = 1
VirtualMemory = 3151356
TotalDisk = 6430264
Disk = 6430264
CondorLoadAvg = 0.000000
LoadAvg = 0.060000
KeyboardIdle = 304
ConsoleIdle = 304
Memory = 2045
Cpus = 1
StartdIpAddr = "<138.253.103.228:3578>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "liv.ac.uk"
FileSystemDomain = "ADMN10-84463C.livad.liv.ac.uk"
HasIOProxy = TRUE
CheckpointPlatform = "WINNT51 INTEL Unknown normal"
WindowsMajorVersion = 5
WindowsMinorVersion = 1
WindowsBuildNumber = 2600
WindowsServicePackMajorVersion = 2
WindowsServicePackMinorVersion = 0
WindowsProductType = 1
TotalVirtualMemory = 3151356
TotalCpus = 1
TotalMemory = 2045
KFlops = 769806
Mips = 4037
LastBenchmark = 1263222930
TotalLoadAvg = 0.060000
TotalCondorLoadAvg = 0.000000
ClockMin = 920
ClockDay = 1
TotalSlots = 1
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasWindowsRunAsOwner = TRUE
HasVM = FALSE
HibernationLevel = 0
HibernationState = "NONE"
CanHibernate = TRUE
HardwareAddress = "00:1A:A0:BE:27:01"
IsWakeOnLanSupported = TRUE
IsWakeOnLanEnabled = TRUE
IsWakeAble = TRUE
WakeOnLanSupportedFlags = "Magic Packet"
WakeOnLanEnabledFlags = "Magic Packet"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Unclaimed"
EnteredCurrentState = 1263222973
Activity = "Idle"
EnteredCurrentActivity = 1263222973
TotalTimeOwnerIdle = 10
TotalTimeUnclaimedIdle = 43
TotalTimeClaimedBusy = 261
Start = (Owner == "smithic")
Requirements = TRUE 
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
LastFetchWorkSpawned = 0
LastFetchWorkCompleted = 0
NextFetchWorkDelay = -1
CurrentRank = 0.000000
MonitorSelfTime = 1263223170
MonitorSelfCPUUsage = 0.214469
MonitorSelfImageSize = 46532.000000
MonitorSelfResidentSetSize = 9976
MonitorSelfAge = 250
MonitorSelfRegisteredSocketCount = 2
MyAddress = "<138.253.103.228:3578>"
LastHeardFrom = 1263223234
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasVM,HasWindowsRunAsOwner"
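
For anyone wanting to cut an ad like this down before advertising it by hand
(for example with condor_advertise), the following Python sketch shows one
possible way to parse a dump in "Name = expression" form and keep only a chosen
set of attributes. The KEEP tuple is purely hypothetical: I have not
established which attributes are actually required for matchmaking, so treat
the list as a placeholder to adjust by experiment. The function names are
likewise invented for this example.

```python
def parse_classad(text):
    """Split 'Name = expression' lines into a dict of raw expression strings."""
    ad = {}
    for line in text.splitlines():
        # Split at the first " = " only, so operators such as =!= survive.
        name, sep, expr = line.partition(" = ")
        if sep and name.strip():
            ad[name.strip()] = expr.strip()
    return ad


def trim_classad(ad, keep):
    """Return a new ad holding only the attributes named in keep."""
    return {name: ad[name] for name in keep if name in ad}


def render_classad(ad):
    """Write the ad back out, one 'Name = expression' per line."""
    return "\n".join(f"{name} = {expr}" for name, expr in ad.items()) + "\n"


# Hypothetical selection; the attributes really needed are an open question.
KEEP = ("MyType", "TargetType", "Name", "Machine", "Offline",
        "Unhibernate", "Requirements", "HardwareAddress")

sample = """Offline = TRUE
MyType = "Machine"
Name = "ADMN10-84463C.livad.liv.ac.uk"
Unhibernate = MachineLastMatchTime =!= UNDEFINED
KFlops = 769806
Requirements = TRUE
"""

trimmed = trim_classad(parse_classad(sample), KEEP)
print(render_classad(trimmed), end="")
```

The parser deliberately keeps each expression as an uninterpreted string, so
nothing is re-evaluated or reformatted on the way through.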





> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: 11 January 2010 17:02
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_rooster failing to crow
> 
> Ian,
> 
> Sorry to hear you are having difficulties.  If it is caused by a bug,
> I'll have to eat crow.  Here are some things to help see where it might
> be going wrong.
> 
> The setting of MachineLastMatchTime is initiated by the negotiator.
> With D_FULLDEBUG turned on, you should see a line like the following in
> your NegotiatorLog:
> 
> Registering attempt to match offline machine MACHINE by USER.
> 
> This results in a MERGE_STARTD_AD command being sent to the collector.
> If you have D_COMMAND turned on in the collector, you should see that
> command being received in CollectorLog.
> 
> After that command has been received, the machine ad should contain
> MachineLastMatchTime.  You should be able to see that with condor_status
> -long.
> 
> If something overwrites the offline machine ad, then
> MachineLastMatchTime will go away until the next time the negotiator
> sets it (i.e. the next negotiation cycle where a job matches the offline
> machine).
> 
> --Dan
> 
> Smith, Ian wrote:
> > Dear All,
> >
> > I'm trying to use condor_rooster in Condor 7.4 to work with our Windows XP pool
> > but with only limited success. To keep compatibility with our current power saving
> > set up I'm trying to avoid using the Condor power saving and instead I'm publishing
> > the ClassAds of offline machines via a cron so that condor_rooster can wake up
> > the relevant ones.
> >
> > The crux of the matter seems to be in the UNHIBERNATE expression. In the documentation
> > (p 216) it states that the default value is MachineLastMatchTime =!= UNDEFINED although
> > I find that it is actually MY.MachineLastMatchTime =!= UNDEFINED. I've tried both and neither
> > seems to work as neither MachineLastMatchTime nor MY.MachineLastMatchTime seems
> > to be set. The manual says that
> >
> > "the special attribute MachineLastMatchTime is updated in the ClassAds of offline machines
> > when the job would have been matched to the machine if it had been online"
> >
> > but this doesn't seem to be happening. Using condor_q -ana reveals
> >
> > 019.009:  Run analysis summary.  Of 1 machines,
> >       0 are rejected by your job's requirements
> >       0 reject your job because of their own requirements
> >       0 match but are serving users with a better priority in the pool
> >       0 match but reject the job for unknown reasons
> >       0 match but will not currently preempt their existing job
> >       1 match but are currently offline
> >       0 are available to run your job
> >
> > so the matchmaking is definitely working - it just seems that the machine ClassAd isn't
> > updated. If I set MachineLastMatchTime to some arbitrary value myself then
> >
> > ROOSTER_UNHIBERNATE=Offline && Unhibernate
> >
> > seems to evaluate to TRUE and the wake up kicks in.
> >
> > I've tried D_FULLDEBUG but I still can't track down where the problem is.
> >
> > Any ideas ?
> >
> > regards,
> >
> > -ian.
> >
> >
> > --------------------------------------------
> > Dr Ian C. Smith,
> > e-Science Team,
> > The University of Liverpool,
> > Computing Services Department
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
> >