
Re: [Condor-users] condor_rooster failing to crow



Hi Ian,

Sorry for the late response.

It is expected that the 7.2.1 condor startd would advertise an unmatchable ClassAd when going offline (Requirements=False).

It is not expected that the 7.4.0 startd offline ad would be unmatchable. I have tested this functionality under Unix and have not found anything wrong with the offline ad. It is possible that something is different in a Windows environment. Do you happen to have a copy of the bad offline ad from before you edited it?

When an offline machine is matched, there should be a message such as the following in the negotiator log with D_FULLDEBUG:

02/02 16:55:58 Registering attempt to match offline machine slot2@xxxxxxxxxxxxxxxxxx by dan.

If nothing but offline slots are available for a job, it will still be reported as "rejected" by the matchmaker with "no match found". However, the offline ad in the collector should have a new attribute MachineLastMatchTime indicating that the negotiator would have matched a job to the machine.
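As a rough aid for checking this, here is a minimal Python sketch that scans negotiator log lines for the message quoted above and reports which offline machines the negotiator tried to match. The function name offline_match_attempts and the regular expression are illustrative assumptions based only on that one sample line; the exact log format may differ between versions.

```python
import re

# Pattern derived from the single sample log line quoted above (an assumption,
# not an official format description).
OFFLINE_MATCH = re.compile(
    r"Registering attempt to match offline machine (?P<machine>\S+) by (?P<user>\S+?)\.?$"
)


def offline_match_attempts(lines):
    """Return a list of (machine, user) pairs, one per offline-match log line."""
    hits = []
    for line in lines:
        m = OFFLINE_MATCH.search(line.strip())
        if m:
            hits.append((m.group("machine"), m.group("user")))
    return hits


if __name__ == "__main__":
    sample = [
        "02/02 16:55:58 Registering attempt to match offline machine "
        "slot2@xxxxxxxxxxxxxxxxxx by dan.",
    ]
    for machine, user in offline_match_attempts(sample):
        print(machine, user)
```

An empty result with D_FULLDEBUG enabled would suggest the negotiator never considered the offline ad a match, which points at the ad itself rather than at condor_rooster.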

Thanks,
--Dan

Smith, Ian wrote:
In the interests of completeness I thought I'd better follow this up to say
that I think I may have found where the problem lies.

I've tried running Condor 7.4.0 on the central manager (Solaris 10) and
Condor 7.2.1 on the execute host (Win XP). When the PC goes into
hibernation a ClassAd is sent to the manager and recorded in the offline.log.
When I look at the negotiator log I see that there is no match with queued
jobs but - and this is a big BUT - it looks like Requirements is set to FALSE
(using condor_status -l). If I merge in another ClassAd (with MERGE_STARTD_AD)
to set Requirements to a more sensible value then the matchmaking works. I
found that I also had to set the Unhibernate expression explicitly in the
offline ClassAd (Unhibernate = LastMatchTime =!= UNDEFINED) and then bingo,
condor_rooster does in fact wake up the PC.

Moving to Condor 7.4.0 on the execute host threw up some more surprises.
I found that although the machine did not power down, a ClassAd was sent
out at the point when it should have powered down, and this was recorded in
the offline.log. I stopped Condor on the execute host and advertised the
offline ClassAd myself. Again the matchmaking failed and it was only after I
removed a lot of the attributes that it worked. I've listed a working ClassAd
below.

regards,

-ian.

Offline = TRUE
MyType = "Machine"
ClassAdLifetime = 2000000
TargetType = "Job"
Name = "ADMN10-84463C.livad.liv.ac.uk"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
Unhibernate = MachineLastMatchTime =!= UNDEFINED
MyCurrentTime = 1263223234
Machine = "ADMN10-84463C.livad.liv.ac.uk"
PublicNetworkIpAddr = "<138.253.103.228:3578>"
CSD_CONDOR_POOL = "TEST"
CSD_REVISION = "23_JULY_2008"
CSD_HAS_SSL = TRUE
COLLECTOR_HOST_STRING = "ulgp3.liv.ac.uk"
CondorVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT50 $"
SlotID = 1
VirtualMachineID = 1
VirtualMemory = 3151356
TotalDisk = 6430264
Disk = 6430264
CondorLoadAvg = 0.000000
LoadAvg = 0.060000
KeyboardIdle = 304
ConsoleIdle = 304
Memory = 2045
Cpus = 1
StartdIpAddr = "<138.253.103.228:3578>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "liv.ac.uk"
FileSystemDomain = "ADMN10-84463C.livad.liv.ac.uk"
HasIOProxy = TRUE
CheckpointPlatform = "WINNT51 INTEL Unknown normal"
WindowsMajorVersion = 5
WindowsMinorVersion = 1
WindowsBuildNumber = 2600
WindowsServicePackMajorVersion = 2
WindowsServicePackMinorVersion = 0
WindowsProductType = 1
TotalVirtualMemory = 3151356
TotalCpus = 1
TotalMemory = 2045
KFlops = 769806
Mips = 4037
LastBenchmark = 1263222930
TotalLoadAvg = 0.060000
TotalCondorLoadAvg = 0.000000
ClockMin = 920
ClockDay = 1
TotalSlots = 1
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasWindowsRunAsOwner = TRUE
HasVM = FALSE
HibernationLevel = 0
HibernationState = "NONE"
CanHibernate = TRUE
HardwareAddress = "00:1A:A0:BE:27:01"
IsWakeOnLanSupported = TRUE
IsWakeOnLanEnabled = TRUE
IsWakeAble = TRUE
WakeOnLanSupportedFlags = "Magic Packet"
WakeOnLanEnabledFlags = "Magic Packet"
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Unclaimed"
EnteredCurrentState = 1263222973
Activity = "Idle"
EnteredCurrentActivity = 1263222973
TotalTimeOwnerIdle = 10
TotalTimeUnclaimedIdle = 43
TotalTimeClaimedBusy = 261
Start = (Owner == "smithic")
Requirements = TRUE
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
LastFetchWorkSpawned = 0
LastFetchWorkCompleted = 0
NextFetchWorkDelay = -1
CurrentRank = 0.000000
MonitorSelfTime = 1263223170
MonitorSelfCPUUsage = 0.214469
MonitorSelfImageSize = 46532.000000
MonitorSelfResidentSetSize = 9976
MonitorSelfAge = 250
MonitorSelfRegisteredSocketCount = 2
MyAddress = "<138.253.103.228:3578>"
LastHeardFrom = 1263223234
StarterAbilityList = "HasFileTransfer,HasPerFileEncryption,HasReconnect,HasMPI,HasTDP,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasVM,HasWindowsRunAsOwner"





-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: 11 January 2010 17:02
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_rooster failing to crow

Ian,

Sorry to hear you are having difficulties.  If it is caused by a bug,
I'll have to eat crow.  Here are some things to help see where it might
be going wrong.

The setting of MachineLastMatchTime is initiated by the negotiator.
With D_FULLDEBUG turned on, you should see a line like the following in
your NegotiatorLog:

Registering attempt to match offline machine MACHINE by USER.

This results in a MERGE_STARTD_AD command being sent to the collector.
If you have D_COMMAND turned on in the collector, you should see that
command being received in CollectorLog.

After that command has been received, the machine ad should contain
MachineLastMatchTime.  You should be able to see that with condor_status
-long.

If something overwrites the offline machine ad, then
MachineLastMatchTime will go away until the next time the negotiator
sets it (i.e. the next negotiation cycle where a job matches the offline
machine).
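To make the last check concrete, here is a small Python sketch that parses text in the "Attribute = value" layout printed by condor_status -long and flags the two conditions discussed in this thread: a Requirements expression of FALSE and a missing MachineLastMatchTime. The helper names parse_long_ads and offline_ad_problems are hypothetical, and the parser is a simplification that keeps values as raw strings rather than evaluating ClassAd expressions.

```python
def parse_long_ads(text):
    """Split `condor_status -long` style text into a list of {attr: raw_value} dicts.

    Ads are assumed to be separated by blank lines, with one
    "Name = value" pair per line.
    """
    ads, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def offline_ad_problems(ad):
    """Return human-readable reasons why an offline ad may not wake its machine."""
    problems = []
    if ad.get("Offline", "").upper() != "TRUE":
        problems.append("ad is not marked Offline")
    if ad.get("Requirements", "").upper() == "FALSE":
        problems.append("Requirements is FALSE, so no job can match")
    if "MachineLastMatchTime" not in ad:
        problems.append("MachineLastMatchTime has not been set by the negotiator")
    return problems


if __name__ == "__main__":
    sample = 'Name = "pc1.example.org"\nOffline = TRUE\nRequirements = FALSE\n'
    for ad in parse_long_ads(sample):
        print(ad["Name"], offline_ad_problems(ad))
```

In practice the text would come from the output of condor_status -long for the offline machine, saved to a file or captured by a script.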

--Dan

Smith, Ian wrote:
Dear All,

I'm trying to use condor_rooster in Condor 7.4 to work with our Windows XP pool
but with only limited success. To keep compatibility with our current power saving
set up I'm trying to avoid using the Condor power saving and instead I'm publishing
the ClassAds of offline machines via a cron job so that condor_rooster can wake up
the relevant ones.

The crux of the matter seems to be in the UNHIBERNATE expression. In the
documentation (p 216) it states that the default value is
MachineLastMatchTime =!= UNDEFINED although I find that it is actually
MY.MachineLastMatchTime =!= UNDEFINED. I've tried both and neither seems to
work as neither MachineLastMatchTime nor MY.MachineLastMatchTime seems to be
set. The manual says that

"the special attribute MachineLastMatchTime is updated in the ClassAds of offline
machines when the job would have been matched to the machine if it had been online"

but this doesn't seem to be happening. Using condor_q -ana reveals

019.009:  Run analysis summary.  Of 1 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      1 match but are currently offline
      0 are available to run your job

so the matchmaking is definitely working - it just seems that the machine ClassAd
isn't updated. If I set MachineLastMatchTime to some arbitrary value myself then

ROOSTER_UNHIBERNATE=Offline && Unhibernate

seems to evaluate to TRUE and the wake up kicks in.
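The behaviour described here can be modelled with a short Python sketch. It is only an approximation of ClassAd evaluation, written under two assumptions: UNDEFINED is represented by None, and the =!= operator is treated as an "is not identical to" test. The function names are hypothetical and this is not how condor_rooster is implemented.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value (an assumption)


def unhibernate(ad):
    """Model of: Unhibernate = MachineLastMatchTime =!= UNDEFINED"""
    return ad.get("MachineLastMatchTime", UNDEFINED) is not UNDEFINED


def rooster_unhibernate(ad):
    """Model of: ROOSTER_UNHIBERNATE = Offline && Unhibernate"""
    return bool(ad.get("Offline", False)) and unhibernate(ad)


if __name__ == "__main__":
    never_matched = {"Offline": True}
    matched = {"Offline": True, "MachineLastMatchTime": 1263223234}
    print(rooster_unhibernate(never_matched))
    print(rooster_unhibernate(matched))
```

Under this model an offline ad only qualifies for wake-up once something has set MachineLastMatchTime, which is consistent with the wake-up kicking in after the attribute is set by hand.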

I've tried D_FULLDEBUG but I still can't track down where the problem is.

Any ideas ?

regards,

-ian.


--------------------------------------------
Dr Ian C. Smith,
e-Science Team,
The University of Liverpool,
Computing Services Departmen

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
