
[Condor-users] strange condor_advertise behavior.

I have a simple shell script (attached) that forwards a classad from each of a number of clusters to a central collector/negotiator, which then does the matchmaking with Condor-G.

On the first 2 clusters I tried, it worked and I can see the classAd.

The script runs the command

condor_advertise -pool fermigrid1.fnal.gov UPDATE_STARTD_AD GCclassad.txt

and the contents of GCclassad.txt look like this:

MyType = "Machine"
Name = "fnpcg.fnal.gov:2119/jobmanager-condor"
gatekeeper_url = "fnpcg.fnal.gov:2119/jobmanager-condor"
TargetType = "Job"
Requirements = TRUE
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = TRUE
CurMatches = 0
UpdateSequenceNumber = 1129319101
gluehostapplicationsoftwareruntimeenvironment = "VO-atlas-release-9.0.3 VO-atlas
glueceinfohostname = "fnal.gov"
gluesubclustername = "fnal.gov"
gluecestatestatus = "Production"
gluecepolicymaxcputime = 2880
gluecepolicymaxwallclocktime = 2880
glueceaccesscontrolbaserule = "VO:*"
GlueCEStateTotalCPUs = 27
gluecestatefreecpus = 0
GlueCEStateRunningJobs = 0
GlueCEStateWaitingJobs = 0
gluecestateestimatedresponsetime = 0
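
For readers without the attachment, here is a minimal sketch of how a script could generate such a file and hand it to condor_advertise. It is not the attached runclassad.sh. The function names write_ad and advertise_ad are hypothetical, only a few of the attributes are written, and using the current epoch time for UpdateSequenceNumber is an assumption (it also relies on `date +%s`, which is common but not strictly POSIX).

```shell
#!/bin/sh
# Sketch only: write_ad and advertise_ad are hypothetical helper names,
# not taken from the attached script.

# write_ad FILE NAME : write a minimal machine-type ad to FILE.
write_ad() {
    ad_file=$1
    ce_name=$2
    # Assumption: the sequence number is the current epoch time in seconds.
    seq=$(date +%s)
    cat > "$ad_file" <<EOF
MyType = "Machine"
Name = "$ce_name"
gatekeeper_url = "$ce_name"
TargetType = "Job"
Requirements = TRUE
UpdateSequenceNumber = $seq
EOF
}

# advertise_ad POOL FILE : send the ad with the same command shown above,
# but only if condor_advertise is actually installed on this host.
advertise_ad() {
    if command -v condor_advertise >/dev/null 2>&1; then
        condor_advertise -pool "$1" UPDATE_STARTD_AD "$2"
    else
        echo "condor_advertise not found; skipping" >&2
        return 1
    fi
}

# Example:
#   write_ad GCclassad.txt "fnpcg.fnal.gov:2119/jobmanager-condor"
#   advertise_ad fermigrid1.fnal.gov GCclassad.txt
```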

So on the central collector/negotiator, condor_status looks like this:

fngp-osg.fnal [?????????] [????] [????????] [???] [??] [Unknown]
fnpcg.fnal.go [?????????] [????] [????????] [???] [??] [Unknown]
vm1@fermigrid LINUX INTEL Unclaimed Idle 0.000 997 0+00:01:51
vm2@fermigrid LINUX INTEL Unclaimed Idle 0.490 997 0+01:35:23
vm3@fermigrid LINUX INTEL Unclaimed Idle 0.000 997 0+01:35:14
vm4@fermigrid LINUX INTEL Unclaimed Idle 0.000 997 0+01:35:11

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX        4     0       0         4       0          0

               Total        4     0       0         4       0          0

                    (Omitted 2 malformed ads in computed attribute totals)


If I do the following:

MyAddress = "<>"
LastHeardFrom = 1129319400
UpdatesTotal = 4
UpdatesSequenced = 0
UpdatesLost = 0
UpdatesHistory = "0x0000000000000000000000000000000

I see that the two classads which the collector accepts have a
field called MyAddress appended to them, a field which is not in
the classad file.
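
A quick way to confirm that an attribute really is absent from the file before it is sent is a small grep check. This is only a diagnostic sketch; has_attr is a hypothetical helper name, and it simply looks for a line of the form `Attr = ...`, ignoring case since classad attribute names are case-insensitive.

```shell
#!/bin/sh
# has_attr FILE ATTR : succeed if FILE contains a line assigning ATTR.
# Hypothetical helper; matches "ATTR = ..." case-insensitively and does
# not match longer names that merely start with ATTR.
has_attr() {
    grep -Eiq "^[[:space:]]*$2[[:space:]]*=" "$1"
}

# Example:
#   has_attr GCclassad.txt MyAddress || echo "MyAddress is not in the file"
```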

There is a third node on which I am trying to run the same script.
I do not see this one show up in the collector.  Instead I see:

10/13 09:44:00 Got IP = '(null)'
10/13 09:44:00 No IP address in classAd
10/13 09:44:00 Error: Invalid StartAd
10/13 09:44:00 Could not make hashkey --- ignoring ad
10/13 09:44:00 Received malformed ad from command (0). Ignoring.

My guess from that is that the condor schedd on that node,
which runs an earlier version (6.7.6), is configured slightly differently
and for some reason is not including the MyAddress field in the classad.

Any idea what the magic configuration tweak is to make it include
MyAddress in the classad?  Thanks for any help.

Steve Timm

Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

Attachment: runclassad.sh
Description: Bourne shell script

Attachment: GCclassad.sh
Description: Bourne shell script