
Re: [Condor-users] Need help diagnosing flocking problems



Hi Ben,

Your negotiator log contains the following troubling stanza:

02/22/12 15:12:30     Sending SEND_JOB_INFO/eom
02/22/12 15:12:30 condor_write(): Socket closed when trying to write
13 bytes to schedd bcotton@xxxxxxxxxxxxxxx, fd is 7
02/22/12 15:12:30 Buf::write(): condor_write() failed
02/22/12 15:12:30     Failed to send SEND_JOB_INFO/eom
02/22/12 15:12:30   Error: Ignoring submitter for this cycle

To which the schedd replies:

02/22/12 15:12:30 Can't receive ClaimId from negotiator

By the time the negotiator and schedd have gotten that far into the conversation, they must already have exchanged information successfully in both directions and authorized each other. I can't think of anything in the Condor configuration that would cause things to go wrong at that point.
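
To see the two sides of that exchange as one sequence, the excerpts can be interleaved by timestamp. The following is a minimal sketch, not a Condor tool: the helper names `parse_log` and `merge_logs` are made up for illustration, and it assumes every log entry starts with the `MM/DD/YY HH:MM:SS` prefix seen in the excerpts above, with mail-wrapped continuation lines folded into the preceding entry. Timestamps have one-second resolution, so entries from different daemons within the same second cannot be ordered against each other.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted log excerpts (assumed format).
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*(.*)$")


def parse_log(text, source):
    """Return (timestamp, source, message) tuples for one log excerpt.

    Lines without a timestamp are treated as wrapped continuations
    and appended to the previous entry.
    """
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            ts = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            entries.append([ts, source, match.group(2).strip()])
        elif entries and raw.strip():
            entries[-1][2] += " " + raw.strip()
    return [tuple(entry) for entry in entries]


def merge_logs(*logs):
    """Interleave parsed logs by timestamp.

    sorted() is stable, so entries sharing a timestamp keep the order
    in which the logs were passed in; that order is NOT evidence of
    which daemon acted first within the same second.
    """
    merged = [entry for log in logs for entry in log]
    return sorted(merged, key=lambda entry: entry[0])


negotiator_excerpt = """\
02/22/12 15:12:30     Sending SEND_JOB_INFO/eom
02/22/12 15:12:30 condor_write(): Socket closed when trying to write
13 bytes to schedd bcotton@xxxxxxxxxxxxxxx, fd is 7
02/22/12 15:12:30 Buf::write(): condor_write() failed
02/22/12 15:12:30     Failed to send SEND_JOB_INFO/eom
02/22/12 15:12:30   Error: Ignoring submitter for this cycle
"""

schedd_excerpt = """\
02/22/12 15:12:30 Can't receive ClaimId from negotiator
"""

timeline = merge_logs(
    parse_log(negotiator_excerpt, "negotiator"),
    parse_log(schedd_excerpt, "schedd"),
)

for ts, source, message in timeline:
    print(f"{ts:%H:%M:%S} [{source:>10}] {message}")
```

Feeding it the full NegotiatorLog and SchedLog sections instead of the short excerpts would work the same way, as long as the timestamp prefix matches.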

What versions of Condor are involved?

--Dan

On 2/22/12 3:57 PM, Ben Cotton wrote:
We're trying to get Campus Grid Factory [0] going to spin up PBS jobs
to run Condor and we're having some issues getting jobs to flock from
our schedd to the CGF host. I've been tinkering with security settings
and still can't seem to get the jobs to go. I've included
security-related settings and log messages from the hosts in question
in case something stands out to anyone here on the list. The CGF host
can see the queue of the schedd host and the schedd host can see the
pool of the CGF host:

-bash-3.2$ condor_q -name carter-fe00 -pool carter-adm


-- Schedd: carter-fe00.rcac.purdue.edu :<128.211.148.45:33757>
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.1   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.2   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.3   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.4   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.5   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.6   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.7   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.8   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600
   1.9   bcotton         1/31 17:01   0+00:00:00 I  0   0.0  linux-sleep.sh 600

10 jobs; 10 idle, 0 running, 0 held
-bash-3.2$

[569 root@carter-fe00 /autohome/u100/bcotton/condor ]$ condor_status -pool steele-cgf

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

24062@steele-a100. LINUX      X86_64 Claimed   Busy     6.660  16046  0+00:00:02
30781@steele-a100. LINUX      X86_64 Unclaimed Idle     6.290  16046  0+00:03:58
20378@steele-a103. LINUX      X86_64 Unclaimed Idle     7.070  16046  0+00:05:04
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     3     0       1         2       0          0        0

               Total     3     0       1         2       0          0        0


Any brilliant ideas will be accepted.

[0] http://sourceforge.net/apps/trac/campusfactory


Thanks,
BC

#
# configuration from carter-fe00
#
[572 root@carter-fe00 /autohome/u100/bcotton/condor ]$ condor_config_val -dump | egrep 'ALLOW|DENY|SEC'
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = *.purdue.edu, *.cs.wisc.edu, *.nd.edu, 149.165.232.*,
149.165.233.*, 149.165.234.*, 149.165.237.*, 149.165.238.*,
149.165.239.*, *.indiana.edu, *.pnc.edu, *.nanohub.org, *.unl.edu,
cycleserver.cyclecomputing.com
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_WRITE = *
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
DENY_WRITE = $(VPN_SUBNETS), $(RESNET_SUBNETS), $(PAL_SUBNETS)
SEC_ADVERTISE_SCHEDD_AUTHENTICATION = OPTIONAL
SEC_ADVERTISE_SCHEDD_AUTHENTICATION_METHODS = GSI, PASSWORD
SEC_ADVERTISE_SCHEDD_INTEGRITY = OPTIONAL
SEC_ADVERTISE_STARTD_AUTHENTICATION = OPTIONAL
SEC_ADVERTISE_STARTD_AUTHENTICATION_METHODS = PASSWORD
SEC_ADVERTISE_STARTD_INTEGRITY = OPTIONAL
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI, CLAIMTOBE
SEC_DAEMON_AUTHENTICATION = OPTIONAL
SEC_DAEMON_AUTHENTICATION_METHODS = GSI, PASSWORD, CLAIMTOBE
SEC_DAEMON_ENCRYPTION = OPTIONAL
SEC_DAEMON_INTEGRITY = OPTIONAL
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_NEGOTIATOR_AUTHENTICATION = OPTIONAL
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_INTEGRITY = OPTIONAL
SEC_PASSWORD_FILE = $(LOCK)/pool_password
[573 root@carter-fe00 /autohome/u100/bcotton/condor ]$ condor_config_val FLOCK_TO
steele-cgf.rcac.purdue.edu


#
# SchedLog from carter-fe00
#
02/22/12 15:12:30
02/22/12 15:12:30 Entered negotiate
02/22/12 15:12:30 Using negotiation protocol: NEGOTIATE
02/22/12 15:12:30 *** SwapSpace = 2147483647
02/22/12 15:12:30 *** ReservedSwap = 0
02/22/12 15:12:30 *** Shadow Size Estimate = 1800
02/22/12 15:12:30 *** Start Limit For Swap = 1193046
02/22/12 15:12:30 Negotiating for owner: bcotton@xxxxxxxxxxxxxxx
(flock level 1, pool steele-cgf.rcac.purdue.edu)
02/22/12 15:12:30
AutoCluster:config(JobUniverse,LastCheckpointPlatform,NumCkpts,cgf,Steele)
invoked
02/22/12 15:12:30 AutoCluster:config() significant attributes unchanged
02/22/12 15:12:30 Reusing prioritized runnable job list because
nothing has changed.
02/22/12 15:12:30 Job 1.0: is runnable
02/22/12 15:12:30 Sent job 1.0 (autocluster=1) to the negotiator
02/22/12 15:12:30 Can't receive ClaimId from negotiator
02/22/12 15:12:30 Failed to send NEGOTIATE to<128.210.9.99:52571>:
02/22/12 15:12:52 Evaluated periodic expressions in 0.000s, scheduling
next run in 61s
02/22/12 15:13:11 AUTHENTICATE: no available authentication methods succeeded!
02/22/12 15:13:11 DC_SECURITY: authentication of 128.210.9.99 failed
but was not required, so continuing.
02/22/12 15:13:11 Received TCP command 1111 (QMGMT_READ_CMD) from
unauthenticated@unmapped<128.210.9.99:57802>, access level READ
02/22/12 15:13:11 QMGR forked query
02/22/12 15:13:11 QMGR forked query done
02/22/12 15:13:11 DaemonCore: No more children processes to reap.

#
# configuration from steele-cgf
#
-bash-3.2$ condor_config_val -dump | egrep 'ALLOW|DENY|SEC'
ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = $(ALLOW_WRITE)
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_WRITE = $(FLOCK_FROM), execute-side@matchsession,
$(INTERNAL_IPS), $(HOSTNAME), 128.211.148.45
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE,ANONYMOUS
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = OPTIONAL
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
-bash-3.2$ condor_config_val FLOCK_FROM
steele-cgf.rcac.purdue.edu, carter-fe00.rcac.purdue.edu
-bash-3.2$

#
# NegotiatorLog from steele-cgf
#
02/22/12 15:12:30 ---------- Started Negotiation Cycle ----------
02/22/12 15:12:30 Phase 1:  Obtaining ads from collector ...
02/22/12 15:12:30   Getting all public ads ...
02/22/12 15:12:30 Trying to query collector<128.210.9.99:9618>
02/22/12 15:12:30   Sorting 12 ads ...
02/22/12 15:12:30 Ignoring submitter
cgfsteel@xxxxxxxxxxxxxxxxxxxxxxxxxx with no requested jobs
02/22/12 15:12:30   Getting startd private ads ...
02/22/12 15:12:30 Trying to query collector<128.210.9.99:9618>
02/22/12 15:12:30 Got ads: 12 public and 3 private
02/22/12 15:12:30 Public ads include 1 submitter, 3 startd
02/22/12 15:12:30 Phase 2:  Performing accounting ...
02/22/12 15:12:30 Entering compute_significant_attrs()
02/22/12 15:12:30 Leaving compute_significant_attrs() -
result=JobUniverse,LastCheckpointPlatform,NumCkpts,cgf,Steele
02/22/12 15:12:30 Phase 3:  Sorting submitter ads by priority ...
02/22/12 15:12:30 Phase 4.1:  Negotiating with schedds ...
02/22/12 15:12:30     numSlots = 3
02/22/12 15:12:30     slotWeightTotal = 3.000000
02/22/12 15:12:30     pieLeft = 3.000
02/22/12 15:12:30     NormalFactor = 1.000000
02/22/12 15:12:30     MaxPrioValue = 0.734205
02/22/12 15:12:30     NumSubmitterAds = 1
02/22/12 15:12:30   Negotiating with bcotton@xxxxxxxxxxxxxxx at
<128.211.148.45:33757>
02/22/12 15:12:30 0 seconds so far
02/22/12 15:12:30   Calculating submitter limit with the following parameters
02/22/12 15:12:30     SubmitterPrio       = 0.734205
02/22/12 15:12:30     SubmitterPrioFactor = 1.000000
02/22/12 15:12:30     submitterShare      = 1.000000
02/22/12 15:12:30     submitterAbsShare   = 1.000000
02/22/12 15:12:30     submitterLimit    = 3.000000
02/22/12 15:12:30     submitterUsage    = 0.000000
02/22/12 15:12:30 Socket to bcotton@xxxxxxxxxxxxxxx
(<128.211.148.45:33757>) not in cache, creating one
02/22/12 15:12:30 SocketCache:  Found unused slot 1
02/22/12 15:12:30     Sending SEND_JOB_INFO/eom
02/22/12 15:12:30     Getting reply from schedd ...
02/22/12 15:12:30     Got JOB_INFO command; getting classad/eom
02/22/12 15:12:30     Request 00001.00000:
02/22/12 15:12:30 matchmakingAlgorithm: limit 3.000000 used 0.000000
pieLeft 3.000000
02/22/12 15:12:30 Failed to evaluate NEGOTIATOR_POST_JOB_RANK
expression to a float.
02/22/12 15:12:30 Failed to evaluate NEGOTIATOR_POST_JOB_RANK
expression to a float.
02/22/12 15:12:30 Start of sorting MatchList (len=2)
02/22/12 15:12:30 Finished sorting MatchList
02/22/12 15:12:30       Sending PERMISSION, claim id, startdAd to schedd
02/22/12 15:12:30       Matched 1.0 bcotton@xxxxxxxxxxxxxxx
<128.211.148.45:33757>  preempting none
<172.18.24.110:47591?CCBID=128.210.9.99:9618#832>
30781@xxxxxxxxxxxxxxxxxxxxxxxxxxx
02/22/12 15:12:30       Notifying the accountant
02/22/12 15:12:30       Successfully matched with
30781@xxxxxxxxxxxxxxxxxxxxxxxxxxx
02/22/12 15:12:30     Sending SEND_JOB_INFO/eom
02/22/12 15:12:30 condor_write(): Socket closed when trying to write
13 bytes to schedd bcotton@xxxxxxxxxxxxxxx, fd is 7
02/22/12 15:12:30 Buf::write(): condor_write() failed
02/22/12 15:12:30     Failed to send SEND_JOB_INFO/eom
02/22/12 15:12:30   Error: Ignoring submitter for this cycle
02/22/12 15:12:30  resources used by bcotton@xxxxxxxxxxxxxxx are 1.000000
02/22/12 15:12:30  resources used scheddUsed= 1.000000
02/22/12 15:12:30  negotiateWithGroup resources used scheddAds length 0
02/22/12 15:12:30 ---------- Finished Negotiation Cycle ----------