
[HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1



Hello,

I need some help getting a new HTCondor-CE instance to work.
So far I have a working test cluster with HTCondor-CE 3.1.0-1.el7 / condor 8.6.13.

I'm working to set up a second instance with the latest stable releases:

- ce02-htc.cr.cnaf.infn.it:9619, HTCondor-CE 3.2.1-1.el7 / condor 8.6.13
- htc-2.cr.cnaf.infn.it, Central Manager / Collector 8.6.13

However, submitted jobs (condor_ce_trace from a user interface) end up held:

[root@ce02-htc condor]# condor_ce_q

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416> @ 03/08/19 17:58:51
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
dteam039 ID: 26       3/7  17:35      _      _      _      1      1 26.0
dteam039 ID: 27       3/7  17:42      _      _      _      1      1 27.0
dteam039 ID: 28       3/8  13:15      _      _      _      1      1 28.0
dteam039 ID: 31       3/8  17:21      _      _      _      1      1 31.0


The job router entry looks like it should match these jobs:

[root@ce02-htc condor]# condor_ce_config_val -dump JOB_ROUTER_ENTRIES
# Configuration from machine: ce02-htc.cr.cnaf.infn.it

# Parameters with names that match JOB_ROUTER_ENTRIES:
JOB_ROUTER_ENTRIES = [
name = "condor_pool_dteam";
TargetUniverse = 5;
Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX");
MaxJobs = 100;
MaxIdleJobs = 100;
]
[SNIP]
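
As a quick sanity check of the Requirements expression above, here is a minimal Python sketch. It is not HTCondor code: route_matches is a hypothetical helper, and it assumes that the ClassAd regexp() function performs an unanchored match (mimicked here with re.search) and that a job ad without x509UserProxyVoName never matches the route.

```python
import re

# Hypothetical stand-in for the ClassAd expression
#   regexp("dteam", TARGET.x509UserProxyVoName)
# Assumption: the ClassAd match is unanchored, like re.search.
def route_matches(vo_name, pattern="dteam"):
    """Return True if the VO name would satisfy the route's regexp."""
    if vo_name is None:
        # Assumption: a missing attribute means the route does not match.
        return False
    return re.search(pattern, vo_name) is not None

for vo in ("dteam", "atlas", None):
    print(vo, route_matches(vo))
```

Under these assumptions a job whose proxy carries the dteam VO satisfies the route, so the Requirements line by itself would not explain the hold.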

However:
[root@ce02-htc condor]# condor_ce_q -analyze 31

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416>

031.000:  Job is held.

Hold reason: HTCondor-CE held job due to no matching routes, route job limit, or route failure threshold; see 'HTCondor-CE Troubleshooting Guide'

Looking into the condor-ce logs I see these errors:

JobRouterLog:

03/08/19 18:17:32 (D_ALWAYS) SECMAN: required authentication with collector at <131.154.195.32:9618> failed, so aborting command QUERY_SCHEDD_ADS.
03/08/19 18:17:32 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS
03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618) Can't find address of schedd


CollectorLog:

03/08/19 18:11:34 (D_ALWAYS:2) Trying to update collector <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Attempting to send update via TCP to collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Sent ad to 1 collectors for dteam039@htc_tier1 Hit=4 Tot=4 Idle=0 Run=0
03/08/19 18:11:34 (D_ALWAYS:2) ============ Begin clean_shadow_recs =============
03/08/19 18:11:34 (D_ALWAYS:2) ============ End clean_shadow_recs =============
03/08/19 18:11:34 (D_ALWAYS:2) Job 32.0 held for spooling. Checking how long...
03/08/19 18:11:34 (D_ALWAYS:2) Attribute StageInStart not set in 32.0. Set it.
03/08/19 18:11:34 (D_ALWAYS:2) Sending RESCHEDULE command to negotiator(s)
03/08/19 18:11:34 (D_ALWAYS:2) Will use TCP to update collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Trying to query collector <131.154.192.41:9619>
03/08/19 18:11:35 (D_ALWAYS) Can't find address for negotiator
03/08/19 18:11:35 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to unknown daemon:
03/08/19 18:11:35 (cid:19) (D_AUDIT) Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<131.154.192.239:24028>
03/08/19 18:11:35 (cid:19) (D_AUDIT) AuthMethod=GSI, AuthId=/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Stefano Dal Pra,/dteam/Role=NULL/Capability=NULL, CondorId=dteam039@htc_tier1
03/08/19 18:11:35 (D_ALWAYS:2) spoolJobFiles(): read JobAdsArrayLen - 1
[...]
03/08/19 18:11:40 (D_ALWAYS:2) Sent ad to 1 collectors for dteam039@htc_tier1 Hit=4 Tot=4 Idle=1 Run=0
03/08/19 18:11:40 (D_ALWAYS:2) ============ Begin clean_shadow_recs =============
03/08/19 18:11:40 (D_ALWAYS:2) ============ End clean_shadow_recs =============
03/08/19 18:11:40 (D_ALWAYS:2) Sending RESCHEDULE command to negotiator(s)
03/08/19 18:11:40 (D_ALWAYS:2) Will use TCP to update collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:40 (D_ALWAYS:2) Trying to query collector <131.154.192.41:9619>
03/08/19 18:11:40 (D_ALWAYS) Can't find address for negotiator
03/08/19 18:11:40 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to unknown daemon:
03/08/19 18:11:40 (D_ALWAYS:2) ForkWorker::Fork: New child of 1132279 = 1132521

###########

CollectorLog (in the Central Manager/Collector, ce02-htc)

03/08/19 21:34:50 DC_AUTHENTICATE: required authentication of 131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using PASSWORD|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)
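
To make the two authentication errors easier to compare, here is a minimal Python sketch that splits the "|"-separated error stacks quoted in the logs into their entries and lists which methods each side reports as attempted. The helper names parse_error_stack and methods_tried are hypothetical and not part of HTCondor; the "SUBSYS:code:message" layout is assumed from the log lines in this message only.

```python
def parse_error_stack(stack):
    """Split 'SUBSYS:code:message|SUBSYS:code:message' into tuples."""
    entries = []
    for part in stack.split("|"):
        # Split on the first two colons only; the message may contain more.
        subsys, code, message = part.split(":", 2)
        entries.append((subsys, int(code), message.strip()))
    return entries

def methods_tried(stack):
    """Return the authentication methods named in the AUTHENTICATE entries."""
    prefix = "Failed to authenticate using "
    return [msg[len(prefix):]
            for subsys, _code, msg in parse_error_stack(stack)
            if subsys == "AUTHENTICATE" and msg.startswith(prefix)]

# Error stacks copied from the JobRouterLog and the collector's log above.
router_side = ("AUTHENTICATE:1003:Failed to authenticate with any method|"
               "AUTHENTICATE:1004:Failed to authenticate using FS")
collector_side = ("AUTHENTICATE:1003:Failed to authenticate with any method|"
                  "AUTHENTICATE:1004:Failed to authenticate using PASSWORD|"
                  "AUTHENTICATE:1004:Failed to authenticate using FS|"
                  "FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)")

print(methods_tried(router_side))
print(methods_tried(collector_side))
```

Read this way, the job router side reports only an FS attempt, while the collector side reports PASSWORD and FS attempts; GSI does not appear in either stack.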

##############
I've been comparing the configuration with that of the working HTCondor-CE (ce01-htc.cr.cnaf.infn.it), but I haven't found a solution.

These are the SEC_* settings:
[root@ce02-htc ~]# condor_ce_config_val -dump SEC_ | egrep -v '^#'

CEVIEW.SEC_CLIENT_AUTHENTICATION_METHODS = FS
CEVIEW.SEC_CLIENT_NEGOTIATION = PREFERRED
MASTER.SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,GSI
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,GSI
SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
SEC_CLIENT_ENCRYPTION = OPTIONAL
SEC_CLIENT_INTEGRITY = OPTIONAL
SEC_CREDENTIAL_REFRESH_INTERVAL = -1
SEC_DEBUG_PRINT_KEYS = false
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_INVALIDATE_SESSIONS_VIA_TCP = true
SEC_PASSWORD_DOMAIN =
SEC_PASSWORD_FILE =
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_READ_ENCRYPTION = OPTIONAL
SEC_READ_INTEGRITY = OPTIONAL
SEC_SESSION_DURATION_SLOP = 20
SEC_TCP_SESSION_TIMEOUT = 20


I tried adding password authentication:
[root@ce02-htc ~]# grep SEC_CLIENT_AUTHENTICATION_METHODS /etc/condor-ce/config.d/01-common-auth.conf
SEC_CLIENT_AUTHENTICATION_METHODS=FS,GSI,PASSWORD

but then it seems to be overridden:
[root@ce02-htc ~]# condor_ce_config_val -v SEC_CLIENT_AUTHENTICATION_METHODS
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
 # at: <Environment>
 # raw: SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS

by the condor_ce_* wrapper commands.
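
The "at: <Environment>" line indicates the value came from an environment variable rather than a config file. HTCondor reads configuration overrides from environment variables whose names start with _CONDOR_ (case-insensitive). A minimal Python sketch for listing such variables follows; condor_env_overrides and the sample environment are hypothetical, and whether the condor_ce_* wrappers export exactly this variable is an assumption, not something confirmed here.

```python
def condor_env_overrides(environ):
    """Return {PARAM: value} for variables that use the _CONDOR_ prefix."""
    prefix = "_condor_"
    overrides = {}
    for name, value in environ.items():
        # The prefix is matched case-insensitively.
        if name.lower().startswith(prefix):
            overrides[name[len(prefix):]] = value
    return overrides

# Hypothetical environment, for illustration only.
sample_env = {
    "PATH": "/usr/bin",
    "_condor_SEC_CLIENT_AUTHENTICATION_METHODS": "GSI,FS",
}
print(condor_env_overrides(sample_env))
```

Calling condor_env_overrides(os.environ) inside the process of interest would show the real overrides. A variable set by a wrapper script only exists in the wrapper's own process and its children, so it would not show up in the login shell.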


A few things puzzle me:

[root@ce02-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
 # at: <Default>
 # raw: COLLECTOR_PORT = 9618

[root@ce02-htc ~]# condor_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
 # at: <Default>
 # raw: COLLECTOR_PORT = 9618


But the other machine has:

[root@ce01-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9619
 # at: /usr/share/condor-ce/config.d/01-common-collector-defaults.conf, line 11
 # raw: COLLECTOR_PORT = 9619

[root@ce01-htc ~]# condor_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
 # at: <Default>
 # raw: COLLECTOR_PORT = 9618
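
To compare the two hosts mechanically, here is a minimal Python sketch that parses the verbose condor_ce_config_val output quoted above into value and source. parse_config_val_verbose is a hypothetical helper, and the three-line layout (value, "# at:", "# raw:") is assumed from the output shown in this message only.

```python
def parse_config_val_verbose(text):
    """Parse `condor_config_val -v NAME` style output into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# at:"):
            result["source"] = line[len("# at:"):].strip()
        elif line.startswith("# raw:"):
            result["raw"] = line[len("# raw:"):].strip()
        elif "=" in line and not line.startswith("#"):
            name, value = line.split("=", 1)
            result["name"] = name.strip()
            result["value"] = value.strip()
    return result

# Output of condor_ce_config_val -v COLLECTOR_PORT, copied from above.
ce02 = parse_config_val_verbose(
    "COLLECTOR_PORT = 9618\n # at: <Default>\n # raw: COLLECTOR_PORT = 9618")
ce01 = parse_config_val_verbose(
    "COLLECTOR_PORT = 9619\n"
    " # at: /usr/share/condor-ce/config.d/01-common-collector-defaults.conf, line 11\n"
    " # raw: COLLECTOR_PORT = 9619")

for host, cfg in (("ce02-htc", ce02), ("ce01-htc", ce01)):
    print(host, cfg["value"], cfg["source"])
```

On ce01-htc the CE value comes from a file under /usr/share/condor-ce/config.d, while on ce02-htc it falls back to the built-in default, which suggests checking whether that defaults file is present and read on ce02-htc.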


The "older" CE has a blah rpm:
[root@ce01-htc ~]# rpm -qa | grep blah
condor-classads-blah-patch-0.0.1-1.el7.centos.x86_64
blahp-1.18.35.bosco-1.osg34.el7.x86_64

But I have not found a blahp rpm in the repo for 8.8.1.
How does a 3.2.1 HTCondor-CE work on top of a non-HTCondor batch system?


Thank you for any help,
Stefano