
Re: [HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1



Hello,

On 08/03/19 22:15, Brian Lin wrote:
Stefano,

On the CE host, is the local condor running and configured as a submit
host?
Yes, condor_submit from ce02-htc does work:
[sdalpra@ce02-htc htjobs]$ condor_submit test.sub
Submitting job(s).
1 job(s) submitted to cluster 5.
[sdalpra@ce02-htc htjobs]$ condor_history
 ID      OWNER          SUBMITTED   RUN_TIME     ST COMPLETED CMD
   5.0   sdalpra         3/8  22:18   0+00:00:01 C   3/8  22:18 /bin/hostname

Looking at ce02-htc:/var/log/condor/SchedLog,
there is the log entry for the local submission:
03/08/19 22:18:45 (pid:1097660) Started shadow for job 5.0 on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <131.154.194.216:9618?addrs=131.154.194.216-9618&noUDP&sock=21537_2b4a_3> for sdalpra, (shadow pid = 1178025)

However, after a grid submission to the CE I see nothing related in /var/log/condor/SchedLog.

  This error in the JobRouterLog leads me to believe that there's a
communication error between the CE job router and the local HTCondor's
schedd:

03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
Can't find address of schedd

[root@ce02-htc ~]# condor_ce_config_val -dump JOB_ROUTER_SCHEDD
# Configuration from machine: ce02-htc.cr.cnaf.infn.it

# Parameters with names that match JOB_ROUTER_SCHEDD:
JOB_ROUTER_SCHEDD2_NAME = ce02-htc.cr.cnaf.infn.it
JOB_ROUTER_SCHEDD2_POOL = htc-2.cr.cnaf.infn.it:9618
JOB_ROUTER_SCHEDD2_SPOOL = /var/lib/condor/spool

You may find relevant errors in /var/log/condor/SchedLog if there are
incompatibilities in the SEC_ configuration between HTCondor-CE and the
local HTCondor.

The SEC_ settings seem to be identical to those of the working CE.

I also tried to "import" the CE configuration files from the older one, adjusting only the hostnames,
but still no luck.

Stefano


We don't set COLLECTOR_PORT explicitly but instead set COLLECTOR_HOST
(https://github.com/opensciencegrid/htcondor-ce/blob/master/config/condor_config#L13-L15)
so I believe that's fine.

If you're getting HTCondor from the CHTC repositories, the blahp is
built-in. It's curious that you have the blahp RPM on your "old CE" but
you shouldn't need it.

- Brian

On 3/8/19 2:43 PM, Stefano Dal Pra wrote:
Hello,

I need some help getting a new HTCondor-CE instance working.
So far I have a working test cluster with HTCondor-CE 3.1.0-1.el7 /
condor 8.6.13.

I'm working to set up a second instance with the latest stable releases:

- ce02-htc.cr.cnaf.infn.it:9619, HTCondor-CE 3.2.1-1.el7 / condor 8.6.13
- htc-2.cr.cnaf.infn.it,         Central Manager / Collector 8.6.13

However, submitted jobs (condor_ce_trace from a user interface) are
put on hold:

[root@ce02-htc condor]# condor_ce_q

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416> @ 03/08/19 17:58:51
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN   IDLE   HOLD  TOTAL JOB_IDS
dteam039 ID: 26        3/7  17:35     _     _      _      1      1 26.0
dteam039 ID: 27        3/7  17:42     _     _      _      1      1 27.0
dteam039 ID: 28        3/8  13:15     _     _      _      1      1 28.0
dteam039 ID: 31        3/8  17:21     _     _      _      1      1 31.0


The match against the job router entries looks like it should be fine:

[root@ce02-htc condor]# condor_ce_config_val -dump JOB_ROUTER_ENTRIES
# Configuration from machine: ce02-htc.cr.cnaf.infn.it

# Parameters with names that match JOB_ROUTER_ENTRIES:
JOB_ROUTER_ENTRIES = [
name = "condor_pool_dteam";
TargetUniverse = 5;
Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX");
MaxJobs = 100;
MaxIdleJobs = 100;
]
[SNIP]

However:
[root@ce02-htc condor]# condor_ce_q -analyze 31

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416>

031.000:  Job is held.

Hold reason: HTCondor-CE held job due to no matching routes, route job
limit, or route failure threshold; see 'HTCondor-CE Troubleshooting
Guide'

Looking into the condor-ce logs I see these errors:

JobRouterLog:

03/08/19 18:17:32 (D_ALWAYS) SECMAN: required authentication with
collector at <131.154.195.32:9618> failed, so aborting command
QUERY_SCHEDD_ADS.
03/08/19 18:17:32 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to
authenticate with any method|AUTHENTICATE:1004:Failed to authenticate
using FS
03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
Can't find address of schedd
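
The error stack in the JobRouterLog above packs several entries into one line, separated by "|", each in the form SUBSYSTEM:CODE:message. As a reading aid, here is a minimal Python sketch that splits such a line and lists the authentication methods it reports as failed. The function names are illustrative only and are not part of HTCondor; the sketch assumes the wrapped log line has been joined back into a single line first.

```python
import re

# One entry of the error stack as it appears in the logs: SUBSYSTEM:CODE:message
ENTRY = re.compile(r"([A-Z_]+):(\d+):([^|]*)")


def parse_error_stack(line):
    """Return (subsystem, code, message) tuples found in a log line."""
    return [(sub, int(code), msg.strip()) for sub, code, msg in ENTRY.findall(line)]


def failed_methods(line):
    """List the methods named in 'Failed to authenticate using X' entries."""
    prefix = "Failed to authenticate using "
    return [msg[len(prefix):] for _, _, msg in parse_error_stack(line)
            if msg.startswith(prefix)]


# The JobRouterLog line quoted above, joined into a single string.
line = ("03/08/19 18:17:32 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to "
        "authenticate with any method|AUTHENTICATE:1004:Failed to authenticate "
        "using FS")

for entry in parse_error_stack(line):
    print(entry)
print(failed_methods(line))
```

For this line the only method reported as attempted and failed is FS, which suggests GSI was never tried against the pool collector.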


CollectorLog:

03/08/19 18:11:34 (D_ALWAYS:2) Trying to update collector
<131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Attempting to send update via TCP to
collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Sent ad to 1 collectors for
dteam039@htc_tier1 Hit=4 Tot=4 Idle=0 Run=0
03/08/19 18:11:34 (D_ALWAYS:2) ============ Begin clean_shadow_recs
=============
03/08/19 18:11:34 (D_ALWAYS:2) ============ End clean_shadow_recs
=============
03/08/19 18:11:34 (D_ALWAYS:2) Job 32.0 held for spooling. Checking
how long...
03/08/19 18:11:34 (D_ALWAYS:2) Attribute StageInStart not set in 32.0.
Set it.
03/08/19 18:11:34 (D_ALWAYS:2) Sending RESCHEDULE command to
negotiator(s)
03/08/19 18:11:34 (D_ALWAYS:2) Will use TCP to update collector
ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:34 (D_ALWAYS:2) Trying to query collector
<131.154.192.41:9619>
03/08/19 18:11:35 (D_ALWAYS) Can't find address for negotiator
03/08/19 18:11:35 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to
unknown daemon:
03/08/19 18:11:35 (cid:19) (D_AUDIT)
Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<131.154.192.239:24028>
03/08/19 18:11:35 (cid:19) (D_AUDIT) AuthMethod=GSI,
AuthId=/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Stefano Dal
Pra,/dteam/Role=NULL/Capability=NULL, CondorId=dteam039@htc_tier1
03/08/19 18:11:35 (D_ALWAYS:2) spoolJobFiles(): read JobAdsArrayLen - 1
[...]
03/08/19 18:11:40 (D_ALWAYS:2) Sent ad to 1 collectors for
dteam039@htc_tier1 Hit=4 Tot=4 Idle=1 Run=0
03/08/19 18:11:40 (D_ALWAYS:2) ============ Begin clean_shadow_recs
=============
03/08/19 18:11:40 (D_ALWAYS:2) ============ End clean_shadow_recs
=============
03/08/19 18:11:40 (D_ALWAYS:2) Sending RESCHEDULE command to
negotiator(s)
03/08/19 18:11:40 (D_ALWAYS:2) Will use TCP to update collector
ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
03/08/19 18:11:40 (D_ALWAYS:2) Trying to query collector
<131.154.192.41:9619>
03/08/19 18:11:40 (D_ALWAYS) Can't find address for negotiator
03/08/19 18:11:40 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to
unknown daemon:
03/08/19 18:11:40 (D_ALWAYS:2) ForkWorker::Fork: New child of 1132279
= 1132521

###########

CollectorLog (on the Central Manager/Collector, htc-2)

03/08/19 21:34:50 DC_AUTHENTICATE: required authentication of
131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with any
method|AUTHENTICATE:1004:Failed to authenticate using
PASSWORD|AUTHENTICATE:1004:Failed to authenticate using
FS|FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)
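
One way to reason about the failure above is to look at which authentication methods the two sides have in common. The following Python sketch is a deliberately simplified model, not HTCondor's actual negotiation logic: it assumes only methods listed by both the client and the server can be attempted. The server-side value is hypothetical, since the central manager's configuration is not shown in this thread.

```python
def methods(value):
    """Split a comma-separated *_AUTHENTICATION_METHODS value into a list."""
    return [m.strip().upper() for m in value.split(",") if m.strip()]


def common_methods(client_value, server_value):
    """Methods offered by the client that the server also lists, in client order."""
    server = set(methods(server_value))
    return [m for m in methods(client_value) if m in server]


client = "GSI,FS"        # SEC_CLIENT_AUTHENTICATION_METHODS as reported on the CE
server = "FS, PASSWORD"  # hypothetical value for the central manager (assumption)

print(common_methods(client, server))
```

Under these assumed values the only shared method is FS. FS authentication relies on the client creating a file under /tmp that the server then inspects, so it generally works only when both sides run on the same host; that would be consistent with the "Unable to lstat(/tmp/FS_XXX49WRX5)" entry in the collector log.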

##############
I've been comparing the configuration with that of the working
HTCondor-CE (ce01-htc.cr.cnaf.infn.it), but I haven't found a solution.

These are the SEC_* settings:
[root@ce02-htc ~]# condor_ce_config_val -dump SEC_ | egrep -v '^#'

CEVIEW.SEC_CLIENT_AUTHENTICATION_METHODS = FS
CEVIEW.SEC_CLIENT_NEGOTIATION = PREFERRED
MASTER.SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,GSI
SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,GSI
SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
SEC_CLIENT_ENCRYPTION = OPTIONAL
SEC_CLIENT_INTEGRITY = OPTIONAL
SEC_CREDENTIAL_REFRESH_INTERVAL = -1
SEC_DEBUG_PRINT_KEYS = false
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = REQUIRED
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_INVALIDATE_SESSIONS_VIA_TCP = true
SEC_PASSWORD_DOMAIN =
SEC_PASSWORD_FILE =
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_READ_ENCRYPTION = OPTIONAL
SEC_READ_INTEGRITY = OPTIONAL
SEC_SESSION_DURATION_SLOP = 20
SEC_TCP_SESSION_TIMEOUT = 20
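
Comparing two such dumps by eye is error-prone. A minimal Python sketch for diffing them follows, assuming each dump has been saved as text in the "NAME = value" form shown above. The helper names are illustrative, and the two sample strings are hypothetical excerpts, not the real ce01-htc and ce02-htc dumps.

```python
def parse_dump(text):
    """Parse 'NAME = value' lines from a config dump, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def diff_settings(left, right):
    """Return {name: (left_value, right_value)} for every setting that differs.

    A setting missing on one side is reported with None on that side.
    """
    return {name: (left.get(name), right.get(name))
            for name in sorted(set(left) | set(right))
            if left.get(name) != right.get(name)}


# Hypothetical excerpts for illustration only.
working = """
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_PASSWORD_FILE =
"""
broken = """
# Parameters with names that match SEC_:
SEC_CLIENT_AUTHENTICATION_METHODS = FS,GSI,PASSWORD
SEC_DEFAULT_AUTHENTICATION = REQUIRED
"""

for name, (a, b) in diff_settings(parse_dump(working), parse_dump(broken)).items():
    print(f"{name}: {a!r} -> {b!r}")
```

In practice the two inputs would be the saved output of "condor_ce_config_val -dump SEC_" from each CE; running the same comparison on the "condor_config_val -dump SEC_" output of the CE and the central manager may also help spot mismatches.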


I tried adding password authentication:
[root@ce02-htc ~]# grep SEC_CLIENT_AUTHENTICATION_METHODS /etc/condor-ce/config.d/01-common-auth.conf
SEC_CLIENT_AUTHENTICATION_METHODS=FS,GSI,PASSWORD

but it then seems to be overridden by the condor_ce_* wrapper commands
(note the <Environment> source):
[root@ce02-htc ~]# condor_ce_config_val -v SEC_CLIENT_AUTHENTICATION_METHODS
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
  # at: <Environment>
  # raw: SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS


A few things puzzle me:

[root@ce02-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
  # at: <Default>
  # raw: COLLECTOR_PORT = 9618

[root@ce02-htc ~]# condor_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
  # at: <Default>
  # raw: COLLECTOR_PORT = 9618


But the other machine has:

[root@ce01-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9619
  # at: /usr/share/condor-ce/config.d/01-common-collector-defaults.conf, line 11
  # raw: COLLECTOR_PORT = 9619

[root@ce01-htc ~]# condor_config_val -v COLLECTOR_PORT
COLLECTOR_PORT = 9618
  # at: <Default>
  # raw: COLLECTOR_PORT = 9618


The "older" CE has a blahp RPM installed:
[root@ce01-htc ~]# rpm -qa | grep blah
condor-classads-blah-patch-0.0.1-1.el7.centos.x86_64
blahp-1.18.35.bosco-1.osg34.el7.x86_64

But I have not found a blahp RPM in the repo for 8.8.1.
How does a 3.2.1 HTCondor-CE work on top of a non-HTCondor batch system?


Thank you for any help,
Stefano
