
Re: [HTCondor-users] problems with htcondor-ce 3.2.1-1 + condor 8.8.1



That's curious. Do you see any errors in /var/log/condor/CollectorLog on
htc-2.cr.cnaf.infn.it? What does `condor_config_val COLLECTOR_HOST` return
on the CE? And what about `condor_status -schedd` on the central manager?

Thanks,
Brian

On 3/8/19 4:11 PM, Stefano Dal Pra wrote:
> Hello,
>
> On 08/03/19 22:15, Brian Lin wrote:
>> Stefano,
>>
>> On the CE host, is the local condor running and configured as a submit
>> host?
> Yes, condor_submit from ce02-htc does work:
> [sdalpra@ce02-htc htjobs]$ condor_submit test.sub
> Submitting job(s).
> 1 job(s) submitted to cluster 5.
> [sdalpra@ce02-htc htjobs]$ condor_history
>  ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED CMD
>    5.0   sdalpra         3/8  22:18   0+00:00:01 C   3/8  22:18 /bin/hostname
>
> Looking at ce02-htc:/var/log/condor/SchedLog, there is an entry for the
> local submission:
> 03/08/19 22:18:45 (pid:1097660) Started shadow for job 5.0 on 
> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
> <131.154.194.216:9618?addrs=131.154.194.216-9618&noUDP&sock=21537_2b4a_3> 
> for sdalpra, (shadow pid = 1178025)
>
> However, after a grid submission to the CE I see nothing related in
> /var/log/condor/SchedLog.
>
>> This error in the JobRouterLog leads me to believe that there's a
>> communication error between the CE job router and the local HTCondor's
>> schedd:
>>
>> 03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
>> Can't find address of schedd
>
> [root@ce02-htc ~]# condor_ce_config_val -dump JOB_ROUTER_SCHEDD
> # Configuration from machine: ce02-htc.cr.cnaf.infn.it
>
> # Parameters with names that match JOB_ROUTER_SCHEDD:
> JOB_ROUTER_SCHEDD2_NAME = ce02-htc.cr.cnaf.infn.it
> JOB_ROUTER_SCHEDD2_POOL = htc-2.cr.cnaf.infn.it:9618
> JOB_ROUTER_SCHEDD2_SPOOL = /var/lib/condor/spool
>
>> You may find relevant errors in /var/log/condor/SchedLog if there are
>> incompatibilities in the SEC_ configuration between HTCondor-CE and the
>> local HTCondor.
>
> The SEC_ settings seem to be identical to those of the working CE.
>
> I also tried to "import" the CE configuration files from the older one,
> adjusting only the hostnames, but still no luck.
>
> Stefano
>
>>
>> We don't set COLLECTOR_PORT explicitly but instead set COLLECTOR_HOST
>> (https://github.com/opensciencegrid/htcondor-ce/blob/master/config/condor_config#L13-L15),
>> so I believe that's fine.
>>
>> If you're getting HTCondor from the CHTC repositories, the blahp is
>> built-in. It's curious that you have the blahp RPM on your "old CE" but
>> you shouldn't need it.
>>
>> - Brian
>>
>> On 3/8/19 2:43 PM, Stefano Dal Pra wrote:
>>> Hello,
>>>
>>> I need some help getting a new HTCondor-CE instance working.
>>> So far I have a working test cluster with HTCondor-CE 3.1.0-1.el7 /
>>> condor 8.6.13.
>>>
>>> I'm working to set up a second instance with the latest stable releases:
>>>
>>> - ce02-htc.cr.cnaf.infn.it:9619, HTCondor-CE 3.2.1-1.el7 / condor 8.6.13
>>> - htc-2.cr.cnaf.infn.it, Central Manager / Collector 8.6.13
>>>
>>> However, submitted jobs (condor-ce-trace from a user interface) end up
>>> held:
>>>
>>> [root@ce02-htc condor]# condor_ce_q
>>>
>>> -- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416> @
>>> 03/08/19 17:58:51
>>> OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
>>> dteam039 ID: 26        3/7  17:35     _     _       _      1      1 26.0
>>> dteam039 ID: 27        3/7  17:42     _     _       _      1      1 27.0
>>> dteam039 ID: 28        3/8  13:15     _     _       _      1      1 28.0
>>> dteam039 ID: 31        3/8  17:21     _     _       _      1      1 31.0
>>>
>>>
>>> The match with the job router appears to be OK:
>>>
>>> [root@ce02-htc condor]# condor_ce_config_val -dump JOB_ROUTER_ENTRIES
>>> # Configuration from machine: ce02-htc.cr.cnaf.infn.it
>>>
>>> # Parameters with names that match JOB_ROUTER_ENTRIES:
>>> JOB_ROUTER_ENTRIES = [
>>> name = "condor_pool_dteam";
>>> TargetUniverse = 5;
>>> Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
>>> set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys ==
>>> "LINUX");
>>> MaxJobs = 100;
>>> MaxIdleJobs = 100;
>>> ]
>>> [SNIP]
>>>
>>> However:
>>> [root@ce02-htc condor]# condor_ce_q -analyze 31
>>>
>>> -- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:19416>
>>>
>>> 031.000:  Job is held.
>>>
>>> Hold reason: HTCondor-CE held job due to no matching routes, route job
>>> limit, or route failure threshold; see 'HTCondor-CE Troubleshooting
>>> Guide'
>>>
>>> Looking into the condor-ce logs I see these errors:
>>>
>>> JobRouterLog:
>>>
>>> 03/08/19 18:17:32 (D_ALWAYS) SECMAN: required authentication with
>>> collector at <131.154.195.32:9618> failed, so aborting command
>>> QUERY_SCHEDD_ADS.
>>> 03/08/19 18:17:32 (D_ALWAYS) ERROR: AUTHENTICATE:1003:Failed to
>>> authenticate with any method|AUTHENTICATE:1004:Failed to authenticate
>>> using FS
>>> 03/08/19 18:17:32 (D_ALWAYS) ERROR (pool htc-2.cr.cnaf.infn.it:9618)
>>> Can't find address of schedd
>>>
>>>
>>> CollectorLog:
>>>
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Trying to update collector
>>> <131.154.192.41:9619>
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Attempting to send update via TCP to
>>> collector ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Sent ad to 1 collectors for
>>> dteam039@htc_tier1 Hit=4 Tot=4 Idle=0 Run=0
>>> 03/08/19 18:11:34 (D_ALWAYS:2) ============ Begin clean_shadow_recs
>>> =============
>>> 03/08/19 18:11:34 (D_ALWAYS:2) ============ End clean_shadow_recs
>>> =============
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Job 32.0 held for spooling. Checking
>>> how long...
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Attribute StageInStart not set in 32.0.
>>> Set it.
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Sending RESCHEDULE command to
>>> negotiator(s)
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Will use TCP to update collector
>>> ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
>>> 03/08/19 18:11:34 (D_ALWAYS:2) Trying to query collector
>>> <131.154.192.41:9619>
>>> 03/08/19 18:11:35 (D_ALWAYS) Can't find address for negotiator
>>> 03/08/19 18:11:35 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to
>>> unknown daemon:
>>> 03/08/19 18:11:35 (cid:19) (D_AUDIT)
>>> Command=SPOOL_JOB_FILES_WITH_PERMS, peer=<131.154.192.239:24028>
>>> 03/08/19 18:11:35 (cid:19) (D_AUDIT) AuthMethod=GSI,
>>> AuthId=/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Stefano Dal
>>> Pra,/dteam/Role=NULL/Capability=NULL, CondorId=dteam039@htc_tier1
>>> 03/08/19 18:11:35 (D_ALWAYS:2) spoolJobFiles(): read JobAdsArrayLen - 1
>>> [...]
>>> 03/08/19 18:11:40 (D_ALWAYS:2) Sent ad to 1 collectors for
>>> dteam039@htc_tier1 Hit=4 Tot=4 Idle=1 Run=0
>>> 03/08/19 18:11:40 (D_ALWAYS:2) ============ Begin clean_shadow_recs
>>> =============
>>> 03/08/19 18:11:40 (D_ALWAYS:2) ============ End clean_shadow_recs
>>> =============
>>> 03/08/19 18:11:40 (D_ALWAYS:2) Sending RESCHEDULE command to
>>> negotiator(s)
>>> 03/08/19 18:11:40 (D_ALWAYS:2) Will use TCP to update collector
>>> ce02-htc.cr.cnaf.infn.it <131.154.192.41:9619>
>>> 03/08/19 18:11:40 (D_ALWAYS:2) Trying to query collector
>>> <131.154.192.41:9619>
>>> 03/08/19 18:11:40 (D_ALWAYS) Can't find address for negotiator
>>> 03/08/19 18:11:40 (D_ALWAYS|D_FAILURE) Failed to send RESCHEDULE to
>>> unknown daemon:
>>> 03/08/19 18:11:40 (D_ALWAYS:2) ForkWorker::Fork: New child of 1132279
>>> = 1132521
>>>
>>> ###########
>>>
>>> CollectorLog (in the Central Manager/Collector, ce02-htc)
>>>
>>> 03/08/19 21:34:50 DC_AUTHENTICATE: required authentication of
>>> 131.154.192.41 failed: AUTHENTICATE:1003:Failed to authenticate with
>>> any method|AUTHENTICATE:1004:Failed to authenticate using
>>> PASSWORD|AUTHENTICATE:1004:Failed to authenticate using
>>> FS|FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)
>>>
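The error stack in the collector log above is a single `|`-separated string, which makes it hard to see which methods were actually attempted. As a rough illustration (not part of the original exchange), the Python sketch below splits such a string into entries and lists the methods that failed. The names `AuthError`, `parse_error_stack` and `failed_methods` are hypothetical helpers, and the `SUBSYSTEM:CODE:message` layout is an assumption inferred only from the log lines quoted in this thread.

```python
from typing import List, NamedTuple


class AuthError(NamedTuple):
    """One entry of an error stack (hypothetical helper type)."""
    subsystem: str
    code: int
    message: str


def parse_error_stack(stack: str) -> List[AuthError]:
    """Split a '|'-separated error stack into structured entries.

    Assumes each entry looks like SUBSYSTEM:CODE:message, as in the
    log lines quoted in this thread; the message may contain ':' itself.
    """
    entries = []
    for part in stack.split("|"):
        subsystem, code, message = part.strip().split(":", 2)
        entries.append(AuthError(subsystem, int(code), message.strip()))
    return entries


def failed_methods(entries: List[AuthError]) -> List[str]:
    """Return the method names from 'Failed to authenticate using X' entries."""
    prefix = "Failed to authenticate using "
    return [e.message[len(prefix):] for e in entries if e.message.startswith(prefix)]


# Error stack copied from the central manager's CollectorLog quoted above.
stack = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using PASSWORD|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXX49WRX5)"
)
entries = parse_error_stack(stack)
print(failed_methods(entries))
```

Read this way, the collector side tried PASSWORD and FS and both failed, and the last entry gives the FS-specific cause (the `lstat` on a file under /tmp).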
>>> ##############
>>>
>>> I've been comparing the configuration with that of the working
>>> htcondor-ce (ce01-htc.cr.cnaf.infn.it), but I haven't found a solution.
>>>
>>> These are the SEC_* settings
>>> [root@ce02-htc ~]# condor_ce_config_val -dump SEC_ | egrep -v '^#'
>>>
>>> CEVIEW.SEC_CLIENT_AUTHENTICATION_METHODS = FS
>>> CEVIEW.SEC_CLIENT_NEGOTIATION = PREFERRED
>>> MASTER.SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
>>> SCHEDD.SEC_DAEMON_AUTHENTICATION_METHODS = FS,GSI
>>> SCHEDD.SEC_WRITE_AUTHENTICATION_METHODS = FS,GSI
>>> SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
>>> SEC_CLAIMTOBE_USER =
>>> SEC_CLIENT_AUTHENTICATION = OPTIONAL
>>> SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
>>> SEC_CLIENT_ENCRYPTION = OPTIONAL
>>> SEC_CLIENT_INTEGRITY = OPTIONAL
>>> SEC_CREDENTIAL_REFRESH_INTERVAL = -1
>>> SEC_DEBUG_PRINT_KEYS = false
>>> SEC_DEFAULT_AUTHENTICATION = REQUIRED
>>> SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
>>> SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
>>> SEC_DEFAULT_ENCRYPTION = OPTIONAL
>>> SEC_DEFAULT_INTEGRITY = REQUIRED
>>> SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
>>> SEC_INVALIDATE_SESSIONS_VIA_TCP = true
>>> SEC_PASSWORD_DOMAIN =
>>> SEC_PASSWORD_FILE =
>>> SEC_READ_AUTHENTICATION = OPTIONAL
>>> SEC_READ_ENCRYPTION = OPTIONAL
>>> SEC_READ_INTEGRITY = OPTIONAL
>>> SEC_SESSION_DURATION_SLOP = 20
>>> SEC_TCP_SESSION_TIMEOUT = 20
>>>
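Comparing two such dumps by eye is error-prone, so a small script can help. The Python sketch below parses `NAME = value` lines from dump-style text and reports the parameters that differ. `parse_config_dump` and `diff_settings` are hypothetical helper names, and the second dump (`other_dump`) is invented purely for illustration; it is not the real ce01-htc configuration, which does not appear in this thread.

```python
from typing import Dict, Optional, Tuple


def parse_config_dump(text: str) -> Dict[str, str]:
    """Parse 'NAME = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def diff_settings(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing parameter to (value_in_a, value_in_b).

    A parameter missing from one side is reported with None on that side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# A few lines taken from the ce02-htc dump quoted above.
ce02_dump = """\
# Configuration from machine: ce02-htc.cr.cnaf.infn.it
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_PASSWORD_FILE =
"""

# Hypothetical dump from another host, invented for this example only.
other_dump = """\
SEC_CLIENT_AUTHENTICATION_METHODS = FS,GSI,PASSWORD
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_PASSWORD_FILE =
"""

differences = diff_settings(parse_config_dump(ce02_dump), parse_config_dump(other_dump))
for name, (here, there) in differences.items():
    print(f"{name}: {here!r} != {there!r}")
```

In practice the two inputs would be the saved output of `condor_ce_config_val -dump SEC_` from each CE; values are compared as plain strings, so a different ordering of methods counts as a difference.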
>>>
>>> I tried adding PASSWORD authentication:
>>> [root@ce02-htc ~]# grep SEC_CLIENT_AUTHENTICATION_METHODS
>>> /etc/condor-ce/config.d/01-common-auth.conf
>>> SEC_CLIENT_AUTHENTICATION_METHODS=FS,GSI,PASSWORD
>>>
>>> but then it seems to be overridden:
>>> [root@ce02-htc ~]# condor_ce_config_val -v
>>> SEC_CLIENT_AUTHENTICATION_METHODS
>>> SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
>>>   # at: <Environment>
>>>   # raw: SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
>>>
>>> by the condor_ce_* wrapper commands.
>>>
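The `# at:` line in the verbose output is what shows where the effective value comes from. As a minimal sketch, the Python snippet below extracts the name, value and origin from `-v`-style output and flags values that come from the environment, since those win over anything written in a config.d file. `parse_verbose` and `set_by_environment` are hypothetical helper names, and the parser assumes the single-line `# at:` layout shown above.

```python
import re
from typing import Optional, Tuple


def parse_verbose(text: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (name, value, origin) from `condor_config_val -v`-style output.

    Assumes the layout quoted in this thread: a 'NAME = value' line,
    followed by '# at: <origin>' and '# raw: ...' comment lines.
    """
    name = value = origin = None
    for line in text.splitlines():
        line = line.strip()
        at_match = re.match(r"#\s*at:\s*(.*)", line)
        if at_match:
            origin = at_match.group(1).strip()
        elif name is None and not line.startswith("#") and "=" in line:
            left, _, right = line.partition("=")
            name, value = left.strip(), right.strip()
    return name, value, origin


def set_by_environment(origin: Optional[str]) -> bool:
    """True when the reported origin is the process environment."""
    return origin == "<Environment>"


# Output copied from the condor_ce_config_val -v call quoted above.
verbose_output = """\
SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
  # at: <Environment>
  # raw: SEC_CLIENT_AUTHENTICATION_METHODS = GSI,FS
"""

name, value, origin = parse_verbose(verbose_output)
if set_by_environment(origin):
    print(f"{name} = {value} comes from the environment, not from a config file")
```

For the output above this reports that the value originates from the environment, which matches the observation that the edit in 01-common-auth.conf has no visible effect when queried through the wrapper command.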
>>>
>>> A few things that puzzle me:
>>>
>>> [root@ce02-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
>>> COLLECTOR_PORT = 9618
>>>   # at: <Default>
>>>   # raw: COLLECTOR_PORT = 9618
>>>
>>> [root@ce02-htc ~]# condor_config_val -v COLLECTOR_PORT
>>> COLLECTOR_PORT = 9618
>>>   # at: <Default>
>>>   # raw: COLLECTOR_PORT = 9618
>>>
>>>
>>> But the other machine has:
>>>
>>> [root@ce01-htc ~]# condor_ce_config_val -v COLLECTOR_PORT
>>> COLLECTOR_PORT = 9619
>>>   # at: /usr/share/condor-ce/config.d/01-common-collector-defaults.conf, line 11
>>>   # raw: COLLECTOR_PORT = 9619
>>>
>>> [root@ce01-htc ~]# condor_config_val -v COLLECTOR_PORT
>>> COLLECTOR_PORT = 9618
>>>   # at: <Default>
>>>   # raw: COLLECTOR_PORT = 9618
>>>
>>>
>>> The "older" CE has a blah rpm:
>>> [root@ce01-htc ~]# rpm -qa | grep blah
>>> condor-classads-blah-patch-0.0.1-1.el7.centos.x86_64
>>> blahp-1.18.35.bosco-1.osg34.el7.x86_64
>>>
>>> But I have not found a blahp rpm in the repo for 8.8.1.
>>> How does a 3.2.1 HTCondor-CE work on top of a non-HTCondor batch system?
>>>
>>>
>>> Thank you for any help,
>>> Stefano
>>>