[HTCondor-users] Help configuring condor for PBS



I have been trying to get HTCondor to work with a PBS scheduler over kerberized ssh, but I can't get the GAHP server to start.

I can successfully add the machine using bosco_cluster -a mpotts@poi pbs (note: I used a fully qualified address; I just removed the trailing section for this post).

However, when I run bosco_cluster -t mpotts@poi, the test fails with the following--

Testing ssh to mpotts@xxxxxxxxxxxx!
Testing bosco submission...Passed!
Submission and log files for this job are in /data/users/mpotts/condor-scratch/bosco-test/boscotest.2IHjF
Waiting for jobmanager to accept job...Passed
Checking for submission to remote pbs cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
07/18/19 15:31:29 [38206] Gahp Server (pid=38233) exited with status 127 unexpectedly
07/18/19 15:31:31 [38206] gahp server not up yet, delaying ping
07/18/19 15:31:31 [38206] No jobs left, shutting down
07/18/19 15:31:31 [38206] Got SIGTERM. Performing graceful shutdown.
07/18/19 15:31:31 [38206] **** condor_gridmanager (condor_GRIDMANAGER) pid 38206 EXITING WITH STATUS 0

The gridmanager log on the submitting machine shows this--

07/18/19 15:31:23 ******************************************************
07/18/19 15:31:23 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
07/18/19 15:31:23 ** /s4data/users/mpotts/condor/sbin/condor_gridmanager
07/18/19 15:31:23 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
07/18/19 15:31:23 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
07/18/19 15:31:23 ** $CondorVersion: 8.8.4 Jul 09 2019 BuildID: 474941 $
07/18/19 15:31:23 ** $CondorPlatform: x86_64_RedHat6 $
07/18/19 15:31:23 ** PID = 38206
07/18/19 15:31:23 ** Log last touched 7/18 15:26:32
07/18/19 15:31:23 ******************************************************
07/18/19 15:31:23 Using config source: /s4data/users/mpotts/condor/etc/condor_config
07/18/19 15:31:23 Using local config sources:
07/18/19 15:31:23 /data/users/mpotts/condor-scratch/config/condor_config.bosco_routing
07/18/19 15:31:23 /data/users/mpotts/condor-scratch/config/condor_config.factory
07/18/19 15:31:23 /data/users/mpotts/condor-scratch/condor_config.local
07/18/19 15:31:23 config Macros = 91, Sorted = 91, StringBytes = 3166, TablesBytes = 3340
07/18/19 15:31:23 CLASSAD_CACHING is ENABLED
07/18/19 15:31:23 Daemon Log is logging: D_ALWAYS D_ERROR
07/18/19 15:31:23 SharedPortEndpoint: waiting for connections to named socket 12713_355e_10
07/18/19 15:31:23 DaemonCore: command socket at <127.0.0.1:11000?addrs=127.0.0.1-11000&noUDP&sock=12713_355e_10>
07/18/19 15:31:23 DaemonCore: private command socket at <127.0.0.1:11000?addrs=127.0.0.1-11000&noUDP&sock=12713_355e_10>
07/18/19 15:31:26 [38206] Found job 20.0 --- inserting
07/18/19 15:31:26 [38206] gahp server not up yet, delaying ping
07/18/19 15:31:26 [38206] (20.0) doEvaluateState called: gmState GM_INIT, remoteState 0
07/18/19 15:31:26 [38206] GAHP server pid = 38233
07/18/19 15:31:29 [38206] Failed to read GAHP server version
07/18/19 15:31:29 [38206] (20.0) Error starting GAHP
07/18/19 15:31:29 [38206] Gahp Server (pid=38233) exited with status 127 unexpectedly
07/18/19 15:31:31 [38206] gahp server not up yet, delaying ping
07/18/19 15:31:31 [38206] No jobs left, shutting down
07/18/19 15:31:31 [38206] Got SIGTERM. Performing graceful shutdown.
07/18/19 15:31:31 [38206] **** condor_gridmanager (condor_GRIDMANAGER) pid 38206 EXITING WITH STATUS 0
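
For what it's worth, exit status 127 is the value a POSIX shell conventionally returns when the command it was asked to run cannot be found, so the status in the log may mean that the GAHP launch command, or something it invokes over ssh, is missing or not on the PATH. That is only a guess on my part. A minimal sketch of the convention itself, using a deliberately nonexistent, hypothetical command name (it is not a real HTCondor or BOSCO binary):

```shell
# Illustration of the shell convention only; this does not touch HTCondor or BOSCO.
# "no_such_gahp_command_xyz" is a hypothetical name chosen because it should not
# exist on any PATH.
status=0
sh -c 'no_such_gahp_command_xyz' 2>/dev/null || status=$?

# Report the exit status the shell gave for the missing command.
echo "exit status: $status"
```

If that is the cause here, running the same launch command by hand over ssh should reproduce the 127, but I have not confirmed which command the gridmanager actually runs.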

Does anyone have any suggestions on how to get this to connect? I was able to connect to a Slurm-based resource using the same approach, so I am not sure what is going wrong here or how to debug it.

Thanks!

-Mark

--
Mark A. Potts, Ph.D.
Sr. HPC Software Developer
RedLine Performance Solutions, LLC
Phone 202-744-9469
Mark.Potts@xxxxxxxx
mpotts@xxxxxxxxxxxxxxx

