
[HTCondor-users] All jobs idle



$CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $

$CondorPlatform: x86_64_RedHat7 $


OWNER    BATCH_NAME    SUBMITTED   DONE   RUN   IDLE  TOTAL  JOB_IDS

jfisher  CMD: ngspice  4/9 15:17      _     _   2700   2700  294.0-2699

2700 jobs; 0 completed, 0 removed, 2700 idle, 0 running, 0 held, 0 suspended


Can anyone help? I recently updated my Condor version and now I can't get it to work. Caveat: I updated other OS (CentOS) packages at the same time.

I have 48 slots, all reported as Unclaimed and Idle. This is just a rerun of something that ran OK a few months ago, so I'm a bit lost.

My CollectorLog looks like this:


04/09/18 16:46:07 Got QUERY_STARTD_PVT_ADS

04/09/18 16:46:07 Number of Active Workers 0

04/09/18 16:46:07 (Sending 48 ads in response to query)

04/09/18 16:46:07 Query info: matched=48; skipped=0; query_time=0.001078; send_time=0.005897; type=MachinePrivate; requirements={true}; peer=<192.168.1.206:27405>; projection={}

04/09/18 16:46:07 Number of Active Workers 0

04/09/18 16:46:07 (Sending 52 ads in response to query)

04/09/18 16:46:07 Query info: matched=52; skipped=9; query_time=0.001561; send_time=0.018082; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<192.168.1.206:28613>; projection={}

04/09/18 16:46:07 DaemonCore: Can't receive command request from 192.168.1.206 (perhaps a timeout?)



192.168.1.206 is the master machine and it's the machine I was using to start the jobs. (It's also the machine I'm writing this email on, so it's definitely available)



MasterLog looks like this:

04/09/18 16:14:26 ******************************************************

04/09/18 16:14:26 ** condor_master (CONDOR_MASTER) STARTING UP

04/09/18 16:14:26 ** /usr/sbin/condor_master

04/09/18 16:14:26 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)

04/09/18 16:14:26 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON

04/09/18 16:14:26 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $

04/09/18 16:14:26 ** $CondorPlatform: x86_64_RedHat7 $

04/09/18 16:14:26 ** PID = 1437

04/09/18 16:14:26 ** Log last touched 4/9 15:53:29

04/09/18 16:14:26 ******************************************************

04/09/18 16:14:26 Using config source: /etc/condor/condor_config

04/09/18 16:14:26 Using local config sources:

04/09/18 16:14:26 /etc/condor/config.d/00master.config

04/09/18 16:14:26 /etc/condor/condor_config.local

04/09/18 16:14:26 config Macros = 86, Sorted = 86, StringBytes = 2342, TablesBytes = 3144

04/09/18 16:14:26 CLASSAD_CACHING is OFF

04/09/18 16:14:26 Daemon Log is logging: D_ALWAYS D_ERROR

04/09/18 16:14:30 SharedPortEndpoint: waiting for connections to named socket 1437_3daf

04/09/18 16:14:30 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory

04/09/18 16:14:30 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.

04/09/18 16:14:30 DaemonCore: private command socket at <192.168.1.206:0?sock=1437_3daf>

04/09/18 16:14:30 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1520895141)

04/09/18 16:14:31 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1926

04/09/18 16:14:31 Waiting for /var/lock/condor/shared_port_ad to appear.

04/09/18 16:14:32 Found /var/lock/condor/shared_port_ad.

04/09/18 16:14:33 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 1979

04/09/18 16:14:33 Waiting for /var/log/condor/.collector_address to appear.

04/09/18 16:14:34 Waiting for /var/log/condor/.collector_address to appear.

04/09/18 16:14:35 Found /var/log/condor/.collector_address.

04/09/18 16:14:36 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 1987

04/09/18 16:14:37 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1990

I can't restart Condor:


sudo condor_restart

ERROR

SECMAN:2010:Received "DENIED" from server for user unauthenticated@unmapped using no authentication method, which may imply host-based security. Our address was '192.168.1.206', and server's address was '192.168.1.206'. Check your ALLOW settings and IP protocols.

Can't send Restart command to local master

I don't know what an ALLOW setting is, and Google hasn't pointed me to anything useful so far, unless it's this from condor_config:

ALLOW_NEGOTIATOR = 192.168.1.206

ALLOW_NEGOTIATOR_SCHEDD = 192.168.1.206


SharedPortLog looks like this:

04/09/18 16:39:31 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :

ForkedChildrenPeak = 0

RequestsBlocked = 0

ForkedChildrenCurrent = 0

RequestsSucceeded = 94

RequestsPendingPeak = 3

RequestsPendingCurrent = 0

RequestsFailed = 0

SharedPortCommandSinfuls = "<192.168.1.206:9618>,<[::1]:9618>"

MyAddress = "<192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP>"


MatchLog looks like this:

04/09/18 16:43:07 Rejected 295.0 group_ANALOG.jfisher@xxxxxxxxxxxxxx <192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP&sock=1437_3daf_4>: no match found

04/09/18 16:44:07 Rejected 295.0 group_ANALOG.jfisher@xxxxxxxxxxxxxx <192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP&sock=1437_3daf_4>: no match found




SchedLog looks like this:

04/09/18 16:44:07 (pid:1990) Activity on stashed negotiator socket: <192.168.1.206:26537>

04/09/18 16:44:07 (pid:1990) Using negotiation protocol: NEGOTIATE

04/09/18 16:44:07 (pid:1990) Negotiating for owner: group_ANALOG.jfisher@xxxxxxxxxxxxxx

04/09/18 16:44:07 (pid:1990) Finished negotiating for group_ANALOG.jfisher in local pool: 0 matched, 1 rejected





NegotiatorLog looks like this:

04/09/18 16:45:07 ---------- Started Negotiation Cycle ----------

04/09/18 16:45:07 Phase 1: Obtaining ads from collector ...

04/09/18 16:45:07 Getting startd private ads ...

04/09/18 16:45:07 Getting Scheduler, Submitter and Machine ads ...

04/09/18 16:45:07 Sorting 52 ads ...

04/09/18 16:45:07 Got ads: 52 public and 48 private

04/09/18 16:45:07 Public ads include 1 submitter, 48 startd

04/09/18 16:45:07 Phase 2: Performing accounting ...

04/09/18 16:45:07 Phase 3: Sorting submitter ads by priority ...

04/09/18 16:45:07 Phase 4.1: Negotiating with schedds ...

04/09/18 16:45:07 Negotiating with group_ANALOG.jfisher@xxxxxxxxxxxxxx at <192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP&sock=1437_3daf_4>

04/09/18 16:45:07 0 seconds so far for this submitter

04/09/18 16:45:07 0 seconds so far for this schedd

04/09/18 16:45:07 Got NO_MORE_JOBS; schedd has no more requests

04/09/18 16:45:07 Request 00295.00000: autocluster 1 (request count 1 of 2700)

04/09/18 16:45:07 Rejected 295.0 group_ANALOG.jfisher@xxxxxxxxxxxxxx <192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP&sock=1437_3daf_4>: no match found

04/09/18 16:45:07 Got NO_MORE_JOBS; schedd has no more requests

04/09/18 16:45:07 negotiateWithGroup resources used scheddAds length 0

04/09/18 16:45:07 ---------- Finished Negotiation Cycle ----------
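
In case it helps anyone reading these logs, here is a minimal sketch of how the "Rejected" lines could be tallied per job and reason. It assumes the line layout seen in the MatchLog and NegotiatorLog excerpts above (timestamp, "Rejected", job ID, submitter, address in angle brackets, then the reason). The names REJECT_RE, tally_rejections and SAMPLE are made up for this illustration and are not part of HTCondor.

```python
import re
from collections import Counter

# Assumed layout, based only on the excerpts in this message:
#   <MM/DD/YY HH:MM:SS> Rejected <cluster.proc> <submitter> <address>: <reason>
REJECT_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"Rejected\s+(?P<job>\d+\.\d+)\s+(?P<submitter>\S+)\s+"
    r"<[^>]*>:\s*(?P<reason>.+?)\s*$"
)


def tally_rejections(lines):
    """Count how often each (job ID, reason) pair appears in 'Rejected' lines."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.match(line.strip())
        if match:  # lines that are not rejections are skipped
            counts[(match.group("job"), match.group("reason"))] += 1
    return counts


# Lines copied from the logs in this message (two rejections, one other line).
SAMPLE = [
    "04/09/18 16:43:07 Rejected 295.0 group_ANALOG.jfisher@xxxxxxxxxxxxxx "
    "<192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP&sock=1437_3daf_4>: no match found",
    "04/09/18 16:44:07 Rejected 295.0 group_ANALOG.jfisher@xxxxxxxxxxxxxx "
    "<192.168.1.206:9618?addrs=192.168.1.206-9618+[--1]-9618&noUDP&sock=1437_3daf_4>: no match found",
    "04/09/18 16:45:07 Got NO_MORE_JOBS; schedd has no more requests",
]

if __name__ == "__main__":
    for (job, reason), count in tally_rejections(SAMPLE).items():
        print(f"job {job}: '{reason}' seen {count} time(s)")
```

To run it over a real log instead of SAMPLE, the lines of the file could be passed in directly, for example `tally_rejections(open(path))` with `path` pointing at wherever the log lives on your system.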



--
Kind regards,

Justin Fisher.