
Re: [HTCondor-users] Failing to start: Debian 8 (assertion error)



I finally fixed it.

(1) A lot of old libraries and binaries were left over from the previous install, e.g. old libraries in /usr/lib instead of /usr/lib/condor and vice versa.

The Debian utility "cruft" was very useful in identifying these:

cruft -d '/usr /var /etc'

(Moral: always provide a --prefix to ./configure when building from source)

(2) After removing these, condor still wouldn't start until I manually removed the shared_port_ad file:

# rm /var/lock/condor/shared_port_ad
# service condor stop
# service condor start
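For anyone repeating this, here is a minimal sketch of the cleanup step as a small shell helper. The function name remove_stale_ad is made up for illustration; the default path is the one from my machine and may differ on yours.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): delete a leftover
# shared_port_ad file so that the daemons can recreate it on the next start.
# The default path is the one seen on this Debian 8 box; pass another
# path as the first argument if your LOCK directory is elsewhere.
remove_stale_ad() {
    ad_file="${1:-/var/lock/condor/shared_port_ad}"
    if [ -e "$ad_file" ]; then
        rm -f -- "$ad_file"
        echo "removed $ad_file"
    else
        echo "no file at $ad_file"
    fi
}
```

Run as root, calling remove_stale_ad between "service condor stop" and "service condor start" is probably the safer order, since nothing should be rewriting the file while the service is down. That ordering is my assumption; the order shown above is what I actually ran.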

I got the clue from looking at MasterLog on the working Ubuntu machines:

...
01/04/16 16:24:12 SharedPortEndpoint: waiting for connections to named socket 11207_f65f
01/04/16 16:24:12 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
01/04/16 16:24:12 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
01/04/16 16:24:12 DaemonCore: private command socket at <192.168.5.41:0?sock=11207_f65f>
01/04/16 16:24:12 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
01/04/16 16:24:12 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1450633351)
01/04/16 16:24:12 Collector port not defined, will use default: 9618
01/04/16 16:24:13 Started DaemonCore process "/usr/lib/condor/libexec/condor_shared_port", pid and pgroup = 11216
01/04/16 16:24:13 Waiting for /var/lock/condor/shared_port_ad to appear.
01/04/16 16:24:14 Found /var/lock/condor/shared_port_ad.
01/04/16 16:24:14 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 11217
...

That is: it looks like the master daemon reads this file, and if the file doesn't exist it waits for it to be created. But if the file already exists and contains junk, the master can read the bad data and die.
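To illustrate why a leftover file is a problem, here is a rough sketch of a "wait for the file to appear" loop like the one the log messages suggest. This is not HTCondor's code, just a guess at the general shape; wait_for_file and its arguments are names I made up.

```shell
#!/bin/sh
# Hypothetical polling loop, NOT taken from HTCondor: wait until a file
# exists, checking once per second, and give up after a timeout.
# Arguments: path to wait for, timeout in seconds (default 60).
# Returns 0 once the file exists, 1 if the timeout expires first.
wait_for_file() {
    path="$1"
    timeout="${2:-60}"
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A loop of this kind returns immediately if the file is already there, whether it was written a moment ago or left behind by an old install. An existence check alone cannot tell the two apart, which would fit with what I saw.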

Cheers,

Brian.