Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] condor_master problem

Date: Fri, 09 Mar 2007 10:27:02 +0000
From: Santanu Das <santanu@xxxxxxxxxxxxxxxxx>
Subject: [Condor-users] condor_master problem

Hi all,

We have a ~150 CPU Condor cluster; most of the nodes are dual-core Xeons and a few are single-core Xeons. Recently I upgraded to condor-6.8.4, and since then I have seen a problem, mostly on the dual-core nodes. I start Condor from "rc.local", and the problem is that Condor no longer starts automatically on boot, in spite of having "condor_master" in the rc.local file. If I run condor_master by hand from the console, Condor starts and everything works fine after that. For some reason, I run Condor here as a different user (*NOT* as the default "condor" user), but I don't think that's the problem. CONDOR_IDS is correct in the local config file. There are no significant differences (from the configuration point of view) among the nodes; all are almost identically configured (apart from the dual-core/single-core difference). I just see these in the MasterLog:


3/8 17:56:03 ******************************************************
3/8 17:56:03 ** condor_master (CONDOR_MASTER) STARTING UP
3/8 17:56:03 ** /opt/condor-6.8.4/sbin/condor_master
3/8 17:56:03 ** $CondorVersion: 6.8.4 Feb  1 2007 $
3/8 17:56:03 ** $CondorPlatform: I386-LINUX_RH9 $
3/8 17:56:03 ** PID = 3216
3/8 17:56:03 ** Log last touched 3/8 17:56:02
3/8 17:56:03 ******************************************************
3/8 17:56:03 Using config source: /opt/condor/etc/condor_config
3/8 17:56:03 Using local config sources:
3/8 17:56:03    /home/condorr/condor_config.local

3/8 17:56:03 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
3/8 17:56:03 ERROR "Can't get lock on "/tmp/condor-lock.farm0420.21308906360446/InstanceLock"" at line 978 in file master.C

3/8 18:08:57 Got SIGTERM. Performing graceful shutdown.
3/8 18:08:57 SafeMsg: sending small msg failed. errno: 22
3/8 18:08:57 Send_Signal: ERROR sending signal 15 to pid 3181
3/8 18:08:57 ERROR: failed to send SIGTERM to pid 3181
3/8 18:08:57 The STARTD (pid 3181) exited with status 0
3/8 18:08:57 All daemons are gone.  Exiting.
3/8 18:08:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
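
For reference on the FileLock::obtain line in the log above: on Linux, errno 11 is EAGAIN, which is what a non-blocking file lock attempt normally reports when some other holder already has the lock. The sketch below is only an illustration of that generic behaviour using Python's fcntl.flock; it is not Condor's code, and the names try_lock and demo, as well as the temporary lock path, are made up for the example. Whether condor_master takes its InstanceLock in exactly this way is an assumption.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on `path`.

    Returns (fd, None) on success, or (None, errno_value) if the lock
    could not be obtained (for example because someone else holds it).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        return None, exc.errno
    return fd, None


def demo():
    """Lock a scratch file twice, then once more after releasing it."""
    lock_dir = tempfile.mkdtemp()
    # Hypothetical stand-in for a lock file such as InstanceLock.
    lock_path = os.path.join(lock_dir, "InstanceLock")

    # First holder gets the lock.
    first_fd, first_err = try_lock(lock_path)
    # Second attempt uses a separate open() of the same file, so it
    # conflicts with the first holder and fails immediately.
    second_fd, second_err = try_lock(lock_path)

    # Closing the first descriptor releases its lock.
    os.close(first_fd)
    third_fd, third_err = try_lock(lock_path)
    os.close(third_fd)

    os.remove(lock_path)
    os.rmdir(lock_dir)
    return first_err, second_err, third_err


if __name__ == "__main__":
    first_err, second_err, third_err = demo()
    print("first attempt error: ", first_err)
    print("second attempt error:", second_err,
          errno.errorcode.get(second_err))
    print("third attempt error: ", third_err)
```

If the lock in the log behaves like this, one thing that may be worth checking is whether another condor_master (or a leftover process from an earlier start attempt) is already holding the same lock file at boot time.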

Any idea what might be the problem or what am I missing?

Cheers,
Santanu
HEP, Cavendish Laboratory
Cambridge

Follow-Ups:
- Re: [Condor-users] condor_master problem
  - From: Santanu Das

References:
- [Condor-users] Cygwin Perl scripting under Condor in Windows
  - From: Alan Cass
