
Re: [Condor-users] condor_master problem



You said it starts fine when you run it by hand? Maybe PATH is not set when the daemon is started at boot; that happens on my Debian boxes. I have to add the following to the condor.boot file:

export CONDOR_CONFIG=/nfs/condor/etc/condor_config
PATH=/nfs/opt/condor/bin:/nfs/opt/condor/sbin

MASTER=/nfs/opt/condor/sbin/condor_master
PS="/bin/ps auwx"
GREP="/bin/grep"
AWK="/usr/bin/awk"
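To show how those settings could be used at boot, here is a minimal sketch of a wrapper, not my actual condor.boot file. The function names (condor_env_ok, find_pids, start_condor_master) are made up for illustration, the paths are the site-specific ones quoted above, and I append the existing PATH instead of replacing it, which is my own choice. It checks that the binary and config are reachable and that no condor_master is already running (which may be what the lock error in your log points to) before starting one.

```shell
#!/bin/sh
# Sketch only: paths are site-specific assumptions taken from the lines above;
# all function names are hypothetical.

CONDOR_CONFIG=/nfs/condor/etc/condor_config
export CONDOR_CONFIG
# Prepend the Condor directories but keep the rest of PATH (an assumption,
# so that ps, grep and awk are still found without absolute paths).
PATH=/nfs/opt/condor/bin:/nfs/opt/condor/sbin:$PATH
export PATH

MASTER=/nfs/opt/condor/sbin/condor_master

# Succeeds only if $1 is an executable file and $2 is a readable file.
condor_env_ok() {
    [ -x "$1" ] && [ -r "$2" ]
}

# Prints the PIDs of processes whose ps line mentions the given name.
find_pids() {
    ps auwx | grep "$1" | grep -v grep | awk '{print $2}'
}

# Starts condor_master unless the environment is incomplete or a master
# already appears in the process list.
start_condor_master() {
    if ! condor_env_ok "$MASTER" "$CONDOR_CONFIG"; then
        echo "condor_master or condor_config is not reachable" >&2
        return 1
    fi
    if [ -n "$(find_pids condor_master)" ]; then
        echo "condor_master already seems to be running" >&2
        return 0
    fi
    "$MASTER"
}
```

You would source this from rc.local (or the init script) and then call start_condor_master; the error message on stderr should at least tell you which of the two checks failed at boot.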


Hope this helps...
Nicolas

----------------
On Mon, 12 Mar 2007 16:22:44 +0000
Santanu Das <santanu@xxxxxxxxxxxxxxxxx> wrote:

> Hi Steve,
> 
> Thanks for replying. I tried that, but it didn't work well. Whether or
> not I delete the file, running CONDOR_MASTER starts Condor
> nicely, but it still doesn't start automatically when I reboot. Is there
> anything else I am missing?
> 
> Cheers,
> Santanu
> 
> 
> Steven Timm wrote:
> > Remove that lock file in /tmp that is mentioned in the error message
> > below, and condor will start.
> >
> > Steve
> >
> >
> > ------------------------------------------------------------------
> > Steven C. Timm, Ph.D  (630) 840-8525
> > timm@xxxxxxxx  http://home.fnal.gov/~timm/
> > Fermilab Computing Division, Scientific Computing Facilities,
> > Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
> >
> > On Sat, 10 Mar 2007, Santanu Das wrote:
> >
> >   
> >> Hi,
> >> I'm still having the same problem - condor_master just doesn't start
> >> automatically after boot. Does anybody know anything about it? Thanks in
> >> advance for your help.
> >>
> >> Cheers,
> >> Santanu
> >>
> >> Santanu Das wrote:
> >>     
> >>> Hi all,
> >>>
> >>> We have a ~150 CPU Condor cluster; most nodes are dual-core Xeons and
> >>> a few are single-core Xeons. I recently upgraded to
> >>> condor-6.8.4 and since then I see a problem, mostly on the
> >>> dual-core nodes. I start Condor from "rc.local", and the problem is
> >>> that Condor does not start automatically on boot, in spite of having
> >>> "condor_master" in the rc.local file. If I run condor_master by hand
> >>> from the console, Condor starts and everything works fine after that.
> >>> For some reason, I run Condor here as a different user (*NOT* as the
> >>> default "condor" user), but I don't think that's the problem.
> >>> CONDOR_IDS is correct in the local config file. There are no
> >>> significant differences (from the configuration point of view) among
> >>> the nodes; all are almost identically configured (apart from the
> >>> dual-core versus single-core difference). I just see these in the MasterLog:
> >>>
> >>> 3/8 17:56:03 ******************************************************
> >>> 3/8 17:56:03 ** condor_master (CONDOR_MASTER) STARTING UP
> >>> 3/8 17:56:03 ** /opt/condor-6.8.4/sbin/condor_master
> >>> 3/8 17:56:03 ** $CondorVersion: 6.8.4 Feb  1 2007 $
> >>> 3/8 17:56:03 ** $CondorPlatform: I386-LINUX_RH9 $
> >>> 3/8 17:56:03 ** PID = 3216
> >>> 3/8 17:56:03 ** Log last touched 3/8 17:56:02
> >>> 3/8 17:56:03 ******************************************************
> >>> 3/8 17:56:03 Using config source: /opt/condor/etc/condor_config
> >>> 3/8 17:56:03 Using local config sources:
> >>> 3/8 17:56:03    /home/condorr/condor_config.local
> >>> 3/8 17:56:03 FileLock::obtain(1) failed - errno 11 (Resource
> >>> temporarily unavailable)
> >>> 3/8 17:56:03 ERROR "Can't get lock on
> >>> "/tmp/condor-lock.farm0420.21308906360446/InstanceLock"" at line 978
> >>> in file master.C
> >>> 3/8 18:08:57 Got SIGTERM. Performing graceful shutdown.
> >>> 3/8 18:08:57 SafeMsg: sending small msg failed. errno: 22
> >>> 3/8 18:08:57 Send_Signal: ERROR sending signal 15 to pid 3181
> >>> 3/8 18:08:57 ERROR: failed to send SIGTERM to pid 3181
> >>> 3/8 18:08:57 The STARTD (pid 3181) exited with status 0
> >>> 3/8 18:08:57 All daemons are gone.  Exiting.
> >>> 3/8 18:08:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
> >>> 3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed:
> >>> Success
> >>>
> >>> 3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed:
> >>> Success
> >>>
> >>> Any idea what the problem might be, or what I am missing?
> >>>
> >>> Cheers,
> >>> Santanu
> >>> HEP, Cavendish Laboratory
> >>> Cambridge
> >>>
> >>>       
> >> _______________________________________________
> >> Condor-users mailing list
> >> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> >> subject: Unsubscribe
> >> You can also unsubscribe by visiting
> >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>
> >> The archives can be found at either
> >> https://lists.cs.wisc.edu/archive/condor-users/
> >> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
> >>
> >>     
> 

----------


----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique

Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------