
Re: [Condor-users] condor_master problem



Hi Nicolas,

Thanks for sharing this info, but unfortunately it still doesn't work on Scientific Linux. First of all, I had never used condor.boot before. This time I tried it (following the instructions written inside) and it didn't work either. I'm exporting CONDOR_CONFIG and PATH from /etc/profile.d/, but whatever I do, Condor just doesn't start until I run condor_master by hand. I must have missed some silly part(s). 1/3 of my nodes are okay; it's just the newly installed nodes that are driving me crazy.

Cheers,
Santanu


Nicolas GUIOT wrote:
You said it starts fine when you run it yourself? Maybe the PATH is not set when the daemon runs: that happens on my Debian boxes. I have to add this to the condor.boot file:

export CONDOR_CONFIG=/nfs/condor/etc/condor_config
PATH=/nfs/opt/condor/bin:/nfs/opt/condor/sbin

MASTER=/nfs/opt/condor/sbin/condor_master
PS="/bin/ps auwx"
GREP="/bin/grep"
AWK="/usr/bin/awk"
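
One likely reason for the difference between a manual start and a boot-time start: scripts in /etc/profile.d are sourced by login shells, not by rc.local or an init script, so variables exported there are not visible at boot. A minimal sketch of setting the environment explicitly before launching the master is below. The function name start_condor_master and the CONDOR_SBIN variable are hypothetical, and the default paths are placeholders modeled on the ones quoted in this thread; adjust them for your site.

```shell
#!/bin/sh
# Sketch only: start_condor_master and CONDOR_SBIN are hypothetical names,
# and the default paths are placeholders that must be adapted per site.

start_condor_master() {
    # Fall back to placeholder defaults when the caller has set nothing.
    CONDOR_CONFIG="${CONDOR_CONFIG:-/opt/condor/etc/condor_config}"
    condor_sbin="${CONDOR_SBIN:-/opt/condor/sbin}"

    # Export explicitly: at boot time nothing from /etc/profile.d is loaded.
    export CONDOR_CONFIG
    PATH="$condor_sbin:$PATH"
    export PATH

    # Fail loudly instead of silently not starting.
    if [ ! -r "$CONDOR_CONFIG" ]; then
        echo "config not readable: $CONDOR_CONFIG" >&2
        return 1
    fi
    if [ ! -x "$condor_sbin/condor_master" ]; then
        echo "condor_master not found in $condor_sbin" >&2
        return 1
    fi

    "$condor_sbin/condor_master"
}
```

Calling start_condor_master from rc.local (after defining or sourcing it) and redirecting its stderr to a file should at least show which of the two checks fails on the newly installed nodes.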


Hope this helps...
Nicolas

----------------
On Mon, 12 Mar 2007 16:22:44 +0000
Santanu Das <santanu@xxxxxxxxxxxxxxxxx> wrote:

  
Hi Steve,

Thanks for replying. I tried that but it didn't go well. Whether or not
I delete the file, running CONDOR_MASTER starts Condor nicely, but it
still doesn't start automatically when I reboot. Is there anything else
I'm missing?

Cheers,
Santanu


Steven Timm wrote:
    
Remove that lock file in /tmp that is mentioned in the error message
below, and condor will start.
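
As a rough sketch of that cleanup, assuming the lock directories follow the /tmp/condor-lock.* pattern seen in the error message: the functions below remove them only when no condor_master process is found, since deleting a lock that a live master still holds would not help. The function names master_running and remove_stale_locks are hypothetical, and pgrep is assumed to be installed.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; pgrep is assumed available.

master_running() {
    # pgrep -x matches the process name exactly.
    pgrep -x condor_master >/dev/null 2>&1
}

remove_stale_locks() {
    # Directory holding the condor-lock.* directories; defaults to /tmp.
    lock_parent="${1:-/tmp}"

    if master_running; then
        echo "condor_master is running; leaving locks alone" >&2
        return 1
    fi

    for d in "$lock_parent"/condor-lock.*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        echo "removing $d"
        rm -rf -- "$d"
    done
}
```

Running remove_stale_locks with no argument targets /tmp; passing a directory lets you try it on a copy first.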

Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.

On Sat, 10 Mar 2007, Santanu Das wrote:

  
      
Hi,
I'm still having the same problem - condor_master just doesn't start
automatically after boot. Does anybody know anything about it? Thanks in
advance for your help.

Cheers,
Santanu

Santanu Das wrote:
    
        
Hi all,

We have a ~150 CPU Condor cluster; most of the nodes are dual-core Xeons
and a few are single-core Xeons. I recently upgraded to condor-6.8.4
and since then I have seen a problem, mostly on the dual-core nodes.
I start Condor from "rc.local", and the problem is that Condor no
longer starts automatically on boot, in spite of having
"condor_master" in the rc.local file. If I run condor_master by hand
from the console, Condor starts and everything goes fine after that.
For some reason, I run Condor here as a different user (*NOT* as the
default "condor" user), but I don't think that's the problem.
CONDOR_IDS is correct in the local config file. There is no
significant difference (from the configuration point of view) among
the nodes; all are almost identically configured (apart from the
dual-core versus single-core difference). I just see these in the MasterLog:

3/8 17:56:03 ******************************************************
3/8 17:56:03 ** condor_master (CONDOR_MASTER) STARTING UP
3/8 17:56:03 ** /opt/condor-6.8.4/sbin/condor_master
3/8 17:56:03 ** $CondorVersion: 6.8.4 Feb  1 2007 $
3/8 17:56:03 ** $CondorPlatform: I386-LINUX_RH9 $
3/8 17:56:03 ** PID = 3216
3/8 17:56:03 ** Log last touched 3/8 17:56:02
3/8 17:56:03 ******************************************************
3/8 17:56:03 Using config source: /opt/condor/etc/condor_config
3/8 17:56:03 Using local config sources:
3/8 17:56:03    /home/condorr/condor_config.local
3/8 17:56:03 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
3/8 17:56:03 ERROR "Can't get lock on "/tmp/condor-lock.farm0420.21308906360446/InstanceLock"" at line 978 in file master.C
3/8 18:08:57 Got SIGTERM. Performing graceful shutdown.
3/8 18:08:57 SafeMsg: sending small msg failed. errno: 22
3/8 18:08:57 Send_Signal: ERROR sending signal 15 to pid 3181
3/8 18:08:57 ERROR: failed to send SIGTERM to pid 3181
3/8 18:08:57 The STARTD (pid 3181) exited with status 0
3/8 18:08:57 All daemons are gone.  Exiting.
3/8 18:08:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
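
On Linux, errno 11 is EAGAIN, which for a non-blocking lock request usually means another process already holds the lock. To see which lock files a MasterLog complains about, a small filter like the one below may help. It is only a sketch: the function name lock_paths_from_log is hypothetical, and it assumes each log entry sits on a single line, as it would in the log file itself rather than in a wrapped email.

```shell
#!/bin/sh
# Sketch only: lock_paths_from_log is a hypothetical helper name.

lock_paths_from_log() {
    # Print each distinct path named in a "Can't get lock on" error.
    # $1 is the path to a MasterLog file.
    sed -n "s/.*Can't get lock on \"\\([^\"]*\\)\".*/\\1/p" "$1" | sort -u
}
```

If fuser (from psmisc) is installed, running it on a printed path should list any process that still has the file open.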

Any idea what the problem might be, or what I am missing?

Cheers,
Santanu
HEP, Cavendish Laboratory
Cambridge

      
          
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR

    
        
  
      

----------


----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique

Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------