
Re: [Condor-users] condor_master problem



Hello Santanu

Condor works perfectly well on Scientific Linux 4.4 - this is the
version we have installed.
You really should consider starting Condor using Derek's condor.boot
file, which can be found in the examples/ folder. I would normally copy
it to /etc/init.d/condor and change the MASTER line in it.
Then just add the service to runlevels 3 and 5, either manually
or via system-config-services (RedHat style, love it or hate it...)
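
As a rough sketch of the "manually" route, assuming a SysV-style layout: the
service is enabled in a runlevel by linking the init script into that
runlevel's rc directory. The S98 start priority and the scratch ROOT prefix
are assumptions made so the example can be tried without touching the real
/etc; the stub script merely stands in for the edited copy of condor.boot.

```shell
#!/bin/sh
# Sketch only: register an init script in runlevels 3 and 5 by hand.
# ROOT is a scratch prefix (assumption); drop it to act on the real /etc.
set -e
ROOT=$(mktemp -d)
mkdir -p "$ROOT/etc/init.d" "$ROOT/etc/rc3.d" "$ROOT/etc/rc5.d"

# Stand-in for the edited copy of examples/condor.boot.
printf '#!/bin/sh\necho "condor $1"\n' > "$ROOT/etc/init.d/condor"
chmod 755 "$ROOT/etc/init.d/condor"

# One start link per runlevel; S98 (start late) is an assumed priority.
for level in 3 5; do
    ln -s ../init.d/condor "$ROOT/etc/rc$level.d/S98condor"
done
ls -l "$ROOT/etc/rc3.d" "$ROOT/etc/rc5.d"
```

On RedHat-style systems, chkconfig --level 35 condor on is the usual
equivalent, provided the script carries a chkconfig header.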

We also run the Condor daemons as user condor, but you do not have to
export CONDOR_IDS for that - I have it undefined in the condor_config file.
Instead, our condor_config file contains this -

CONDOR_ROOT             = /hpc/condor
RELEASE_DIR             = $(CONDOR_ROOT)/releases/x86
LOCAL_DIR               = $(TILDE)
#  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE       = $(CONDOR_ROOT)/hosts/$(HOSTNAME)
REQUIRE_LOCAL_CONFIG_FILE = TRUE

/hpc/condor is shared out to all clients, while on each machine the
condor user has its own home folder, with condor_config linked to the
central one -
# ll ~condor
total 1
lrwxrwxrwx 1 condor root  25 Sep 13 12:44 condor_config -> /hpc/condor/condor_config
drwxr-xr-x 2 condor root  48 Sep 13 12:47 execute
drwxr-xr-x 2 condor root 552 Mar 12 16:40 log
drwxr-xr-x 3 condor root 256 Mar 12 12:13 spool
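
A minimal sketch of how such a home folder could be laid out, assuming
nothing beyond what the listing shows. SHARED and HOME_DIR are scratch
stand-ins for /hpc/condor and ~condor, and the one-line config file is only
a placeholder for the real central condor_config.

```shell
#!/bin/sh
# Sketch only: per-machine home folder for the condor user.
# SHARED and HOME_DIR are scratch stand-ins (assumptions) for
# /hpc/condor and ~condor.
set -e
SCRATCH=$(mktemp -d)
SHARED="$SCRATCH/hpc/condor"
HOME_DIR="$SCRATCH/home/condor"
mkdir -p "$SHARED" "$HOME_DIR"
echo "CONDOR_ROOT = /hpc/condor" > "$SHARED/condor_config"  # placeholder

# Machine-local directories, found through LOCAL_DIR = $(TILDE).
mkdir -p "$HOME_DIR/execute" "$HOME_DIR/log" "$HOME_DIR/spool"

# The only config file on the machine is a link to the central one.
ln -s "$SHARED/condor_config" "$HOME_DIR/condor_config"
ls -l "$HOME_DIR"
```

With this layout only execute, log and spool live on the local disk;
everything else is read from the shared area.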

Each "local config file" sits in a shared folder on the server and is
actually a symbolic link to one of just a few real config files,
which reflect the different architectures and setups.
This way we have the central Condor server, a view server and many clients
happily running Debian (Ubuntu), RedHat, Scientific Linux, SuSE and SLES
in both i386 and 64-bit. On all of them condor_master is started by pretty
much the same /etc/init.d/condor file - the only difference is in MASTER,
which specifies the platform: the 32- or 64-bit version of the binaries.
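
The per-host "local config file" scheme described above can be sketched
roughly as follows. The host names (node01 ...) and the real config file
names (config.x86, config.x86_64) are hypothetical, and HOSTS_DIR is a
scratch stand-in for the shared hosts folder that LOCAL_CONFIG_FILE points
into.

```shell
#!/bin/sh
# Sketch only: per-host local config files as symlinks to a few real files.
# Host names and config file names are hypothetical.
set -e
HOSTS_DIR=$(mktemp -d)/hosts
mkdir -p "$HOSTS_DIR"

# A few real config files, one per architecture/setup.
echo "# settings for 32-bit execute nodes" > "$HOSTS_DIR/config.x86"
echo "# settings for 64-bit execute nodes" > "$HOSTS_DIR/config.x86_64"

# Each host name is a link to whichever real file applies to that host.
link_host() {  # usage: link_host <hostname> <real config file name>
    ln -sf "$2" "$HOSTS_DIR/$1"
}
link_host node01 config.x86
link_host node02 config.x86_64
link_host node03 config.x86_64
ls -l "$HOSTS_DIR"
```

Moving a host to a different setup is then a matter of re-pointing one
link, and a change to a real file reaches every host linked to it.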

cheers,

Andrey Kaliazin
Senior Server Engineer (cluster computing)
Information Systems Aston (ISA)
Aston University, Aston Triangle,
Birmingham, B4 7ET 
Tel: 0121 204 3465 
 

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Santanu Das
> Sent: Monday, March 12, 2007 8:14 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_master problem
> 
> Hi Nicolas,
> 
> Thanks for sharing this info, but unfortunately it still doesn't
> work on Scientific Linux. First of all, I had never used
> condor.boot before. This time I tried it (following the
> instructions written inside) and it didn't work either. I'm
> exporting CONDOR_CONFIG and PATH from /etc/profile.d/ but
> whatever I do, Condor just does not start until I run
> condor_master by hand. I must have missed some silly part(s).
> 1/3 of my nodes are okay; just the newly installed nodes are
> driving me crazy.
> 
> Cheers,
> Santanu
> 
> 
> Nicolas GUIOT wrote: 
> 
> 	You said it starts fine when you run it yourself? Maybe the
> 	PATH is not set when the daemon runs: that happens on my
> 	Debian boxes. I have to add this to the condor.boot file:
> 	
> 	export CONDOR_CONFIG=/nfs/condor/etc/condor_config
> 	PATH=/nfs/opt/condor/bin:/nfs/opt/condor/sbin
> 	
> 	MASTER=/nfs/opt/condor/sbin/condor_master
> 	PS="/bin/ps auwx"
> 	GREP="/bin/grep"
> 	AWK="/usr/bin/awk"
> 	
> 	
> 	Hope this helps...
> 	Nicolas
> 	
> 	----------------
> 	On Mon, 12 Mar 2007 16:22:44 +0000
> 	Santanu Das <santanu@xxxxxxxxxxxxxxxxx> wrote:
> 	
> 	  
> 
> 		Hi Steve,
> 		
> 		Thanks for replying. I tried that but it didn't quite work.
> 		Whether or not I delete the file, running condor_master by
> 		hand starts Condor nicely, but it still doesn't start
> 		automatically when I reboot. Is there anything else I am
> 		missing?
> 		
> 		Cheers,
> 		Santanu
> 		
> 		
> 		Steven Timm wrote:
> 		    
> 
> 			Remove that lock file in /tmp that is mentioned in the
> 			error message below, and condor will start.
> 			
> 			Steve
> 			
> 			
> 			
> 			------------------------------------------------------------------
> 			Steven C. Timm, Ph.D  (630) 840-8525
> 			timm@xxxxxxxx  http://home.fnal.gov/~timm/
> 			Fermilab Computing Division, Scientific Computing Facilities,
> 			Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
> 			
> 			On Sat, 10 Mar 2007, Santanu Das wrote:
> 			
> 			  
> 			      
> 
> 				Hi,
> 				I'm still having the same problem - condor_master just
> 				doesn't start automatically after boot. Does anybody
> 				know anything about it? Thanks in advance for your help.
> 				
> 				Cheers,
> 				Santanu
> 				
> 				Santanu Das wrote:
> 				    
> 				        
> 
> 					Hi all,
> 					
> 					We have a ~150 CPU Condor cluster; most of the nodes are
> 					dual-core Xeons and a few are single-core Xeons. Recently
> 					I upgraded to condor-6.8.4 and since then I see a problem,
> 					mostly on the dual-core nodes. I start Condor from
> 					"rc.local", and the problem is that Condor no longer
> 					starts automatically on boot, in spite of having
> 					"condor_master" in the rc.local file. If I run
> 					condor_master by hand from the console, Condor starts and
> 					everything goes fine after that. For some reason, I run
> 					Condor here as a different user (*NOT* as the default
> 					"condor" user), but I don't think that's the problem.
> 					CONDOR_IDS is correct in the local config file. There is
> 					no significant difference (from the configuration point
> 					of view) among the nodes; all are almost identically
> 					configured (apart from the dual-core and single-core
> 					issue). I just see these in the MasterLog:
> 					
> 					3/8 17:56:03 ******************************************************
> 					3/8 17:56:03 ** condor_master (CONDOR_MASTER) STARTING UP
> 					3/8 17:56:03 ** /opt/condor-6.8.4/sbin/condor_master
> 					3/8 17:56:03 ** $CondorVersion: 6.8.4 Feb  1 2007 $
> 					3/8 17:56:03 ** $CondorPlatform: I386-LINUX_RH9 $
> 					3/8 17:56:03 ** PID = 3216
> 					3/8 17:56:03 ** Log last touched 3/8 17:56:02
> 					3/8 17:56:03 ******************************************************
> 					3/8 17:56:03 Using config source: /opt/condor/etc/condor_config
> 					3/8 17:56:03 Using local config sources:
> 					3/8 17:56:03    /home/condorr/condor_config.local
> 					3/8 17:56:03 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
> 					3/8 17:56:03 ERROR "Can't get lock on "/tmp/condor-lock.farm0420.21308906360446/InstanceLock"" at line 978 in file master.C
> 					3/8 18:08:57 Got SIGTERM. Performing graceful shutdown.
> 					3/8 18:08:57 SafeMsg: sending small msg failed. errno: 22
> 					3/8 18:08:57 Send_Signal: ERROR sending signal 15 to pid 3181
> 					3/8 18:08:57 ERROR: failed to send SIGTERM to pid 3181
> 					3/8 18:08:57 The STARTD (pid 3181) exited with status 0
> 					3/8 18:08:57 All daemons are gone.  Exiting.
> 					3/8 18:08:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
> 					3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
> 					
> 					3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
> 					
> 					Any idea what the problem might be, or what I am missing?
> 					
> 					Cheers,
> 					Santanu
> 					HEP, Cavendish Laboratory
> 					Cambridge
> 					
> 					      
> 					          
> 
> 				
> _______________________________________________
> 				Condor-users mailing list
> 				To unsubscribe, send a message 
> to condor-users-request@xxxxxxxxxxx with a
> 				subject: Unsubscribe
> 				You can also unsubscribe by visiting
> 				
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 				
> 				The archives can be found at either
> 				
> https://lists.cs.wisc.edu/archive/condor-users/
> 				
> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
> 				
> 				    
> 				        
> 
> 
> 	
> 	----------
> 	
> 	
> 	----------------------------------------------------
> 	CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
> 	
> 	Institut de Biologie Physico-Chimique
> 	13 rue Pierre et Marie Curie
> 	75005 PARIS - FRANCE
> 	
> 	Tel : +33 158 41 51 70
> 	Fax : +33 158 41 50 26
> 	----------------------------------------------------
> 
> 
>