
Re: [Condor-users] Condor service failed to stop



Sorry, my previous message went to the wrong thread.

Hello Ben, 

BB> Hi Pavel: 

BB> >>> For the first machine: when I start the Condor service, two processes are started: condor_master and condor_startd. But 10-15 seconds after startup, condor_startd dies and condor_master starts consuming 50% of the CPU. After that I can't stop the Condor service. When I try, I get an error message saying the service could not be stopped because it did not respond in time. Note that condor_status on the central manager does not list this machine, either while the service is "running" or after my attempt to stop it. 
BB> <<< 

BB> Can you post the master and startd logs? (Preferably with debugging turned up to, say, D_FULLDEBUG.) Also, when the startd dies, does it leave a core file behind? If so, please post that too. 

Here they are. 

MasterLog: 
======== 
3/17 12:16:22 UnsetEnv(NET_REMAP_ENABLE): SetEnvironmentVariable failed, errno=203 
3/17 12:16:22 WARNING: Config source is empty: C:\condor/condor_config.local 
3/17 12:16:22 ****************************************************** 
3/17 12:16:22 ** Condor (CONDOR_MASTER) STARTING UP 
3/17 12:16:22 ** C:\condor\bin\condor_master.exe 
3/17 12:16:22 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1) 
3/17 12:16:22 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON 
3/17 12:16:22 ** $CondorVersion: 7.2.1 Feb 19 2009 BuildID: 133382 $ 
3/17 12:16:22 ** $CondorPlatform: INTEL-WINNT50 $ 
3/17 12:16:22 ** PID = 3488 
3/17 12:16:22 ** Log last touched time unavailable (No such file or directory) 
3/17 12:16:22 ****************************************************** 
3/17 12:16:22 Using config source: C:\condor\condor_config 
3/17 12:16:22 Using local config sources: 
3/17 12:16:22 C:\condor/condor_config.local 
3/17 12:16:22 DaemonCore: Command Socket at <195.209.147.39:1033> 
3/17 12:16:22 Will use UDP to update collector n37.keldysh.ru <195.209.147.37:9618> 
3/17 12:16:22 Log file not found in config file: AGENTD_LOG 
3/17 12:16:22 Authorized application C:\condor/bin/condor_startd.exe is now enabled in the firewall. 
3/17 12:16:22 Authorized application C:\condor/bin\condor_dagman.exe is now enabled in the firewall. 
3/17 12:16:22 ::RealStart; STARTD on_hold=0 
3/17 12:16:22 GetBinaryType() returned 0 
3/17 12:16:22 Started DaemonCore process "C:\condor/bin/condor_startd.exe", pid and pgroup = 3692 
3/17 12:16:22 ::RealStart; AGENTD on_hold=0 
3/17 12:16:22 GetBinaryType() returned 0 
3/17 12:16:22 Started process "C:\condor/agentd/agentd.exe", pid and pgroup = 3716 
3/17 12:16:22 Getting monitoring info for pid 3488 
3/17 12:16:27 enter Daemons::UpdateCollector 
3/17 12:16:27 Trying to update collector <195.209.147.37:9618> 
3/17 12:16:27 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618> 
3/17 12:16:27 File descriptor limits: max 1024, safe 820 
3/17 12:16:27 exit Daemons::UpdateCollector 
3/17 12:16:27 enter Daemons::CheckForNewExecutable 
3/17 12:16:27 Time stamp of running C:\condor/bin/condor_master.exe: 1234996426 
3/17 12:16:27 GetTimeStamp returned: 1234996426 
3/17 12:16:27 Time stamp of running C:\condor/bin/condor_startd.exe: 1234996492 
3/17 12:16:27 GetTimeStamp returned: 1234996492 
3/17 12:16:27 Time stamp of running C:\condor/agentd/agentd.exe: 1233239202 
3/17 12:16:27 GetTimeStamp returned: 1233239202 
3/17 12:16:27 exit Daemons::CheckForNewExecutable 
3/17 12:16:27 Initialized the following authorization table: 
3/17 12:16:27 Authorizations yet to be resolved: 
3/17 12:16:27 allow NEGOTIATOR: */195.209.147.37 */n37.keldysh.ru 
3/17 12:16:27 allow ADMINISTRATOR: */195.209.147.37 */n37.keldysh.ru 
3/17 12:16:27 allow OWNER: */n39.keldysh.ru */195.209.147.37 */n37.keldysh.ru */195.209.147.39 
3/17 12:16:33 The STARTD (pid 3692) exited with status 0 
3/17 12:16:33 ProcAPI: pid # 3692 was not found (OpenProcess err=720) 
3/17 12:16:33 ProcAPI: pid # 3692 was not found (OpenProcess err=720) 
3/17 12:16:33 restarting C:\condor/bin/condor_startd.exe in 10 seconds 
3/17 12:16:33 enter Daemons::UpdateCollector 
3/17 12:16:33 Trying to update collector <195.209.147.37:9618> 
3/17 12:16:33 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618> 
======== 


StartLog: 
======== 
3/17 12:16:22 WARNING: Config source is empty: C:\condor/condor_config.local 
3/17 12:16:22 ****************************************************** 
3/17 12:16:22 ** condor_startd.exe (CONDOR_STARTD) STARTING UP 
3/17 12:16:22 ** C:\condor\bin\condor_startd.exe 
3/17 12:16:22 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1) 
3/17 12:16:22 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON 
3/17 12:16:22 ** $CondorVersion: 7.2.1 Feb 19 2009 BuildID: 133382 $ 
3/17 12:16:22 ** $CondorPlatform: INTEL-WINNT50 $ 
3/17 12:16:22 ** PID = 3692 
3/17 12:16:22 ** Log last touched time unavailable (No such file or directory) 
3/17 12:16:22 ****************************************************** 
3/17 12:16:22 Using config source: C:\condor\condor_config 
3/17 12:16:22 Using local config sources: 
3/17 12:16:22 C:\condor/condor_config.local 
3/17 12:16:22 DaemonCore: Command Socket at 
3/17 12:16:22 Will use UDP to update collector n37.keldysh.ru <195.209.147.37:9618> 
3/17 12:16:22 Memory: Detected 3574 megs RAM 
3/17 12:16:22 doInitialize() failed for 
3/17 12:16:22 No usable network interface: hibernation disabled 
3/17 12:16:23 my_popen: CreateProcess failed 
3/17 12:16:23 Failed to execute C:\condor/bin/condor_starter.std.exe, ignoring 
3/17 12:16:23 command_x_event() called. 
3/17 12:16:23 slot1: New machine resource allocated 
3/17 12:16:23 slot2: New machine resource allocated 
3/17 12:16:23 Instantiating a StartdHookMgr 
3/17 12:16:23 UidDomain = "n39.keldysh.ru" 
3/17 12:16:23 FileSystemDomain = "n39.keldysh.ru" 
3/17 12:16:23 Swap space: 4194303 
3/17 12:16:28 no loadavg samples this minute, maybe thread died??? 
3/17 12:16:28 slot1: Total execute space: 32051372 
3/17 12:16:28 slot2: Total execute space: 32051372 
3/17 12:16:28 About to run initial benchmarks. 
3/17 12:16:28 About to compute mips 
3/17 12:16:28 Computed mips: 7297 
3/17 12:16:28 About to compute kflops 
3/17 12:16:33 Computed kflops: 1629489 
3/17 12:16:33 recalc:DHRY_MIPS=7297, CLINPACK KFLOPS=1629489 
3/17 12:16:33 Completed initial benchmarks. 
3/17 12:16:33 CronMgr: Constructing 'startd' 
3/17 12:16:33 CronMgr: Setting name to 'startd' 
3/17 12:16:33 CronMgr: Setting parameter base to 'startd' 
3/17 12:16:33 CronMgr: Doing config (initial) 
3/17 12:16:33 command_x_event() called. 
3/17 12:16:33 slot2: State change: IS_OWNER is false 
3/17 12:16:33 slot2: Changing state: Owner -> Unclaimed 
3/17 12:16:33 slot1: State change: IS_OWNER is false 
3/17 12:16:33 slot1: Changing state: Owner -> Unclaimed 
3/17 12:16:33 ERROR "select, error # = 10038" at line 2719 in file ..\src\condor_daemon_core.V6\daemon_core.cpp 
3/17 12:16:33 CronMgr: 0 jobs alive 
3/17 12:16:33 Deleting Cronmgr 
3/17 12:16:33 StartdCronMgr: Shutting down 
3/17 12:16:33 CronMgr: Killing all jobs 
3/17 12:16:33 StartdCronMgr: Bye 
3/17 12:16:33 CronMgr: bye 
3/17 12:16:33 About to send final update to the central manager 
3/17 12:16:33 Trying to update collector <195.209.147.37:9618> 
3/17 12:16:33 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618> 
3/17 12:16:33 Initialized the following authorization table: 
3/17 12:16:33 Authorizations yet to be resolved: 
3/17 12:16:33 allow READ: */* 
3/17 12:16:33 allow WRITE: */* 
3/17 12:16:33 allow NEGOTIATOR: */195.209.147.37 */n37.keldysh.ru 
3/17 12:16:33 allow ADMINISTRATOR: */195.209.147.37 */n37.keldysh.ru 
3/17 12:16:33 allow OWNER: */n39.keldysh.ru */195.209.147.37 */n37.keldysh.ru */195.209.147.39 
3/17 12:16:33 allow DAEMON: */* 
3/17 12:16:33 allow ADVERTISE_STARTD: */* 
3/17 12:16:33 allow ADVERTISE_SCHEDD: */* 
3/17 12:16:33 allow ADVERTISE_MASTER: */* 
3/17 12:16:33 Trying to update collector <195.209.147.37:9618> 
3/17 12:16:33 Attempting to send update via UDP to collector n37.keldysh.ru <195.209.147.37:9618> 
3/17 12:16:33 Deleting the StartdHookMgr 
3/17 12:16:33 All resources are free, exiting. 
3/17 12:16:33 **** condor_startd.exe (condor_STARTD) pid 3692 EXITING WITH STATUS 0 
======== 
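
For anyone skimming the two logs above, here is a minimal sketch of a filter that pulls out only the lines that look like problems. It assumes the "M/D HH:MM:SS message" layout seen in these logs; the keyword list and the names LINE_RE, KEYWORDS, suspicious_lines and SAMPLE are made up for illustration and are not part of Condor. The sample lines are copied from the StartLog above.

```python
import re

# Timestamp prefix used in the logs above, e.g. "3/17 12:16:33 ".
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings treated as suspicious. This list is an assumption for
# illustration only; adjust it to whatever you are hunting for.
KEYWORDS = ("ERROR", "failed", "exited with status")


def suspicious_lines(log_text):
    """Return (timestamp, message) pairs whose message contains a keyword."""
    hits = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip separators and lines without a timestamp
        stamp, message = match.groups()
        if any(word in message for word in KEYWORDS):
            hits.append((stamp, message))
    return hits


# Three lines copied from the StartLog above.
SAMPLE = """\
3/17 12:16:23 my_popen: CreateProcess failed
3/17 12:16:33 slot1: Changing state: Owner -> Unclaimed
3/17 12:16:33 ERROR "select, error # = 10038" at line 2719 in file ..\\src\\condor_daemon_core.V6\\daemon_core.cpp
"""

if __name__ == "__main__":
    for stamp, message in suspicious_lines(SAMPLE):
        print(stamp, message)
```

Run against the full StartLog, a filter like this would probably also surface the "doInitialize() failed" and "Failed to execute" lines that come before the select error. For reference, 10038 is the Winsock code WSAENOTSOCK (an operation was attempted on something that is not a socket).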


When the startd dies, ".startd_address", ".startd_claim_id.slot1" and ".startd_claim_id.slot2" files disappear. 

BB> Is there something you are doing in your configuration file that is different than the other machines? 
BB> Regards, -B 


There is one difference: the Java location. I also found this message in the StarterLog: "3/17 12:16:23 JavaDetect: failure status 1 when executing C:\PROGRA~1\Java\jre6\bin\JAVA.EXE -Xmx1024m1787m -classpath C:\condor/lib;C:\condor/lib/scimark2lib.jar;. CondorJavaInfo old 2". 
condor_config contains "JAVA = C:\PROGRA~1\Java\jre6\bin\JAVA.EXE". My Java is located in "C:\Program Files\Java\jre6\bin". I tried changing the original "JAVA = C:\PROGRA~1\Java\jre6\bin\JAVA.EXE" in condor_config to "C:\Program Files\Java\jre6\bin", but without success. 

Thanks for your response. 
-- 
Pavel