
Re: [Condor-users] SCHEDD not running right on upgraded CE with Condor 7.6.6




We use Rocks to install the Condor RPM.

We have the following line in /etc/sysconfig/condor to point to the system-wide
configuration file:
CONDOR_CONFIG="/share/apps/condor/etc/condor_config_7.6.6"
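
One quick sanity check is to confirm that the value in /etc/sysconfig/condor parses to the path you expect and that the file is readable on the host. The following is only a minimal Python sketch under that assumption; the function name read_condor_config_path is made up for illustration and is not part of Condor.

```python
import os
import shlex


def read_condor_config_path(sysconfig_text):
    """Return the CONDOR_CONFIG value from shell-style KEY=value text, or None.

    Hypothetical helper for illustration; not a Condor tool.
    """
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # shlex strips the surrounding quotes the way a POSIX shell would.
        for token in shlex.split(line, comments=True):
            key, sep, value = token.partition("=")
            if sep and key == "CONDOR_CONFIG":
                return value
    return None


# The line quoted above, used here as sample input.
sample = 'CONDOR_CONFIG="/share/apps/condor/etc/condor_config_7.6.6"\n'
path = read_condor_config_path(sample)
status = "exists" if path and os.path.isfile(path) else "is missing on this host"
print(path, status)
```

On the CE itself you would pass in the contents of /etc/sysconfig/condor rather than the sample string.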

The condor_config_7.6.6 is attached.

We did not see any alarming errors in either the MasterLog or the SchedLog
file. Both are attached as well.

BTW, we saw neither a Schedd_Event_Log nor a ShadowLog file, which leads us
to believe that it's not accepting jobs.
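
Since eyeballing long daemon logs is easy to get wrong, a small filter can list every line that mentions an error. The following is only a sketch in Python: the keyword list is an assumption to tune per site, the timestamp pattern is taken from the attached logs, and a hit is not by itself a diagnosis (for example, the errno 101 lines in the attached logs appear only during shutdown).

```python
import re

# Assumed keyword list; adjust it for your site.
SUSPICIOUS = re.compile(r"errno|error|failed", re.IGNORECASE)
# Timestamp layout as seen in the attached logs: MM/DD/YY HH:MM:SS,
# optionally followed by "(pid:NNNN)".
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:\(pid:\d+\) )?(.*)$")


def suspicious_lines(log_text):
    """Return (timestamp, message) pairs for log lines that mention an error."""
    hits = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match and SUSPICIOUS.search(match.group(2)):
            hits.append((match.group(1), match.group(2)))
    return hits


# A few lines copied from the attached SchedLog, used as sample input.
sample = """\
04/03/12 17:39:35 (pid:16226) Got SIGTERM. Performing graceful shutdown.
04/03/12 17:39:35 (pid:16226) sendMsg:sendto failed - errno: 101
04/03/12 17:39:35 (pid:16226) ProcFamilyClient: failed to read response from ProcD
04/03/12 17:42:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
"""

for stamp, message in suspicious_lines(sample):
    print(stamp, message)
```

Running it over the real MasterLog and SchedLog files would mean reading each file and passing its text to suspicious_lines.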

Thanks.....

Steven.....


On 04/03/2012 06:50 PM, Alain Roy wrote:
> On Apr 3, 2012, at 8:37 PM, Steven Lo wrote:
>> Hi,
>>
>> We just upgraded Condor from 7.4.1 to 7.6.6 on one of our CEs.
>>
>> When we do a condor_q, the following error pops out:
>>
>> # condor_q
>> Error:
>>
>> Extra Info: You probably saw this error because the condor_schedd is not
>> running on the machine you are trying to query.
>>
>> We did see that both schedd and startd are running:
>>
>> condor    6518  6490  0 17:42 ?        00:00:00 condor_startd -f
>> condor    6520  6490  0 17:42 ?        00:00:00 condor_schedd -f
>
> That's interesting. How did you install Condor? Do you have CONDOR_CONFIG set? Are there errors in the MasterLog or the SchedLog?
>
> -alain
> ------------------------------
> Alain Roy
> Condor Project
> roy@xxxxxxxxxxx
> http://www.cs.wisc.edu/condor



Attachment: condor_config_7.6.6
Description: Unix manual page

04/03/12 17:23:27 Setting maximum accepts per cycle 4.
04/03/12 17:23:27 ******************************************************
04/03/12 17:23:27 ** condor_startd (CONDOR_STARTD) STARTING UP
04/03/12 17:23:27 ** /usr/sbin/condor_startd
04/03/12 17:23:27 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
04/03/12 17:23:27 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
04/03/12 17:23:27 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/03/12 17:23:27 ** $CondorPlatform: x86_64_rhap_5 $
04/03/12 17:23:27 ** PID = 16225
04/03/12 17:23:27 ** Log last touched time unavailable (No such file or directory)
04/03/12 17:23:27 ******************************************************
04/03/12 17:23:27 Using config source: /share/apps/condor/etc/condor_config_7.6.6
04/03/12 17:23:27 Using local config sources: 
04/03/12 17:23:27    /share/apps/condor/hosts/cithep252/condor_config.local
04/03/12 17:23:27 DaemonCore: command socket at <10.3.255.253:56762>
04/03/12 17:23:27 DaemonCore: private command socket at <10.3.255.253:56762>
04/03/12 17:23:27 Setting maximum accepts per cycle 4.
04/03/12 17:23:32 VM-gahp server reported an internal error
04/03/12 17:23:32 VM universe will be tested to check if it is available
04/03/12 17:23:32 History file rotation is enabled.
04/03/12 17:23:32   Maximum history file size is: 20971520 bytes
04/03/12 17:23:32   Number of rotated history files is: 2
04/03/12 17:23:32 slot1: New machine resource allocated
04/03/12 17:23:32 slot2: New machine resource allocated
04/03/12 17:23:32 slot3: New machine resource allocated
04/03/12 17:23:32 slot4: New machine resource allocated
04/03/12 17:23:32 slot5: New machine resource allocated
04/03/12 17:23:32 slot6: New machine resource allocated
04/03/12 17:23:32 slot7: New machine resource allocated
04/03/12 17:23:32 slot8: New machine resource allocated
04/03/12 17:23:32 CronJobList: Adding job 'MIPS'
04/03/12 17:23:32 CronJobList: Adding job 'KFLOPS'
04/03/12 17:23:32 CronJob: Initializing job 'MIPS' (/usr/libexec/condor/condor_mips)
04/03/12 17:23:32 CronJob: Initializing job 'KFLOPS' (/usr/libexec/condor/condor_kflops)
04/03/12 17:39:35 Got SIGTERM. Performing graceful shutdown.
04/03/12 17:39:35 shutdown graceful
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 Killing job MIPS
04/03/12 17:39:35 Killing job KFLOPS
04/03/12 17:39:35 Deleting cron job manager
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 CronJobList: Deleting all jobs
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 CronJobList: Deleting all jobs
04/03/12 17:39:35 Deleting benchmark job mgr
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 Killing job MIPS
04/03/12 17:39:35 Killing job KFLOPS
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 Killing job MIPS
04/03/12 17:39:35 Killing job KFLOPS
04/03/12 17:39:35 CronJobList: Deleting all jobs
04/03/12 17:39:35 CronJobList: Deleting job 'MIPS'
04/03/12 17:39:35 CronJob: Deleting job 'MIPS' (/usr/libexec/condor/condor_mips), timer -1
04/03/12 17:39:35 CronJobList: Deleting job 'KFLOPS'
04/03/12 17:39:35 CronJob: Deleting job 'KFLOPS' (/usr/libexec/condor/condor_kflops), timer -1
04/03/12 17:39:35 Cron: Killing all jobs
04/03/12 17:39:35 CronJobList: Deleting all jobs
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 All resources are free, exiting.
04/03/12 17:39:35 **** condor_startd (condor_STARTD) pid 16225 EXITING WITH STATUS 0
04/03/12 17:42:22 Setting maximum accepts per cycle 4.
04/03/12 17:42:22 ******************************************************
04/03/12 17:42:22 ** condor_startd (CONDOR_STARTD) STARTING UP
04/03/12 17:42:22 ** /usr/sbin/condor_startd
04/03/12 17:42:22 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
04/03/12 17:42:22 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
04/03/12 17:42:22 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/03/12 17:42:22 ** $CondorPlatform: x86_64_rhap_5 $
04/03/12 17:42:22 ** PID = 6518
04/03/12 17:42:22 ** Log last touched 4/3 17:39:35
04/03/12 17:42:22 ******************************************************
04/03/12 17:42:22 Using config source: /share/apps/condor/etc/condor_config_7.6.6
04/03/12 17:42:22 Using local config sources: 
04/03/12 17:42:22    /share/apps/condor/hosts/cithep252/condor_config.local
04/03/12 17:42:22 DaemonCore: command socket at <10.3.255.253:50008>
04/03/12 17:42:22 DaemonCore: private command socket at <10.3.255.253:50008>
04/03/12 17:42:22 Setting maximum accepts per cycle 4.
04/03/12 17:42:30 VM-gahp server reported an internal error
04/03/12 17:42:30 VM universe will be tested to check if it is available
04/03/12 17:42:30 History file rotation is enabled.
04/03/12 17:42:30   Maximum history file size is: 20971520 bytes
04/03/12 17:42:30   Number of rotated history files is: 2
04/03/12 17:42:30 slot1: New machine resource allocated
04/03/12 17:42:30 slot2: New machine resource allocated
04/03/12 17:42:30 slot3: New machine resource allocated
04/03/12 17:42:30 slot4: New machine resource allocated
04/03/12 17:42:30 slot5: New machine resource allocated
04/03/12 17:42:30 slot6: New machine resource allocated
04/03/12 17:42:30 slot7: New machine resource allocated
04/03/12 17:42:30 slot8: New machine resource allocated
04/03/12 17:42:30 CronJobList: Adding job 'MIPS'
04/03/12 17:42:30 CronJobList: Adding job 'KFLOPS'
04/03/12 17:42:30 CronJob: Initializing job 'MIPS' (/usr/libexec/condor/condor_mips)
04/03/12 17:42:30 CronJob: Initializing job 'KFLOPS' (/usr/libexec/condor/condor_kflops)
04/03/12 17:23:27 (pid:16226) Setting maximum accepts per cycle 4.
04/03/12 17:23:27 (pid:16226) ******************************************************
04/03/12 17:23:27 (pid:16226) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
04/03/12 17:23:27 (pid:16226) ** /usr/sbin/condor_schedd
04/03/12 17:23:27 (pid:16226) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
04/03/12 17:23:27 (pid:16226) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
04/03/12 17:23:27 (pid:16226) ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/03/12 17:23:27 (pid:16226) ** $CondorPlatform: x86_64_rhap_5 $
04/03/12 17:23:27 (pid:16226) ** PID = 16226
04/03/12 17:23:27 (pid:16226) ** Log last touched time unavailable (No such file or directory)
04/03/12 17:23:27 (pid:16226) ******************************************************
04/03/12 17:23:27 (pid:16226) Using config source: /share/apps/condor/etc/condor_config_7.6.6
04/03/12 17:23:27 (pid:16226) Using local config sources: 
04/03/12 17:23:27 (pid:16226)    /share/apps/condor/hosts/cithep252/condor_config.local
04/03/12 17:23:27 (pid:16226) DaemonCore: command socket at <10.3.255.253:32919>
04/03/12 17:23:27 (pid:16226) DaemonCore: private command socket at <10.3.255.253:32919>
04/03/12 17:23:27 (pid:16226) Setting maximum accepts per cycle 4.
04/03/12 17:23:27 (pid:16226) History file rotation is enabled.
04/03/12 17:23:27 (pid:16226)   Maximum history file size is: 20971520 bytes
04/03/12 17:23:27 (pid:16226)   Number of rotated history files is: 2
04/03/12 17:23:27 (pid:16226) Logging per-job history files to: /osg/1.2.8/gratia/var/data
04/03/12 17:23:32 (pid:16226) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:28:33 (pid:16226) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:33:34 (pid:16226) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:38:35 (pid:16226) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:39:35 (pid:16226) Got SIGTERM. Performing graceful shutdown.
04/03/12 17:39:35 (pid:16226) Deleting CronJobMgr
04/03/12 17:39:35 (pid:16226) Cron: Killing all jobs
04/03/12 17:39:35 (pid:16226) Cron: Killing all jobs
04/03/12 17:39:35 (pid:16226) CronJobList: Deleting all jobs
04/03/12 17:39:35 (pid:16226) Cron: Killing all jobs
04/03/12 17:39:35 (pid:16226) CronJobList: Deleting all jobs
04/03/12 17:39:35 (pid:16226) sendMsg:sendto failed - errno: 101
04/03/12 17:39:35 (pid:16226) All shadows are gone, exiting.
04/03/12 17:39:35 (pid:16226) error reading from named pipe: watchdog pipe has closed
04/03/12 17:39:35 (pid:16226) ProcFamilyClient: failed to read response from ProcD
04/03/12 17:39:35 (pid:16226) error telling ProcD to exit
04/03/12 17:39:35 (pid:16226) **** condor_schedd (condor_SCHEDD) pid 16226 EXITING WITH STATUS 0
04/03/12 17:42:23 (pid:6520) Setting maximum accepts per cycle 4.
04/03/12 17:42:23 (pid:6520) ******************************************************
04/03/12 17:42:23 (pid:6520) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
04/03/12 17:42:23 (pid:6520) ** /usr/sbin/condor_schedd
04/03/12 17:42:23 (pid:6520) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
04/03/12 17:42:23 (pid:6520) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
04/03/12 17:42:23 (pid:6520) ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/03/12 17:42:23 (pid:6520) ** $CondorPlatform: x86_64_rhap_5 $
04/03/12 17:42:23 (pid:6520) ** PID = 6520
04/03/12 17:42:23 (pid:6520) ** Log last touched 4/3 17:39:35
04/03/12 17:42:23 (pid:6520) ******************************************************
04/03/12 17:42:23 (pid:6520) Using config source: /share/apps/condor/etc/condor_config_7.6.6
04/03/12 17:42:23 (pid:6520) Using local config sources: 
04/03/12 17:42:23 (pid:6520)    /share/apps/condor/hosts/cithep252/condor_config.local
04/03/12 17:42:23 (pid:6520) DaemonCore: command socket at <10.3.255.253:48116>
04/03/12 17:42:23 (pid:6520) DaemonCore: private command socket at <10.3.255.253:48116>
04/03/12 17:42:23 (pid:6520) Setting maximum accepts per cycle 4.
04/03/12 17:42:23 (pid:6520) History file rotation is enabled.
04/03/12 17:42:23 (pid:6520)   Maximum history file size is: 20971520 bytes
04/03/12 17:42:23 (pid:6520)   Number of rotated history files is: 2
04/03/12 17:42:23 (pid:6520) Logging per-job history files to: /osg/1.2.8/gratia/var/data
04/03/12 17:42:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:47:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:52:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:57:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:02:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:07:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:12:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:17:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:22:28 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:27:29 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:32:30 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:37:31 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:42:32 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:47:33 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:52:34 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 18:57:35 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 19:02:36 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 19:07:37 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 19:12:38 (pid:6520) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
04/03/12 17:23:26 Setting maximum accepts per cycle 4.
04/03/12 17:23:26 ******************************************************
04/03/12 17:23:26 ** condor_master (CONDOR_MASTER) STARTING UP
04/03/12 17:23:26 ** /usr/sbin/condor_master
04/03/12 17:23:26 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/03/12 17:23:26 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/03/12 17:23:26 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/03/12 17:23:26 ** $CondorPlatform: x86_64_rhap_5 $
04/03/12 17:23:26 ** PID = 16224
04/03/12 17:23:26 ** Log last touched time unavailable (No such file or directory)
04/03/12 17:23:26 ******************************************************
04/03/12 17:23:26 Using config source: /share/apps/condor/etc/condor_config_7.6.6
04/03/12 17:23:26 Using local config sources: 
04/03/12 17:23:26    /share/apps/condor/hosts/cithep252/condor_config.local
04/03/12 17:23:26 DaemonCore: command socket at <10.3.255.253:55402>
04/03/12 17:23:26 DaemonCore: private command socket at <10.3.255.253:55402>
04/03/12 17:23:26 Setting maximum accepts per cycle 4.
04/03/12 17:23:26 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 16225
04/03/12 17:23:27 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 16226
04/03/12 17:39:35 Got SIGTERM. Performing graceful shutdown.
04/03/12 17:39:35 SafeMsg: sending small msg failed. errno: 101
04/03/12 17:39:35 Sent SIGTERM to SCHEDD (pid 16226)
04/03/12 17:39:35 Sent SIGTERM to STARTD (pid 16225)
04/03/12 17:39:35 The STARTD (pid 16225) exited with status 0
04/03/12 17:39:35 The SCHEDD (pid 16226) exited with status 0
04/03/12 17:39:35 All daemons are gone.  Exiting.
04/03/12 17:39:35 **** condor_master (condor_MASTER) pid 16224 EXITING WITH STATUS 0
04/03/12 17:42:22 Setting maximum accepts per cycle 4.
04/03/12 17:42:22 ******************************************************
04/03/12 17:42:22 ** condor_master (CONDOR_MASTER) STARTING UP
04/03/12 17:42:22 ** /usr/sbin/condor_master
04/03/12 17:42:22 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/03/12 17:42:22 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/03/12 17:42:22 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/03/12 17:42:22 ** $CondorPlatform: x86_64_rhap_5 $
04/03/12 17:42:22 ** PID = 6490
04/03/12 17:42:22 ** Log last touched 4/3 17:39:35
04/03/12 17:42:22 ******************************************************
04/03/12 17:42:22 Using config source: /share/apps/condor/etc/condor_config_7.6.6
04/03/12 17:42:22 Using local config sources: 
04/03/12 17:42:22    /share/apps/condor/hosts/cithep252/condor_config.local
04/03/12 17:42:22 DaemonCore: command socket at <10.3.255.253:46860>
04/03/12 17:42:22 DaemonCore: private command socket at <10.3.255.253:46860>
04/03/12 17:42:22 Setting maximum accepts per cycle 4.
04/03/12 17:42:22 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 6518
04/03/12 17:42:22 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 6520
04/03/12 18:42:22 Preen pid is 24588
