
[HTCondor-users] Idle Jobs



I have a user here who has submitted 20,000 jobs. I have updated most of our pool to 8.6, including the CM and the submit node. Jobs are running only on a machine that is still on 8.4.10. My first thought is to downgrade back to 8.4.10. Any help is appreciated.

Here is a snippet of the StartLog from one of the machines.

02/07/17 09:01:07 Starter pid 144129 exited with status 0
02/07/17 09:01:07 slot1: State change: starter exited
02/07/17 09:01:07 slot1: Changing activity: Busy -> Idle
02/07/17 09:01:07 slot1: Got activate_claim request from shadow (CM IP)
02/07/17 09:01:07 slot1: Remote job ID is 956.82
02/07/17 09:01:07 slot1: Got universe "VANILLA" (5) from request classad
02/07/17 09:01:07 slot1: State change: claim-activation protocol successful
02/07/17 09:01:07 slot1: Changing activity: Idle -> Busy
02/07/17 09:01:09 slot1: Called deactivate_claim()
02/07/17 09:01:09 Starter pid 144155 exited with status 0
02/07/17 09:01:09 slot1: State change: starter exited
02/07/17 09:01:09 slot1: Changing activity: Busy -> Idle
02/07/17 09:01:10 slot1: State change: received RELEASE_CLAIM command
02/07/17 09:01:10 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
02/07/17 09:01:10 slot1: State change: No preempting claim, returning to owner
02/07/17 09:01:10 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
02/07/17 09:01:10 slot1: State change: IS_OWNER is false
02/07/17 09:01:10 slot1: Changing state: Owner -> Unclaimed
02/07/17 09:21:52 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
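
To put a number on how briefly the slot stays Busy in the StartLog above, here is a minimal sketch. It is not an HTCondor tool; the function name busy_durations and the regular expression are my own assumptions, based only on the line format shown in this snippet.

```python
import re
from datetime import datetime

# Matches StartLog lines such as:
#   02/07/17 09:01:07 slot1: Changing activity: Idle -> Busy
# The pattern is an assumption based on the snippet above.
ACTIVITY_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\S+): "
    r"Changing activity: (\w+) -> (\w+)$"
)

def busy_durations(lines):
    """Return (slot, seconds) for each Idle -> Busy / Busy -> Idle pair."""
    started = {}    # slot name -> time the slot last became Busy
    durations = []
    for line in lines:
        m = ACTIVITY_RE.match(line.strip())
        if not m:
            continue  # ignore lines that are not activity changes
        stamp, slot, old, new = m.groups()
        when = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        if (old, new) == ("Idle", "Busy"):
            started[slot] = when
        elif (old, new) == ("Busy", "Idle") and slot in started:
            elapsed = (when - started.pop(slot)).total_seconds()
            durations.append((slot, elapsed))
    return durations

# Three activity lines copied from the StartLog snippet above.
sample = """\
02/07/17 09:01:07 slot1: Changing activity: Busy -> Idle
02/07/17 09:01:07 slot1: Changing activity: Idle -> Busy
02/07/17 09:01:09 slot1: Changing activity: Busy -> Idle
""".splitlines()

print(busy_durations(sample))
```

On the lines above, this reports that slot1 was Busy for about two seconds before the starter exited with status 0.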

MasterLog

02/03/17 13:51:15 Reconfiguring all managed daemons.
02/03/17 13:51:15 Sent SIGHUP to SHARED_PORT (pid 4535)
02/03/17 13:51:15 Sent SIGHUP to STARTD (pid 4536)
02/04/17 12:13:10 Preen pid is 95024
02/04/17 12:13:10 DefaultReaper unexpectedly called on pid 95024, status 0.
02/05/17 12:13:10 Preen pid is 111714
02/05/17 12:13:10 DefaultReaper unexpectedly called on pid 111714, status 0.
02/06/17 12:13:10 Preen pid is 129082
02/06/17 12:13:10 DefaultReaper unexpectedly called on pid 129082, status 0.

SchedLog (the last entries are from Jan. 30)

01/30/17 10:57:45 (pid:5370) ******************************************************
01/30/17 10:57:45 (pid:5370) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
01/30/17 10:57:45 (pid:5370) ** /usr/sbin/condor_schedd
01/30/17 10:57:45 (pid:5370) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
01/30/17 10:57:45 (pid:5370) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
01/30/17 10:57:45 (pid:5370) ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
01/30/17 10:57:45 (pid:5370) ** $CondorPlatform: x86_64_RedHat7 $
01/30/17 10:57:45 (pid:5370) ** PID = 5370
01/30/17 10:57:45 (pid:5370) ** Log last touched 4/7 13:45:10
01/30/17 10:57:45 (pid:5370) ******************************************************
01/30/17 10:57:45 (pid:5370) Using config source: /etc/condor/condor_config
01/30/17 10:57:45 (pid:5370) Using local config sources:
01/30/17 10:57:45 (pid:5370)    /etc/condor/config.d/01condor_config_IP
01/30/17 10:57:45 (pid:5370)    /etc/condor/config.d/01condor_config_IP_Host
01/30/17 10:57:45 (pid:5370)    /etc/condor/config.d/02condor_config_Access
01/30/17 10:57:45 (pid:5370)    /etc/condor/config.d/03condor_config_flocking
01/30/17 10:57:45 (pid:5370)    /etc/condor/config.d/04condor_config_Docker
01/30/17 10:57:45 (pid:5370)    /etc/condor/condor_config.local
01/30/17 10:57:45 (pid:5370) config Macros = 67, Sorted = 67, StringBytes = 1989, TablesBytes = 2492
01/30/17 10:57:45 (pid:5370) CLASSAD_CACHING is ENABLED
01/30/17 10:57:45 (pid:5370) Daemon Log is logging: D_ALWAYS D_ERROR
01/30/17 10:57:45 (pid:5370) SharedPortEndpoint: waiting for connections to named socket 4190_91e9_14
01/30/17 10:57:45 (pid:5370) DaemonCore: command socket at <x.x.x.x:9618?addrs=x.x.x.x-9618+[--1]-9618&noUDP&sock=4190_91e9_14>
01/30/17 10:57:45 (pid:5370) DaemonCore: private command socket at <x.x.x.x:9618?addrs=x.x.x.x-9618+[--1]-9618&noUDP&sock=4190_91e9_14>
01/30/17 10:57:45 (pid:5370) History file rotation is enabled.
01/30/17 10:57:45 (pid:5370)   Maximum history file size is: 20971520 bytes
01/30/17 10:57:45 (pid:5370)   Number of rotated history files is: 2
01/30/17 10:57:45 (pid:5370) my_popenv: Failed to exec in child, errno=2 (No such file or directory)
01/30/17 10:57:45 (pid:5370) Failed to execute /usr/sbin/condor_shadow.std, ignoring
01/30/17 10:57:51 (pid:5370) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/30/17 10:57:51 (pid:5370) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
01/30/17 10:57:51 (pid:5370) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
01/30/17 10:58:04 (pid:5370) Got SIGTERM. Performing graceful shutdown.
01/30/17 10:58:04 (pid:5370) Deleting CronJobMgr
01/30/17 10:58:04 (pid:5370) Cron: Killing all jobs
01/30/17 10:58:04 (pid:5370) Cron: Killing all jobs
01/30/17 10:58:04 (pid:5370) CronJobList: Deleting all jobs
01/30/17 10:58:04 (pid:5370) Cron: Killing all jobs
01/30/17 10:58:04 (pid:5370) CronJobList: Deleting all jobs
01/30/17 10:58:04 (pid:5370) All shadows are gone, exiting.
01/30/17 10:58:13 (pid:5467) Setting maximum file descriptors to 4096.
01/30/17 10:58:13 (pid:5467) ******************************************************
01/30/17 10:58:13 (pid:5467) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
01/30/17 10:58:13 (pid:5467) ** /usr/sbin/condor_schedd
01/30/17 10:58:13 (pid:5467) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
01/30/17 10:58:13 (pid:5467) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
01/30/17 10:58:13 (pid:5467) ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
01/30/17 10:58:13 (pid:5467) ** $CondorPlatform: x86_64_RedHat7 $
01/30/17 10:58:13 (pid:5467) ** PID = 5467
01/30/17 10:58:13 (pid:5467) ** Log last touched 1/30 10:58:04
01/30/17 10:58:13 (pid:5467) ******************************************************
01/30/17 10:58:13 (pid:5467) Using config source: /etc/condor/condor_config
01/30/17 10:58:13 (pid:5467) Using local config sources:
01/30/17 10:58:13 (pid:5467)    /etc/condor/config.d/01condor_config_IP
01/30/17 10:58:13 (pid:5467)    /etc/condor/config.d/01condor_config_IP_Host
01/30/17 10:58:13 (pid:5467)    /etc/condor/config.d/02condor_config_Access
01/30/17 10:58:13 (pid:5467)    /etc/condor/config.d/03condor_config_flocking
01/30/17 10:58:13 (pid:5467)    /etc/condor/config.d/04condor_config_Docker
01/30/17 10:58:13 (pid:5467)    /etc/condor/condor_config.local
01/30/17 10:58:13 (pid:5467) config Macros = 67, Sorted = 67, StringBytes = 1989, TablesBytes = 2492
01/30/17 10:58:13 (pid:5467) CLASSAD_CACHING is ENABLED
01/30/17 10:58:13 (pid:5467) Daemon Log is logging: D_ALWAYS D_ERROR
01/30/17 10:58:13 (pid:5467) SharedPortEndpoint: waiting for connections to named socket 5430_1b59_4
01/30/17 10:58:13 (pid:5467) DaemonCore: command socket at <x.x.x.:9618?addrs=-9618+[--1]-9618&noUDP&sock=5430_1b59_4>
01/30/17 10:58:13 (pid:5467) DaemonCore: private command socket at <x.x.x.x:9618?addrs=x.x.x.x-9618+[--1]-9618&noUDP&sock=5430_1b59_4>
01/30/17 10:58:13 (pid:5467) History file rotation is enabled.
01/30/17 10:58:13 (pid:5467)   Maximum history file size is: 20971520 bytes
01/30/17 10:58:13 (pid:5467)   Number of rotated history files is: 2
01/30/17 10:58:13 (pid:5467) my_popenv: Failed to exec in child, errno=2 (No such file or directory)
01/30/17 10:58:13 (pid:5467) Failed to execute /usr/sbin/condor_shadow.std, ignoring
01/30/17 10:58:19 (pid:5467) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/30/17 10:58:19 (pid:5467) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
01/30/17 10:58:19 (pid:5467) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load



Thanks

Jon