[HTCondor-users] Jobs delayed and schedd logging problem



I was trying to figure out why my jobs take so long to start and noticed something odd in the SchedLog: it seems to be creating a new log file every 2 seconds or so. Here's the entire SchedLog:

06/12/14 13:47:29 (pid:8249) Now in new log file /usr/local/condor/local.workstation1/log/SchedLog
06/12/14 13:47:29 (pid:8249) Number of Active Workers 2
06/12/14 13:47:29 (pid:8249) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.

In SchedLog.old I see similar entries:
...
 (pid:79460) Number of Active Workers 1
 (pid:20275) Number of Active Workers 0
 (pid:20275) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.
 (pid:79460) Number of Active Workers 2
 (pid:20276) Number of Active Workers 1
 (pid:79460) Number of Active Workers 3
 (pid:20276) GET_JOB_CONNECT_INFO failed: Job 12.2 is not running.
 (pid:20277) Number of Active Workers 2
 (pid:20277) GET_JOB_CONNECT_INFO failed: Job 12.0 is not running.
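
For anyone who wants to check the same pattern in their own logs, here is a minimal Python sketch that counts "Now in new log file" entries and tallies lines per pid. The names `LINE_RE` and `summarize` are made up for illustration, and the line format is assumed from the excerpts above (the timestamp is treated as optional because the SchedLog.old excerpt omits it).

```python
import re
from collections import Counter

# Matches lines such as
#   06/12/14 13:47:29 (pid:8249) Number of Active Workers 2
# The timestamp is optional because the SchedLog.old excerpt omits it.
LINE_RE = re.compile(
    r"^\s*(?:(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+)?"
    r"\(pid:(?P<pid>\d+)\)\s+(?P<msg>.*)$"
)

def summarize(lines):
    """Count log rotations and lines per pid in SchedLog-style lines."""
    rotations = 0
    per_pid = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip blank lines and anything in another format
        per_pid[int(m.group("pid"))] += 1
        if m.group("msg").startswith("Now in new log file"):
            rotations += 1
    return rotations, per_pid

# Sample lines copied from the excerpts in this message.
sample = [
    "06/12/14 13:47:29 (pid:8249) Now in new log file /usr/local/condor/local.workstation1/log/SchedLog",
    "06/12/14 13:47:29 (pid:8249) Number of Active Workers 2",
    "06/12/14 13:47:29 (pid:8249) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.",
    " (pid:79460) Number of Active Workers 1",
    " (pid:20275) Number of Active Workers 0",
]
rotations, per_pid = summarize(sample)
print(rotations, dict(per_pid))
```

Running it over the full SchedLog and SchedLog.old (for example with `summarize(open(path))`) would show how many rotations there are and how many different pids are writing to the same file.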

The jobs eventually run, but they take much longer than they should to start (sometimes over 30 minutes). I don't see any errors in the local logs, and nothing appears on the master. It's only a two-node cluster, with the remote node being the master. There are no other jobs in the global queue.

'condor_q -global' on master reports:
All queues are empty

condor_config.local on both nodes:
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

'condor_q -analyze' reports:
012.000:  Request has not yet been considered by the matchmaker.

From SchedLog on master every five minutes:
 -------- Begin starting jobs --------
 -------- Done starting jobs --------
 Getting monitoring info for pid 66690
 JobsRunning = 0
 JobsIdle = 0
 JobsHeld = 0
 JobsRemoved = 0
 LocalUniverseJobsRunning = 0
 LocalUniverseJobsIdle = 0
 SchedUniverseJobsRunning = 0
 SchedUniverseJobsIdle = 0

It looks like the schedd is taking a while to send the job to the master, but I don't see any reason why. Any help would be greatly appreciated.

Thanks,

Josh