
[HTCondor-users] Computing node is not exiting Claimed/Busy State



Hi all,

 

From time to time, some of our computing nodes stay stuck in the Claimed state with Busy activity after running a job, even though the LoadAv is 0.000 and the job has completed successfully.

 

I have noticed that the line "slot1: Called deactivate_claim_forcibly()" does not appear in the StartLog in these cases.

The last line written in the log is: 07/31/18 10:55:28 slot1: Changing activity: Idle -> Busy

 

The job finished at 11:43:37, according to StarterLog.slot1:

07/31/18 10:55:30 (pid:5876) Create_Process succeeded, pid=4848

07/31/18 11:43:37 (pid:5876) Process exited, pid=4848, status=0

Compared to an earlier job that finished normally, the following lines are missing:

07/31/18 10:54:04 (pid:5308) Got SIGQUIT.  Performing fast shutdown.

07/31/18 10:54:04 (pid:5308) ShutdownFast all jobs.

07/31/18 10:54:04 (pid:5308) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0

07/31/18 10:54:04 (pid:5308) **** condor_starter (condor_STARTER) pid 5308 EXITING WITH STATUS 0
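This comparison could be automated. Below is a minimal Python sketch, assuming the log lines keep the format shown above. The function name `starters_missing_exit` is hypothetical and not part of HTCondor. It reports starter pids that logged "Process exited" for their job but never logged their own "EXITING WITH STATUS" line.

```python
import re

# Assumed format, taken from the excerpts above, e.g.
# "07/31/18 11:43:37 (pid:5876) Process exited, pid=4848, status=0"
JOB_EXIT = re.compile(r"\(pid:(\d+)\) Process exited, pid=\d+")
# e.g. "... (pid:5308) **** condor_starter (condor_STARTER) pid 5308 EXITING WITH STATUS 0"
STARTER_EXIT = re.compile(
    r"\(pid:(\d+)\) \*\*\*\* condor_starter .* EXITING WITH STATUS"
)


def starters_missing_exit(log_lines):
    """Return starter pids whose job exited but which never logged their own exit.

    Hypothetical helper for illustration; pids are returned as strings,
    in the order their "Process exited" line was seen.
    """
    pending = []
    for line in log_lines:
        match = JOB_EXIT.search(line)
        if match:
            if match.group(1) not in pending:
                pending.append(match.group(1))
            continue
        match = STARTER_EXIT.search(line)
        if match and match.group(1) in pending:
            pending.remove(match.group(1))
    return pending
```

It can be called with the lines of a StarterLog file, for example `starters_missing_exit(open(path))`, where `path` points to the StarterLog.slot1 of the affected node.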

 

 

In the MasterLog, the following error appears approximately 15 minutes after the job completed:

07/31/18 11:58:21 ERROR: Child pid 5068 appears hung! Killing it hard.

07/31/18 11:58:21 DefaultReaper unexpectedly called on pid 5068, status 0.

07/31/18 11:58:21 The SHARED_PORT (pid 5068) was killed because it was no longer responding

07/31/18 11:58:21 restarting C:\PROGRA~2\condor\bin\condor_shared_port.exe in 10 seconds

07/31/18 11:58:31 Collector port not defined, will use default: 9618

07/31/18 11:58:31 Started DaemonCore process "C:\PROGRA~2\condor\bin\condor_shared_port.exe", pid and pgroup = 5396

 

Does anyone have an idea why these computing nodes stay stuck in the Claimed/Busy state?

For now, we have to restart the computing node to get it running again.

 

Cheers and thanks!

Florian Gandor