
Re: [HTCondor-users] Computing node is not exiting Claimed/Busy State



We recently fixed this bug:

Ticket #6597: Slots can become stuck in Claimed/Busy forever after a job completes

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6597

This bug is fixed in the 8.7.8 release.
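
To check whether a given node already runs a release with this fix, compare its version number against 8.7.8. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the version appears as a dotted triple somewhere in the string you pass in (for example, text copied from the output of condor_version). It says nothing about possible backports to other release series, which this message does not cover.

```python
import re

# Release named in this thread as containing the fix for ticket #6597.
FIXED_IN = (8, 7, 8)

def parse_version(text):
    """Return the first X.Y.Z triple found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no X.Y.Z version found in: %r" % text)
    return tuple(int(part) for part in match.groups())

def has_fix(version_text):
    """True if the version is 8.7.8 or later (plain tuple comparison)."""
    return parse_version(version_text) >= FIXED_IN

if __name__ == "__main__":
    for candidate in ("8.7.7", "8.7.8", "8.8.0"):
        print(candidate, has_fix(candidate))
```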

...Tim


On 08/08/2018 03:19 AM, Florian.Gandor@xxxxxxxxxxxx wrote:

Hi all,

 

From time to time, some computing nodes that execute jobs stay stuck in the Claimed state with Busy activity, even though the LoadAv is 0.000 and the job has completed successfully.

 

I have noticed that the following line does not appear in the StartLog: "slot1: Called deactivate_claim_forcibly()"

The last line written in the log is: 07/31/18 10:55:28 slot1: Changing activity: Idle -> Busy

 

The job finished at 11:43:37 as written in the StarterLog.slot1:

07/31/18 10:55:30 (pid:5876) Create_Process succeeded, pid=4848

07/31/18 11:43:37 (pid:5876) Process exited, pid=4848, status=0

Compared with earlier jobs that completed normally, these lines are missing:

07/31/18 10:54:04 (pid:5308) Got SIGQUIT.  Performing fast shutdown.

07/31/18 10:54:04 (pid:5308) ShutdownFast all jobs.

07/31/18 10:54:04 (pid:5308) SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0

07/31/18 10:54:04 (pid:5308) **** condor_starter (condor_STARTER) pid 5308 EXITING WITH STATUS 0
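
The symptom described above (a "Process exited" line with no later "EXITING WITH STATUS" line from the same starter pid) can be checked for mechanically. The following is a minimal sketch, assuming the StarterLog line layout is exactly as in the excerpts quoted in this message; the function name and the sample text are hypothetical, and the script is not part of HTCondor.

```python
import re
from datetime import datetime

# Matches lines such as:
# 07/31/18 11:43:37 (pid:5876) Process exited, pid=4848, status=0
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$"
)

def find_unfinished_starters(log_text):
    """Map starter pid -> job exit time for starters that logged a job
    exit but never logged their own "EXITING WITH STATUS" line."""
    job_exited = {}
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # blank or differently formatted line
        stamp, pid, message = match.groups()
        when = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        if message.startswith("Process exited"):
            job_exited[pid] = when
        elif "EXITING WITH STATUS" in message:
            job_exited.pop(pid, None)  # this starter shut down cleanly
    return job_exited

# Sample assembled from the log lines quoted in this message.
SAMPLE = """\
07/31/18 10:54:04 (pid:5308) Got SIGQUIT.  Performing fast shutdown.
07/31/18 10:54:04 (pid:5308) ShutdownFast all jobs.
07/31/18 10:54:04 (pid:5308) **** condor_starter (condor_STARTER) pid 5308 EXITING WITH STATUS 0
07/31/18 10:55:30 (pid:5876) Create_Process succeeded, pid=4848
07/31/18 11:43:37 (pid:5876) Process exited, pid=4848, status=0
"""

if __name__ == "__main__":
    for pid, when in find_unfinished_starters(SAMPLE).items():
        print("starter pid %s: job exited at %s, starter never exited" % (pid, when))
```

A starter reported by this check has finished its job but is still alive, which matches the slot staying in Claimed/Busy.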

 

 

In the MasterLog, the following error appears approximately 15 minutes after the job completion:

07/31/18 11:58:21 ERROR: Child pid 5068 appears hung! Killing it hard.

07/31/18 11:58:21 DefaultReaper unexpectedly called on pid 5068, status 0.

07/31/18 11:58:21 The SHARED_PORT (pid 5068) was killed because it was no longer responding

07/31/18 11:58:21 restarting C:\PROGRA~2\condor\bin\condor_shared_port.exe in 10 seconds

07/31/18 11:58:31 Collector port not defined, will use default: 9618

07/31/18 11:58:31 Started DaemonCore process "C:\PROGRA~2\condor\bin\condor_shared_port.exe", pid and pgroup = 5396

 

Does anyone have an idea why these computing nodes stay stuck in Claimed/Busy?

For now, we have to restart the computing node to get it running again.

 

Cheers and thanks!

Florian Gandor



-- 
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736