
Re: [HTCondor-users] jobs stuck; cannot get rid of them.



Dear Tim,

That's *extremely* probable!
I just remembered that we also had this case once when the submit node locked up (network => file system issues) and could not cleanly release its claims for a prolonged period.

Since we don't have a test cluster (and no easy reproducer) right now, we will likely just wait and see whether the issue goes away with the next stable release (8.8) when it comes out.

Many thanks for the link!

Cheers and all the best,
	Oliver

On 05.12.18 at 20:46, Tim Theisen wrote:
> In 8.7.8, we fixed a bug that could cause a machine slot to become stuck in the Claimed/Busy state after a job completes. https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6597 Perhaps that is the source of the problem.
> 
> ...Tim
> 
> On 12/3/18 8:50 AM, Oliver Freyermuth wrote:
>> Hi Harry,
>>
>> we never had 8.4 in use, so I cannot tell...
>>
>> For us, it only happens if users manage to slow machines down so much that they do not respond for several minutes,
>> e.g. either by quickly filling up all swap space (by doing non-direct I/O with 56 fast jobs in parallel with jobs consuming most of the memory),
>> or by filling up and fragmenting all memory so that the InfiniBand drivers have trouble allocating anything, which also makes the file system quite unhappy (clients not responding for > 5 minutes).
>>
>> We are countering that for now by increasing the memory locked for swiotlb and reducing swappiness in all cgroups (which needs some hacks due to systemd bugs: https://github.com/systemd/systemd/issues/9276 ).
>> I guess you should be able to reproduce it by flooding a machine with constant heavy I/O to local scratch or by deadlocking the network card(s); a rough sketch follows below.
>> If you need some of our users to help trigger the issue on your cluster, let us know ;-).
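>>
>> A minimal sketch of such an I/O flood (the scratch path, file size and job
>> count are placeholders, not our actual setup):
>>
>>   # start 56 parallel writers doing buffered (non-direct) I/O to local scratch
>>   for i in $(seq 1 56); do
>>     dd if=/dev/zero of=/scratch/fill.$i bs=1M count=4096 &
>>   done
>>   wait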
>>
>> Cheers and thanks,
>> 	Oliver
>>
>>
>>> On 03.12.18 at 14:56, Harald van Pee wrote:
>>> Hi all,
>>>
>>> Does this happen newly with HTCondor 8.6?
>>> We have never seen it here with HTCondor 8.4 on Debian 7, 8, and 9.
>>>
>>> Best
>>> Harald
>>>
>>>
>>> On Monday, December 3, 2018 10:14:48 AM CET Oliver Freyermuth wrote:
>>>> Hi together,
>>>>
>>>> we also observe this regularly. Users complain that resources are still
>>>> accounted to them in condor_userprio even though they have no running jobs;
>>>> I then go and check on all nodes and find condor_starters running without any job.
>>>>
>>>> I'm currently using:
>>>>
>>>> for A in <allOurComputeNodes>; do echo $A; ssh $A '
>>>>   for P in $(pidof condor_starter); do
>>>>     CHILD_CNT=$(ps --ppid $P --no-headers | wc -l)
>>>>     if [ $CHILD_CNT -eq 0 ]; then echo "HTCondor Bug"; pstree -p $P; kill $P; fi
>>>>   done'
>>>> done
>>>>
>>>> to clean those up, but of course that may also catch starters which are just
>>>> doing file transfer, or are waiting for new jobs (CLAIM_WORKLIFE).
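>>>>
>>>> A slightly safer variant (just a sketch; the 10-minute grace period is an
>>>> arbitrary guess, and starters idling within CLAIM_WORKLIFE may still be
>>>> caught) re-checks each childless starter after a delay before killing it:
>>>>
>>>>   for P in $(pidof condor_starter); do
>>>>     # skip starters that currently have a child (job or file transfer running)
>>>>     [ $(ps --ppid $P --no-headers | wc -l) -gt 0 ] && continue
>>>>     sleep 600   # grace period before the second check
>>>>     # kill only if the starter still exists and is still childless
>>>>     kill -0 $P 2>/dev/null || continue
>>>>     if [ $(ps --ppid $P --no-headers | wc -l) -eq 0 ]; then
>>>>       pstree -p $P
>>>>       kill $P
>>>>     fi
>>>>   done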
>>>>
>>>> It seems to be triggered when the compute node is busy (swapping, hanging) for
>>>> a short while and does not give timely responses over the network. A better fix
>>>> than the hack described above would be greatly appreciated.
>>>>
>>>> Cheers,
>>>> 	Oliver
>>>>
>>>> On 29.11.18 at 21:13, Collin Mehring wrote:
>>>>> Hi Stephen,
>>>>>
>>>>> We ran into this too. In our case the condor_starter process that was
>>>>> handling each of those jobs didn't exit properly and was still running.
>>>>> Connecting to the host and killing the stuck condor_starter process fixed
>>>>> the issue.
>>>>>
>>>>> Alternatively, restarting Condor on the hosts will also get rid of
>>>>> anything still running and update the collector.
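>>>>>
>>>>> For example, from a machine with ADMINISTRATOR access to the execute node
>>>>> (<hostname> is a placeholder):
>>>>>
>>>>>   # restart only the startd (and thereby its starters) on the affected node
>>>>>   condor_restart -name <hostname> -startd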
>>>>>
>>>>> Hope that helps,
>>>>> Collin
>>>>>
>>>>> More Information for the curious:
>>>>>
>>>>> Here's the end of the StarterLog for one of the affected slots:
>>>>> 08/28/18 18:51:52 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at /<IP removed>/ (try 1 of 3): SECMAN:2003:TCP connection to daemon at /<IP removed>/ failed.
>>>>> 08/28/18 18:53:34 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
>>>>> 08/28/18 18:53:34 Process exited, pid=43481, status=0
>>>>>
>>>>> The pid listed was for the job running on that slot, which successfully
>>>>> exited and finished elsewhere.
>>>>>
>>>>> We noticed this happening because it was affecting the group accounting
>>>>> during negotiation. The negotiator would allocate the correct number of
>>>>> slots using the number of jobs from the Schedd, but would then skip
>>>>> negotiation for that group because it used the incorrect number of jobs
>>>>> from the collector when determining current resource usage.
>>>>>
>>>>> Here's an example where the submitter had only one pending 1-core job and
>>>>> no running jobs, but there were two stuck slots with 32-core jobs from
>>>>> that submitter:
>>>>> 11/26/18 12:40:59 group quotas: group= prod./<group removed>/ quota= 511.931 requested= 1 allocated= 1 unallocated= 0
>>>>> <...>
>>>>> 11/26/18 12:40:59 subtree_usage at prod./<group removed>/ is 64
>>>>> <...>
>>>>> 11/26/18 12:41:01 Group prod./<group removed>/ - skipping, at or over quota (quota=511.931) (usage=64) (allocation=1)
>>>>> On Thu, Nov 29, 2018 at 5:25 AM Stephen Jones <sjones@xxxxxxxxxxxxxxxx> wrote:
>>>>>     Hi all,
>>>>>
>>>>>     There is a discrepancy between what condor_q thinks is running and
>>>>>     what condor_status thinks is running. I run this set of commands to
>>>>>     see the difference.
>>>>>
>>>>>     # condor_status -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" > s
>>>>>     # condor_q -af ClusterId -af RequestCpus -constraint "JobStatus=?=2" | sort > q
>>>>>     # diff s q
>>>>>     1,4d0
>>>>>     < 1079641 8
>>>>>     < 1080031 8
>>>>>     < 1080045 8
>>>>>     < 1080321 1
>>>>>
>>>>>     See: condor_status has 4 jobs that don't actually exist in condor_q!?!
>>>>>
>>>>>     They've been there for days, since I had some Linux problems that
>>>>>     needed a reboot (not really related to HTCondor).
>>>>>
>>>>>     So I'm losing 25 slots due to this. How can I purge this stale
>>>>>     information from the HTCondor system, good and proper?
>>>>>
>>>>>     Cheers,
>>>>>
>>>>>     Ste
>>>
>>>
>>>
>>>
>>
>>
> -- 
> Tim Theisen
> Release Manager
> HTCondor & Open Science Grid
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin - Madison
> 4261 Computer Sciences and Statistics
> 1210 W Dayton St
> Madison, WI 53706-1685
> +1 608 265 5736
> 
> 
> 

