
Re: [HTCondor-users] jobs stuck; cannot get rid of them.



Hi all,

we also observe this regularly. Users complain that condor_userprio still accounts resources to them even though they have no running jobs;
when I then check all the nodes, I find condor_starter processes running without any job.

I'm currently using:

for A in <allOurComputeNodes>; do echo $A; ssh $A 'for P in $(pidof condor_starter); do CHILD_CNT=$(ps --ppid $P --no-headers | wc -l); if [ $CHILD_CNT -eq 0 ]; then echo "HTCondor Bug"; pstree -p $P; kill $P; fi; done'; done

to clean those up, but of course that may also catch starters which are just in the middle of a file transfer, or waiting for new jobs (CLAIM_WORKLIFE); a slightly safer variant is sketched further down.

It seems to be triggered when a compute node is busy (swapping, hanging) for a short while and does not respond on the network in time.
A better fix than the hack described above would be greatly appreciated.
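
In the meantime, a slightly less aggressive variant of the hack (just a sketch, assuming procps-ng on the nodes so that "ps -o etimes=" works) only kills childless starters that have already existed for a few minutes, which at least spares starters that are only briefly without a child, e.g. during file transfer:

for A in <allOurComputeNodes>; do
  echo "$A"
  ssh "$A" '
    MIN_AGE=600   # seconds; leave younger childless starters alone
    for P in $(pidof condor_starter); do
      CHILD_CNT=$(ps --ppid "$P" --no-headers | wc -l)
      AGE=$(ps -o etimes= -p "$P" | tr -d " ")
      if [ "$CHILD_CNT" -eq 0 ] && [ "${AGE:-0}" -ge "$MIN_AGE" ]; then
        echo "HTCondor Bug"
        pstree -p "$P"
        kill "$P"
      fi
    done'
done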

Cheers,
	Oliver

On 29.11.18 at 21:13, Collin Mehring wrote:
> Hi Stephen,
> 
> We ran into this too. In our case the condor_starter process that was handling each of those jobs didn't exit properly and was still running. Connecting to the host and killing the stuck condor_starter process fixed the issue.
> 
> Alternatively, restarting Condor on the hosts will also get rid of anything still running and update the collector.
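> 
> Something along these lines works on the affected execute host (just a sketch; exact service and option names may differ with your installation):
> 
>     # kill only the orphaned starter (PID taken from pstree/ps on that host)
>     kill <pid-of-stuck-condor_starter>
> 
>     # or restart HTCondor on the host entirely, which also gets rid of anything
>     # still running and updates the collector
>     condor_restart
>     # (with systemd-based packages, "systemctl restart condor" is equivalent)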
> 
> Hope that helps,
> Collin
> 
> More Information for the curious:
> 
> Here's the end of the StarterLog for one of the affected slots:
> 08/28/18 18:51:52 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at /<IP removed>/ (try 1 of 3): SECMAN:2003:TCP connection to daemon at /<IP removed>/ failed.
> 08/28/18 18:53:34 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
> 08/28/18 18:53:34 Process exited, pid=43481, status=0
> 
> The pid listed was for the job running on that slot, which successfully exited and finished elsewhere.
> 
> We noticed this happening because it was affecting the group accounting during negotiation. The negotiator would allocate the correct number of slots based on the jobs reported by the schedd, but would then skip negotiation for that group because it computed the group's current resource usage from the stale slot ads in the collector.
> 
> Here's an example where the submitter had only one pending 1-core job and no running jobs, but there were two stuck slots with 32-core jobs from that submitter:
> 11/26/18 12:40:59 group quotas: group= prod./<group removed>/ quota= 511.931 requested= 1 allocated= 1 unallocated= 0
> <...>
> 11/26/18 12:40:59 subtree_usage at prod./<group removed>/ is 64
> <...>
> 11/26/18 12:41:01 Group prod./<group removed>/ - skipping, at or over quota (quota=511.931) (usage=64) (allocation=1)
> 
> On Thu, Nov 29, 2018 at 5:25 AM Stephen Jones <sjones@xxxxxxxxxxxxxxxx> wrote:
> 
>     Hi all,
> 
>     There is a discrepancy between what condor_q thinks is running, and what
>     condor_status thinks is running. I run this set of commands to see the
>     difference.
> 
>     # condor_status -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" > s
>     # condor_q -af ClusterId -af RequestCpus -constraint "JobStatus=?=2" | sort > q
>     # diff s q
>     1,4d0
>     < 1079641 8
>     < 1080031 8
>     < 1080045 8
>     < 1080321 1
> 
>     See: condor_status has 4 jobs that don't exist in condor_q at all!?
> 
>     They've been there for days, since I had some Linux problems that needed
>     a reboot (not really related to HTCondor).
> 
>     So I'm losing 25 slots due to this. How can I purge this stale
>     information from the HTCondor system, good and proper?
> 
>     Cheers,
> 
>     Ste
> 
> 
>     -- 
>     Steve Jones                     sjones@xxxxxxxxxxxxxxxx
>     Grid System Administrator       office: 220
>     High Energy Physics Division    tel (int): 43396
>     Oliver Lodge Laboratory         tel (ext): +44 (0)151 794 3396
>     University of Liverpool         http://www.liv.ac.uk/physics/hep/
> 
> 
> 
> 
> -- 
> Collin Mehring | PE-JoSE - Software Engineer
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 

