
Re: [HTCondor-users] jobs stuck; cannot get rid of them.



Hi all,

Does this happen newly with HTCondor 8.6? I've never seen it here with
HTCondor 8.4 on Debian 7, 8, and 9.

Best
Harald


On Monday, December 3, 2018 10:14:48 AM CET Oliver Freyermuth wrote:
> Hi all,
> 
> we also observe this regularly. Users complain that condor_userprio still
> accounts resources to them without any running jobs; I then go and check
> all nodes and find condor_starters running without any job.
> 
> I'm currently using:
> 
> for A in <allOurComputeNodes>; do
>     echo $A
>     ssh $A 'for P in $(pidof condor_starter); do
>         CHILD_CNT=$(ps --ppid $P --no-headers | wc -l)
>         if [ $CHILD_CNT -eq 0 ]; then
>             echo "HTCondor Bug"; pstree -p $P; kill $P
>         fi
>     done'
> done
> 
> to clean those up, but of course that may also catch starters that are just
> in file transfer, or waiting for new jobs (CLAIM_WORKLIFE).
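> 
> A slightly safer variant (an untested sketch on my side) would additionally
> skip starters that have only been around for a short time, so ones still in
> file transfer or briefly idle between jobs are less likely to be caught.
> The 3600 s threshold is a guess, not a measured value:
> 
> for P in $(pidof condor_starter); do
>     CHILD_CNT=$(ps --ppid $P --no-headers | wc -l)
>     # Seconds since this starter was started (procps "etimes" field)
>     AGE=$(ps -o etimes= -p $P | tr -d ' ')
>     if [ "$CHILD_CNT" -eq 0 ] && [ "$AGE" -gt 3600 ]; then
>         echo "HTCondor Bug"; pstree -p $P; kill $P
>     fi
> done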
> 
> It seems to be triggered when the compute node is busy (swapping, hanging)
> for a short while and does not give timely responses over the network. A
> better fix than the hack described above would be greatly appreciated.
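> 
> One knob that may be relevant here (an assumption on my part, not something
> we have verified): the starter's DC_CHILDALIVE keep-alive behaviour is tied
> to NOT_RESPONDING_TIMEOUT, so if the stalls are short, raising it on the
> execute nodes might paper over them. An untested sketch:
> 
> # Check the current value first
> condor_config_val NOT_RESPONDING_TIMEOUT
> # Then, e.g. in the execute nodes' local config:
> #   NOT_RESPONDING_TIMEOUT = 7200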
> 
> Cheers,
> 	Oliver
> 
> On 29.11.18 at 21:13, Collin Mehring wrote:
> > Hi Stephen,
> > 
> > We ran into this too. In our case, the condor_starter processes that had
> > been handling those jobs didn't exit properly and were still running.
> > Connecting to the host and killing the stuck condor_starter processes
> > fixed the issue.
> > 
> > Alternatively, restarting Condor on the hosts will also get rid of
> > anything still running and update the collector.
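> > 
> > For example (a sketch, with <hostname> standing in for the affected
> > execute node):
> > 
> > # Restart only the startd (and any starters under it) on one node
> > condor_restart -startd -name <hostname>
> > # Or restart all HTCondor daemons on that node
> > condor_restart -name <hostname>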
> > 
> > Hope that helps,
> > Collin
> > 
> > More information for the curious:
> > 
> > Here's the end of the StarterLog for one of the affected slots:
> > 08/28/18 18:51:52 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at /<IP removed>/ (try 1 of 3): SECMAN:2003:TCP connection to daemon at /<IP removed>/ failed.
> > 08/28/18 18:53:34 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
> > 08/28/18 18:53:34 Process exited, pid=43481, status=0
> > 
> > The pid listed was for the job running on that slot, which successfully
> > exited and finished elsewhere.
> > 
> > We noticed this happening because it was affecting the group accounting
> > during negotiation. The negotiator would allocate the correct number of
> > slots using the number of jobs from the Schedd, but would then skip
> > negotiation for that group because it used the incorrect number of jobs
> > from the collector when determining current resource usage.
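> > 
> > For anyone checking whether they are affected: comparing the accounted
> > usage against what the schedd actually runs shows the mismatch. A sketch,
> > assuming a reasonably recent condor_userprio:
> > 
> > # Accumulated and current usage per submitter, including groups
> > condor_userprio -all
> > # What the schedd itself thinks is running
> > condor_q -allusers -constraint 'JobStatus == 2' -af Owner RequestCpus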
> > 
> > Here's an example where the submitter had only one pending 1-core job and
> > no running jobs, but there were two stuck slots with 32-core jobs from
> > that submitter:
> > 
> > 11/26/18 12:40:59 group quotas: group= prod./<group removed>/  quota= 511.931  requested= 1  allocated= 1  unallocated= 0
> > <...>
> > 11/26/18 12:40:59 subtree_usage at prod./<group removed>/ is 64
> > <...>
> > 11/26/18 12:41:01 Group prod./<group removed>/ - skipping, at or over quota (quota=511.931) (usage=64) (allocation=1)
> > On Thu, Nov 29, 2018 at 5:25 AM Stephen Jones <sjones@xxxxxxxxxxxxxxxx> wrote:
> >     Hi all,
> >     
> >     There is a discrepancy between what condor_q thinks is running and
> >     what condor_status thinks is running. I run this set of commands to
> >     see the difference.
> >     
> >     # condor_status -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" > s
> >     # condor_q -af ClusterId -af RequestCpus -constraint "JobStatus=?=2" | sort > q
> >     
> >     # diff s q
> >     1,4d0
> >     < 1079641 8
> >     < 1080031 8
> >     < 1080045 8
> >     < 1080321 1
> >     
> >     See: condor_status has 4 jobs that don't actually exist in condor_q!?
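> >     
> >     A hypothetical follow-up to locate which machines hold the stale
> >     slots, assuming the phantom slot ads still carry a JobId (substitute
> >     each of the four cluster IDs above):
> >     
> >     # condor_status -constraint 'JobId =?= "1079641.0"' -af Machine Name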
> >     
> >     They've been there for days, since I had some Linux problems that
> >     needed a reboot (not really related to HTCondor).
> >     
> >     So I'm losing 25 slots due to this. How can I purge this stale
> >     information from the HTCondor system, good and proper?
> >     
> >     Cheers,
> >     
> >     Ste