Re: [HTCondor-users] jobs stuck; cannot get rid of them.

Don't worry - I found a way. For the record: Get the slot name and the machine:

# condor_status -af Name -af Machine -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" | grep 1080321
slot1_5@xxxxxxxxxxxxxxxxxxxx r26-n02.ph.liv.ac.uk 1080321 1

Go to the machine and get the PID:

# ps -ef | grep condor_ | grep "slot1_1 "
condor   12333 11661  0 Nov19 ?        00:00:05 condor_starter -f -a slot1_1 igrid5.ph.liv.ac.uk
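
For the record, a minimal sketch of that PID lookup as a shell function. It assumes the slot name is the value of the starter's -a argument, as in the ps output above. The name starter_pid is made up for illustration. It compares the slot name exactly, so asking for slot1_1 does not also pick up slot1_10.

```shell
# Read `ps -ef` output on stdin and print the PID (second column) of the
# condor_starter whose "-a" argument equals the slot name given as $1.
# starter_pid is a hypothetical helper, not an HTCondor tool.
starter_pid() {
    awk -v slot="$1" '
        /condor_starter/ {
            # Look for the pair "-a <slot>" anywhere in the command line.
            for (i = 1; i < NF; i++) {
                if ($i == "-a" && $(i + 1) == slot) {
                    print $2
                    break
                }
            }
        }'
}
```

On the worker node this would be run as `ps -ef | starter_pid slot1_1`; the number it prints is the PID of the starter.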

Kill that process; job done. Cheers,


On 29/11/18 13:24, Stephen Jones wrote:
Hi all,

There is a discrepancy between what condor_q thinks is running and what condor_status thinks is running. I run this set of commands to see the difference.

# condor_status -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" > s
# condor_q -af ClusterId -af RequestCpus -constraint "JobStatus=?=2" | sort > q
# diff s q
< 1079641 8
< 1080031 8
< 1080045 8
< 1080321 1

See; condor_status has 4 jobs that actually don't exist in condor_q !?!
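
The same comparison can be sketched with comm instead of diff, so that only the entries missing from condor_q are printed, without the "<" markers. This is a rough sketch under the assumption that both files hold "id cpus" pairs, one per line, as produced by the commands above. The name stale_claims is made up for illustration.

```shell
# Print the lines that are in the first file but not in the second.
# $1: "JobId Cpus" lines taken from condor_status
# $2: "ClusterId RequestCpus" lines taken from condor_q
# stale_claims is a hypothetical helper, not an HTCondor tool.
stale_claims() {
    sorted_status=$(mktemp)
    sorted_queue=$(mktemp)
    # comm needs both inputs sorted with the same collation.
    LC_ALL=C sort "$1" > "$sorted_status"
    LC_ALL=C sort "$2" > "$sorted_queue"
    # -2 drops lines unique to the queue file, -3 drops lines common to both.
    LC_ALL=C comm -23 "$sorted_status" "$sorted_queue"
    rm -f "$sorted_status" "$sorted_queue"
}
```

With the two files written above, `stale_claims s q` would list the claims that condor_status still reports but condor_q no longer knows about.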

They've been there for days, since I had some Linux problems that needed a reboot (not very related to htcondor.)

So I'm losing 25 slots, due to this. How can I purge this stale information from the HTCondor system, good and proper?
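
As a quick check of that figure, here is a small sketch that adds up the Cpus column of the diff output. It assumes the 25 is the sum of that column (8 + 8 + 8 + 1) and that the lines of interest start with "<". The name lost_cpus is made up for illustration.

```shell
# Read `diff s q` output on stdin and print the total of the Cpus column
# for the lines that exist only in the condor_status file (marked "<").
# lost_cpus is a hypothetical helper, not an HTCondor tool.
lost_cpus() {
    # Fields on a "<" line are: "<", job id, cpus.
    awk '/^</ { sum += $3 } END { print sum + 0 }'
}
```

It would be used as `diff s q | lost_cpus`.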



Steve Jones                             sjones@xxxxxxxxxxxxxxxx
Grid System Administrator               office: 220
High Energy Physics Division            tel (int): 43396
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 3396
University of Liverpool                 http://www.liv.ac.uk/physics/hep/