
[HTCondor-users] Can't remove stuck jobs (Re: SECMAN:2007:Failed to end classad message.)



PS:

condor_q -global -debug

prints the queue from the 1st submit host right away, then hangs for 20+
seconds, then prints

06/19/13 11:52:18 condor_read(): timeout reading 5 bytes from schedd at <my.ip:31286>.
06/19/13 11:52:18 IO: Failed to read packet header
06/19/13 11:52:18 SECMAN: no classad from server, failing

-- Failed to fetch ads from: <my.ip:31286> : 2nd.submit.host
SECMAN:2007:Failed to end classad message.


There are a bunch of jobs that were condor_rm'ed but are stuck somewhere:
ps -AF shows a lot of processes like


condor_scheduniv_exec.238602.0 -f -l . -Lockfile moldag3.dag.lock
-AutoRescue 1 -DoRescueFrom 0 -Dag moldag3.dag -Suppress_notification
-CsdVersion $CondorVersion: 8.0.0 May 29 2013 BuildID: 133173 $ -Dagman
/usr/bin/condor_dagman -Update_submit


If I stop condor they go away, and they come back when I start it again.
condor_rm on them fails with the same "SECMAN:2007:Failed to end classad message."
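
A rough sketch for pulling the job IDs out of that ps output, to see which
jobs the leftover processes belong to. It assumes the suffix after
"condor_scheduniv_exec." is the job's cluster.proc ID (238602.0 in the line
above); stuck_job_ids is a made-up helper name, not a Condor tool.

```python
import re

# Assumption: the process name has the form
# condor_scheduniv_exec.<cluster>.<proc>
JOB_RE = re.compile(r"condor_scheduniv_exec\.(\d+)\.(\d+)")


def stuck_job_ids(ps_output):
    """Return the unique 'cluster.proc' IDs found in ps output, sorted numerically."""
    ids = {(int(cluster), int(proc)) for cluster, proc in JOB_RE.findall(ps_output)}
    return [f"{cluster}.{proc}" for cluster, proc in sorted(ids)]


# Example with the line quoted above (normally the text would come from `ps -AF`).
sample = "condor_scheduniv_exec.238602.0 -f -l . -Lockfile moldag3.dag.lock"
print(stuck_job_ids(sample))  # ['238602.0']
```

The resulting IDs are in the same cluster.proc form that condor_q and
condor_rm take as arguments.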

How do I clean this up?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
