
[Condor-users] Jobs stuck in the removing (JobStatus == 3) state



I've got a user who somehow managed to get a handful of jobs stuck in the
JobStatus == 3 (X) state. No amount of condor_rm'ing has been able to
get them out of the queue. We're running Quill, and I think the
problem may just be that Quill is hung and not updating the status,
but I can't get -direct schedd on condor_q to work, so I can't verify
this.

Here are the jobs as Quill reports them:

/ttcbatch> /opt/condor/bin/condor_q -const "JobStatus == 3" -direct quilld


-- Submitter: quill-sj-schedd1.altera.com@xxxxxxxxxxxxxxxxxxxxx : <137.57.202.107:40428> : sj-schedd1.altera.com
 ID        OWNER           SUBMITTED     RUN_TIME ST PRI SIZE   CMD
85728.0    pkazaria        5/21 16:49   0+00:00:00 X  0   1074.2 wrapper.pl /experi
85728.33   pkazaria        5/21 16:49   0+00:00:00 X  0   898.4  wrapper.pl /experi
85728.35   pkazaria        5/21 16:49   0+00:00:00 X  0   878.9  wrapper.pl /experi
85728.50   pkazaria        5/21 16:49   0+00:00:01 X  0   58.6   wrapper.pl /experi
91944.63   pkazaria        5/24 09:10   0+00:02:07 X  0   712.9  wrapper.bat /exper

And if I try to get this info straight from the schedd I get:

/ttcbatch> /opt/condor/bin/condor_q -const "JobStatus == 3" -direct schedd

-- Failed to fetch ads from: <137.57.202.107:52744> : sj-schedd1.altera.com

And condor_rm says:

/ttcbatch> /opt/condor/bin/condor_rm -const "JobStatus == 3"
AUTHENTICATE:1002:Failure performing handshake
Couldn't find/remove all jobs matching constraint (JobStatus == 3)

And my SchedLog is littered with:

/ttcbatch> tail /build/condor/log/SchedLog
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0
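
To see whether every stuck job is hitting this or only 85728.0, the failures can be tallied per job. This is a minimal sketch that assumes only the line format shown in the tail output above; the function name and the regular expression are my own and not part of Condor.

```python
import re
from collections import Counter

# Pattern assumed from the SchedLog lines above, e.g.
# "5/25 12:36:55 OwnerCheck(root) failed in SetAttribute for job 85728.0"
OWNER_CHECK = re.compile(
    r"OwnerCheck\((?P<user>[^)]+)\) failed in SetAttribute "
    r"for job (?P<job>\d+\.\d+)"
)


def count_owner_check_failures(lines):
    """Count OwnerCheck failures, keyed by (user, job id).

    `lines` is any iterable of log lines, such as an open file object.
    Lines that do not match the assumed pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = OWNER_CHECK.search(line)
        if match:
            counts[(match.group("user"), match.group("job"))] += 1
    return counts
```

Calling it as `count_owner_check_failures(open("/build/condor/log/SchedLog"))` and printing the result's `most_common()` would list the users and job IDs that fail the check most often.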

I'd rather not restart my schedd. Is there a way to clear out this
problem that doesn't require a schedd restart?

- Ian

--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300