[Condor-users] unable to remove jobs stuck in X state

Here's the thing: a normally successful user has submitted 3 vanilla jobs (4461, 4462 & 4463), each of ~180 processes. The first two had bad inputs and were removed with condor_rm. They are now stuck in the X state, with the 4463.xxx jobs sitting in Idle. Trying to remove the jobs with -forcex is unsuccessful:

$ condor_rm -debug -forcex 4461

8/5 11:36:21 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.

8/5 11:36:21 IO: Failed to read packet header

8/5 11:36:41 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.

8/5 11:36:41 IO: Failed to read packet header

8/5 11:36:41 AUTHENTICATE: handshake failed!

8/5 11:36:41 DCSchedd: authentication failure: AUTHENTICATE:1002:Failure performing handshake

AUTHENTICATE:1002:Failure performing handshake

Couldn't find/remove all jobs in cluster 4461.

And analysis of the Idle jobs isn't much clearer:

$ condor_q 4463.1 -better-analyze

-- Quill: quill@xxxxxxxxxxxxxxxxxxxx : <xxx.xxx.147.62:5432> : quill---

4463.001:  Run analysis summary.  Of 49 machines,

      0 are rejected by your job's requirements

      1 reject your job because of their own requirements

      0 match but are serving users with a better priority in the pool

     48 match but reject the job for unknown reasons

      0 match but will not currently preempt their existing job

      0 are available to run your job

I admit that we have standard network cabling connecting the nodes (1 master, 8 nodes, 48 slots), so it might be crap IO, although this hasn't prevented jobs from running over the last couple of years.

Does anyone have any pointers for investigating this?



[Condor 7.0.5 running on Rocks 5.1]

