
[Condor-users] unable to remove jobs stuck in X state



Here's the thing: a normally successful user has submitted 3 vanilla jobs (clusters 4461, 4462 & 4463), each of ~180 processes. The first two had bad inputs and were removed with condor_rm. They are now stuck in the X state, with the 4463.xxx jobs sitting in Idle. Trying to force-remove the jobs with -forcex is unsuccessful:

$ condor_rm -debug -forcex 4461

8/5 11:36:21 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.

8/5 11:36:21 IO: Failed to read packet header

8/5 11:36:41 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.

8/5 11:36:41 IO: Failed to read packet header

8/5 11:36:41 AUTHENTICATE: handshake failed!

8/5 11:36:41 DCSchedd: authentication failure: AUTHENTICATE:1002:Failure performing handshake

AUTHENTICATE:1002:Failure performing handshake

Couldn't find/remove all jobs in cluster 4461.

and analysis of the Idle jobs isn't much clearer:

$ condor_q 4463.1 -better-analyze

-- Quill: quill@xxxxxxxxxxxxxxxxxxxx : <xxx.xxx.147.62:5432> : quill---

4463.001:  Run analysis summary.  Of 49 machines,

      0 are rejected by your job's requirements

      1 reject your job because of their own requirements

      0 match but are serving users with a better priority in the pool

     48 match but reject the job for unknown reasons

      0 match but will not currently preempt their existing job

      0 are available to run your job
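
For anyone wanting to tally a summary like the one above across many jobs, here is a minimal sketch that parses the pasted text and checks that the per-reason counts add up to the machine total. The helper name parse_summary and the regular expressions are my own assumptions based only on the output shown here, not on any documented Condor output format.

```python
import re

# Summary text copied from the condor_q -better-analyze output above.
SUMMARY = """\
4463.001:  Run analysis summary.  Of 49 machines,
      0 are rejected by your job's requirements
      1 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
     48 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {reason: count}) from a pasted summary.

    Hypothetical helper: the line layout is assumed from the text above.
    """
    total = None
    counts = {}
    for line in text.splitlines():
        # Header line carries the machine total ("Of 49 machines,").
        header = re.search(r"Of (\d+) machines", line)
        if header:
            total = int(header.group(1))
            continue
        # Remaining lines look like "<count> <reason text>".
        row = re.match(r"\s*(\d+)\s+(.+\S)\s*$", line)
        if row:
            counts[row.group(2)] = int(row.group(1))
    return total, counts


total, counts = parse_summary(SUMMARY)
for reason, count in counts.items():
    if count:
        print(f"{count:3d} of {total}: {reason}")
```

Only the non-zero rows are printed, which makes it easier to see that nearly every machine falls into the "unknown reasons" bucket.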

I admit that we have standard network cabling connecting the nodes (1 master, 8 nodes, 48 slots), so it might be crap IO, although this hasn't prevented jobs from running over the last couple of years.

Does anyone have any pointers for investigating this?



Health Protection Agency


[Condor 7.0.5 running on Rocks 5.1]
