
Re: [condor-users] Can't Remove Jobs from the Queue


I'm running Condor 6.5.5. Both vanilla and Java universe jobs are affected. The situation sometimes goes as follows: a job is running on a machine, and after some time I notice that the machine is Unclaimed or running another job, yet condor_q still reports the original job as running on that machine. If I then try to remove the job, which isn't actually running, it stays in state X forever.

Sometimes I wonder whether this happens because the machine that serves as my central manager and collector is running a firewall. However, TCP and UDP ports 9614 and 9618 are open, as is a large range of ports for the shadows. Some machines also disappear from condor_status for a while and then may come back. I could not find anything strange in the logs of those machines. Only the collector log on my central manager has lots of entries like: Removing stale ad, Inserting ad, Removing stale ad, Inserting...
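
One way to double-check the firewall theory from another machine in the pool is a small TCP probe of the two ports mentioned above. This is only a minimal sketch and not a Condor tool: the function names are made up for illustration, the host name in the usage note is a placeholder, and the probe covers TCP only, so UDP traffic to the same ports would have to be verified separately.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Probe each port in turn and return a {port: reachable} mapping."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Calling check_ports("central-manager.example.org", (9614, 9618)) from an execute machine, with the real central manager name substituted, returns a dictionary saying which of the two ports accepted a connection. A False entry only shows that the TCP connection failed from that machine. It does not by itself explain the stale ads in the collector log.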

Alexander Klyubin

Nick LeRoy wrote:
On Wednesday 22 October 2003 1:03 pm, Filip Defoort wrote:

I've actually seen this behaviour also (using 6.4.7)... Only thing I
found was to do a

- filip

Alexander Klyubin wrote:


When I run condor_rm for a particular job it is marked as X and then
quickly removed from the queue. That's the normal expected behavior.
However, sometimes the job remains in the queue in this state X
forever till I restart Condor on the machine where the queue is located.

My question is, how do I remove the stale jobs from the queue without
restarting Condor? Any ideas?

Could you tell us more about the characteristics of the jobs that are hanging around? Are they all of some particular universe (standard, vanilla, ...)? Have you tried to reproduce this behaviour in 6.5.5? Have you looked in the ShadowLog / SchedLog for any clues?


