
Re: [Condor-users] unable to remove jobs stuck in X state



First thing: Condor 7.0.5 is almost three years old. Please upgrade to a newer version, at least 7.4 and preferably 7.6.

The InstanceLock message in the MasterLog comes from the condor_master preventing two copies of itself with the same configuration from running at once. Normally condor_master cleans up the InstanceLock itself; if no condor_master is running, you can remove it by hand.
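For example (a sketch only; the lock path is taken from your log below, and removing it is only safe once you've confirmed that no condor_master is actually running):

$ ps aux | grep condor_master                 # confirm no master is running
$ rm /tmp/condor-lock.queen/InstanceLock      # remove the stale lock by hand

A freshly started condor_master should then obtain the lock cleanly.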

As for the jobs stuck in the X state: that could be a number of things, including bugs fixed over the last three years. Again, consider 7.6.

Best,


matt

On 08/08/2011 10:32 AM, Steven Platt wrote:
Another update...

I'm still getting this problem. A couple of jobs (similar spec to those below) ran smoothly this morning, then one started accumulating run time with no activity on the nodes. condor_rm put the jobs into the 'X' state and now they're stuck there again.

Examination of the MasterLog indicates something similar to what's reported in https://www-auth.cs.wisc.edu/lists/condor-users/2008-February/msg00044.shtml, but there's only one instance of Condor running (confirmed by 'ps aux | grep condor' on both the head node and the compute nodes), with each daemon writing to log files on its local machine.

$ tail -n 50 MasterLog

8/8 15:15:53 ******************************************************
8/8 15:15:53 ** condor_master (CONDOR_MASTER) STARTING UP
8/8 15:15:53 ** /opt/condor/sbin/condor_master
8/8 15:15:53 ** $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $
8/8 15:15:53 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
8/8 15:15:53 ** PID = 1626
8/8 15:15:53 ** Log last touched 8/8 15:15:53
8/8 15:15:53 ******************************************************
8/8 15:15:53 Using config source: /home/condor/condor_config
8/8 15:15:53 Using local config sources:
8/8 15:15:53    /opt/condor/etc/condor_config.local
8/8 15:15:53 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
8/8 15:15:53 ERROR "Can't get lock on "/tmp/condor-lock.queen/InstanceLock"" at line 848 in file master.C
8/8 15:18:53 ******************************************************
8/8 15:18:53 ** condor_master (CONDOR_MASTER) STARTING UP
8/8 15:18:53 ** /opt/condor/sbin/condor_master
8/8 15:18:53 ** $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $
8/8 15:18:53 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
8/8 15:18:53 ** PID = 1894
8/8 15:18:53 ** Log last touched 8/8 15:18:23
8/8 15:18:53 ******************************************************
8/8 15:18:53 Using config source: /home/condor/condor_config
8/8 15:18:53 Using local config sources:
8/8 15:18:53    /opt/condor/etc/condor_config.local
8/8 15:18:53 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
8/8 15:18:53 ERROR "Can't get lock on "/tmp/condor-lock.queen/InstanceLock"" at line 848 in file master.C
8/8 15:18:53 ******************************************************
8/8 15:18:53 ** condor_master (CONDOR_MASTER) STARTING UP
8/8 15:18:53 ** /opt/condor/sbin/condor_master
8/8 15:18:53 ** $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $
8/8 15:18:53 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
8/8 15:18:53 ** PID = 1899
8/8 15:18:53 ** Log last touched 8/8 15:18:53
8/8 15:18:53 ******************************************************
8/8 15:18:53 Using config source: /home/condor/condor_config
8/8 15:18:53 Using local config sources:
8/8 15:18:53    /opt/condor/etc/condor_config.local
8/8 15:18:53 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
8/8 15:18:53 ERROR "Can't get lock on "/tmp/condor-lock.queen/InstanceLock"" at line 848 in file master.C

I'd really appreciate any help, as this is crippling our cluster.

Thanks

Steve

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Platt
Sent: 05 August 2011 17:29
To: Condor-Users Mail List
Subject: Re: [Condor-users] unable to remove jobs stuck in X state

Update...

Deleting the job_queue.log (and all the clusterX.procX.* directories) from the SPOOL directory on the submit machine and then restarting the condor_master clears everything from the queue, although it also resets job numbering and history back to 1. The procedure amounts to roughly the commands below.
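(A sketch of that sledgehammer approach, assuming the daemons are stopped first and SPOOL is located via condor_config_val:)

$ condor_off                          # stop the daemons (the master itself keeps running)
$ cd $(condor_config_val SPOOL)       # find the spool directory from the config
$ rm job_queue.log                    # discard the persistent job queue
$ rm -rf cluster*.proc*               # discard the per-job spool directories
$ condor_on                           # bring the daemons back up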

While this works, it seems like a real sledgehammer tactic for something that can probably be done more selectively. Any ideas?

Steve
________________________________________
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steven Platt
Sent: 05 August 2011 11:58
To: Condor-Users Mail List
Subject: [Condor-users] unable to remove jobs stuck in X state

Hello,
Here's the thing... A normally successful user submitted 3 vanilla jobs (4461, 4462 & 4463), each of ~180 processes. The first two had bad inputs and were condor_rm'd. They are now stuck in the X state, with the 4463.xxx jobs sitting in Idle. Trying to remove the stuck jobs with -forcex is unsuccessful...

$ condor_rm -debug -forcex 4461
8/5 11:36:21 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.
8/5 11:36:21 IO: Failed to read packet header
8/5 11:36:41 condor_read(): timeout reading 5 bytes from <xxx.xxx.147.62:45392>.
8/5 11:36:41 IO: Failed to read packet header
8/5 11:36:41 AUTHENTICATE: handshake failed!
8/5 11:36:41 DCSchedd: authentication failure: AUTHENTICATE:1002:Failure performing handshake
AUTHENTICATE:1002:Failure performing handshake
Couldn't find/remove all jobs in cluster 4461.

...and analysis of the Idle jobs isn't much clearer...

$ condor_q 4463.1 -better-analyze
-- Quill: quill@xxxxxxxxxxxxxxxxxxxx : <xxx.xxx.147.62:5432> : quill ---
4463.001:  Run analysis summary.  Of 49 machines,
       0 are rejected by your job's requirements
       1 reject your job because of their own requirements
       0 match but are serving users with a better priority in the pool
      48 match but reject the job for unknown reasons
       0 match but will not currently preempt their existing job
       0 are available to run your job

I admit that we have standard network cabling connecting the nodes (1 master, 8 nodes, 48 slots), so it might just be poor I/O, although that hasn't prevented jobs from running over the last couple of years.
Does anyone have any pointers for investigating this?

Thanks

Steve
Health Protection Agency
UK
[Condor 7.0.5 running on Rocks 5.1]
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/