
Re: [Condor-users] the infamous question mark problem



Back in January I also submitted a query about this problem (*1). We "solved" it by backing off to Condor 6.4.8 (from 7.4.0). I'm in the process of upgrading to 7.4.1, and wondering whether setting:
NEGOTIATOR_INFORM_STARTD = False         (*2)

will be the fix?
 **** Should this be set just in ~/etc/condor.config?


(*1)From: dalonso <dalonso@xxxxxxxxxxxxxxxx>
Date: January 26, 2010 10:15:14 AM PST
To: condor-users@xxxxxxxxxxx
Subject: claimed slots are idle

(*2) From: Dan Bradley <dan@xxxxxxxxxxxx>
Date: February 24, 2010 7:56:12 AM PST
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Subject: Re: [Condor-users] Condor 7.2.4 / 7.4.1 — "Can't find resource with ClaimId" errors from startd


On Mar 26, 2010, at 3:47 PM, Mag Gam wrote:

OK, I think I am hitting this problem here:
https://lists.cs.wisc.edu/archive/condor-users/2005-March/msg00379.shtml


I see exactly the same symptoms. I just rebooted a grid node and it
says it's "Claimed" but Activity is "Idle", and there is nothing running
on that box.
I think I need to set up multiple schedulers. A couple of questions:
Can I run multiple schedulers on the same box? My box is a 16-core,
96GB RAM system.





On Fri, Mar 26, 2010 at 1:21 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
On Fri, Mar 26, 2010 at 12:44 PM, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
Mag,

Once over 1000 jobs hit the pool, I start to see the question marks.
Is there some setting I can look at to fix this?

Just had a discussion here about this, and we have a number of questions.

1. What version of Condor are you running? A recent performance enhancement
could possibly be malfunctioning and causing the problems.

The version we are running is 7.2.4


2. Do you know what the jobs are doing during these "events"? Is there a pattern to them? For example, when you run your 'condor_q -run', do you sometimes see all jobs good, and on other runs a grouping of '??????' jobs?

These jobs are heterogeneous. Some of them use simple awk,
Perl, R, and Octave.


3. I think that it'd be helpful if you could post the following:
3a. job log snippet(s) around the window in which you've seen the problem
3b. ShadowLog snippet(s) of the same

Finally, some observations and a window into our thoughts:

1. When you run 'condor_q -run', it's equivalent to running:
 condor_q -const 'JobStatus==2' -format ...

I will try this when the problem occurs. This usually occurs when the
other department lets us use their systems for overnight simulations.


2. It's possible that there's a race condition in which the job's status (JobStatus) has been set to RUNNING (2) without the RemoteHost attribute being set. This should never happen, but it evidently does. The answers to the above questions may help us to isolate how this is happening.
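
[Editor's sketch] One way to look for the suspected state (JobStatus of 2 with no RemoteHost) is to scan job ads saved as text. The Python below is only a minimal illustration and assumes the ads were dumped in the "Attr = value" layout with blank lines between jobs, as 'condor_q -long' prints them. The function names, the sample ads, and the host name are made up for the example and are not part of Condor.

```python
import re


def parse_long_ads(text):
    """Split 'Attr = value' text (one blank-line-separated block per job) into dicts."""
    ads = []
    for block in re.split(r"\n\s*\n", text.strip()):
        ad = {}
        for line in block.splitlines():
            if " = " in line:
                name, value = line.split(" = ", 1)
                ad[name.strip()] = value.strip()
        if ad:
            ads.append(ad)
    return ads


def running_without_host(ads):
    """Return (ClusterId, ProcId) for ads marked running (JobStatus 2) with no RemoteHost."""
    return [
        (ad.get("ClusterId"), ad.get("ProcId"))
        for ad in ads
        if ad.get("JobStatus") == "2" and "RemoteHost" not in ad
    ]


# Hypothetical sample data: three jobs, one running with a host,
# one running without a host, one idle.
SAMPLE = """\
ClusterId = 101
ProcId = 0
JobStatus = 2
RemoteHost = "slot1@node01.example.org"

ClusterId = 101
ProcId = 1
JobStatus = 2

ClusterId = 102
ProcId = 0
JobStatus = 1
"""

print(running_without_host(parse_long_ads(SAMPLE)))
```

Any job this reports would be a candidate for the '??????' rows, so its job log and ShadowLog entries are the ones worth posting.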

Thanks Mag,

-Nick

--
          <<< Welcome to the real world. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences




Darwin O.V. Alonso
dalonso@xxxxxxxxxxxxxxxx
Dept. Biochem. J558(HSB)
University of Washington
1705 NE Pacific St
Seattle WA 98195-7350