
Re: [Condor-users] the infamous question mark problem



OK, I think I am hitting this problem here:
https://lists.cs.wisc.edu/archive/condor-users/2005-March/msg00379.shtml


I see exactly the same symptoms. I just rebooted a grid node, and it
says it's "Claimed" but Activity is "Idle", and there is nothing running
on that box.
I think I need to set up multiple schedulers. A couple of questions:
Can I run multiple schedulers on the same box? My box is a 16-core,
96GB RAM system.





On Fri, Mar 26, 2010 at 1:21 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
> On Fri, Mar 26, 2010 at 12:44 PM, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
>> Mag,
>>
>>> Once over 1000 jobs hit the pool, I start to see the question marks.
>>> Is there some setting I can look at to fix this?
>>
>> Just had a discussion here about this, and we have a number of questions.
>>
>> 1. What version of Condor are you running?  A recent performance enhancement
>> could possibly be malfunctioning and causing the problems.
>
> The version we are running is 7.2.4
>
>>
>> 2. Do you know what the jobs are doing during these "events"?  Is there a
>> pattern to them?  For example, when you run your 'condor_q -run', do you
>> sometimes see all jobs good, and on other runs a grouping of '??????' jobs?
>
> These jobs are heterogeneous: a mix of simple awk, perl, R, and
> Octave.
>
>>
>> 3. I think that it'd be helpful if you could post the following:
>> 3a. job log snippet(s) around the window in which you've seen the problem
>> 3b. ShadowLog snippet(s) of the same
>>
>> Finally, some observations and a window into our thoughts:
>>
>> 1. When you run 'condor_q -run', it's equivalent to running:
>>  condor_q -const 'JobStatus==2' -format ...
>
> I will try this when the problem occurs. This usually occurs when the
> other department lets us use their systems for overnight simulations.
>
>>
>> 2. It's possible that there's a race condition in which the job's status
>> (JobStatus) has been set to RUNNING (2) without the RemoteHost attribute being
>> set.  This should never happen, but it obviously is happening.  The answers to the above
>> questions may help us to isolate how this is happening.
>>
>> Thanks Mag,
>>
>> -Nick
>>
>> --
>>           <<< Welcome to the real world. >>>
>>  /`-_    Nicholas R. LeRoy               The Condor Project
>> {     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
>>  \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
>>  |_*_|   608-265-5761                    Department of Computer Sciences
>>
>
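
For reference, here is a minimal sketch of one way to look for the condition Nick describes in point 2 (JobStatus set to 2, RUNNING, with no RemoteHost attribute). It assumes the job ads have been dumped as text in the "Attribute = value" form that 'condor_q -long' prints, with a blank line between ads. The function names and the sample data are hypothetical and made up for illustration; none of this comes from Condor itself, and I have not run it against a live pool.

```python
# Sketch only: scan text in "Attribute = value" form (one ad per
# blank-line-separated block) for jobs that are marked running but have
# no RemoteHost. Helper names and SAMPLE are hypothetical.

RUNNING = "2"  # JobStatus value for a running job, per the thread


def parse_long_ads(text):
    """Split the dump into one dict of attribute -> raw value per job ad."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current ad, if one is in progress.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def running_without_remote_host(ads):
    """Return the ads with JobStatus == 2 and no RemoteHost attribute."""
    return [
        ad for ad in ads
        if ad.get("JobStatus") == RUNNING and "RemoteHost" not in ad
    ]


def job_ids(ads):
    """Format each ad as 'ClusterId.ProcId'."""
    return ["{}.{}".format(ad["ClusterId"], ad["ProcId"]) for ad in ads]


# Made-up sample: 101.0 is running normally, 102.0 is running with no
# RemoteHost (the suspected race), 103.0 is idle.
SAMPLE = """ClusterId = 101
ProcId = 0
JobStatus = 2
RemoteHost = "slot1@node01.example.org"

ClusterId = 102
ProcId = 0
JobStatus = 2

ClusterId = 103
ProcId = 0
JobStatus = 1
"""

if __name__ == "__main__":
    suspects = running_without_remote_host(parse_long_ads(SAMPLE))
    print(job_ids(suspects))
```

Running this periodically against saved output while the question marks are showing would give a list of job IDs to match against the job log and ShadowLog snippets requested in 3a and 3b.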