
Re: [Condor-users] the infamous question mark problem

On Fri, Mar 26, 2010 at 12:44 PM, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
> Mag,
>> Once over 1000 jobs hit the pool, I start to see the question marks.
>> Is there some setting I can look at to fix this?
> Just had a discussion here about this, and we have a number of questions.
> 1. What version of Condor are you running?  A recent performance enhancement
> could possibly be malfunctioning and causing the problems.

The version we are running is 7.2.4

> 2. Do you know what the jobs are doing during these "events"?  Is there a
> pattern to them?  For example, when you run your 'condor_q -run', do you
> sometimes see all jobs good, and on other runs a grouping of '??????' jobs?

The jobs are heterogeneous. Some run simple awk or perl scripts;
others use R or Octave.

> 3. I think that it'd be helpful if you could post the following:
> 3a. job log snippet(s) around the window in which you've seen the problem
> 3b. ShadowLog snippet(s) of the same
> Finally, some observations and a window into our thoughts:
> 1. When you run 'condor_q -run', it's equivalent to running:
>  condor_q -const 'JobStatus==2' -format ...

I will try this when the problem occurs. This usually occurs when the
other department lets us use their systems for overnight simulations.

> 2. It's possible that there's a race condition in which the job's status
> (JobStatus) has been set to RUNNING (2) without the RemoteHost attribute being
> set.  This should never happen, but it obviously does.  The answers to the above
> questions may help us to isolate how this is happening.
> Thanks Mag,
> -Nick
> --
>           <<< Welcome to the real world. >>>
>  /`-_    Nicholas R. LeRoy               The Condor Project
> {     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
>  \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
>  |_*_|   608-265-5761                    Department of Computer Sciences
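
To check the race condition described in point 2, one option is to dump the
full job ads with 'condor_q -long' while the question marks are showing and
look for jobs that have JobStatus 2 but no RemoteHost. The Python below is
only a rough sketch of that check. It assumes the dump has one
"Attr = value" pair per line with a blank line between jobs; the function
names and the sample ads are made up for illustration and are not output
from our pool.

```python
# Sketch: find jobs that claim to be running but have no RemoteHost.
# Assumes input shaped like `condor_q -long` output: one "Attr = value"
# pair per line, with a blank line between job ads.

RUNNING = 2  # JobStatus value for RUNNING, as stated in the quoted message


def parse_ads(text):
    """Split the text into a list of dicts, one dict per job ad."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current ad, if there is one.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:  # lines without " = " (e.g. headers) are ignored
            current[name.strip()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads


def running_without_host(ads):
    """Return 'Cluster.Proc' ids of running jobs that lack RemoteHost."""
    suspects = []
    for ad in ads:
        if ad.get("JobStatus") == str(RUNNING) and not ad.get("RemoteHost"):
            cluster = ad.get("ClusterId", "?")
            proc = ad.get("ProcId", "?")
            suspects.append(f"{cluster}.{proc}")
    return suspects


# Hypothetical sample data for illustration only.
SAMPLE = """\
ClusterId = 101
ProcId = 0
JobStatus = 2
RemoteHost = "slot1@node01.example.org"

ClusterId = 101
ProcId = 1
JobStatus = 2

ClusterId = 102
ProcId = 0
JobStatus = 1
"""

if __name__ == "__main__":
    print(running_without_host(parse_ads(SAMPLE)))
```

In the sample, only job 101.1 is flagged: it is marked running and has no
RemoteHost, while 102.0 is skipped because its JobStatus is not 2. Any job
ids this turns up on the real pool would tell us which parts of the job log
and ShadowLog to post.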