
Re: [Condor-users] the infamous question mark problem



Once more than 1000 jobs hit the pool, I start to see the question marks.
Is there a setting I can look at to fix this?



On Thu, Mar 25, 2010 at 8:32 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
> Any help would be appreciated.
>
> netstat shows no UDP packet loss, but I find it strange that Condor
> can't cope with 500 servers in the pool. Is there something else I
> should be looking into?
>
>
>
> On Fri, Mar 19, 2010 at 9:43 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
>> Any thoughts, Nick?
>>
>>
>> On Thu, Mar 18, 2010 at 8:31 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
>>> Hello Nick:
>>>
>>> Is there a way to clear this up? It seems to occur during off hours,
>>> when more servers come into my pool.
>>>
>>>
>>> On Wed, Mar 17, 2010 at 11:53 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
>>>> It seems restarting my collector did not help either.
>>>>
>>>> Any other suggestions?
>>>>
>>>> On Wed, Mar 17, 2010 at 11:43 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
>>>>> Minor storage problem = NFS hiccup.
>>>>>
>>>>> Well, the most recent jobs show [?????????????????????????????].
>>>>>
>>>>> I am going to restart my collector to see if this fixes the problem.
>>>>>
>>>>> condor_vacate <jid> gives me:
>>>>> Can't find address for startd <jobid>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 17, 2010 at 11:31 PM, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
>>>>>> On Wednesday 17 March 2010, Mag Gam wrote:
>>>>>>> Last week we had a minor storage problem in our pool. Since then, we
>>>>>>> see a lot of '???????' in the running host field when we do condor_q -run
>>>>>>> -direct schedd
>>>>>>>
>>>>>>> Is there a way to fix this? Some jobs show the proper hostname, but
>>>>>>> many show '???????'. Is there a way to free up our
>>>>>>> condor pool?
>>>>>>
>>>>>> Mag,
>>>>>>
>>>>>> I assume that you know this already, but '???????' is what condor_q displays
>>>>>> for ClassAd attributes that aren't in the ClassAd.  In your case, I'd *guess*
>>>>>> that the jobs got evicted from their machines for some reason (without
>>>>>> understanding your pool layout, it's difficult to speculate what a "minor
>>>>>> storage problem" could cause), but are still in the "run" state...  This
>>>>>> makes no sense and AFAIK should never happen, but it nonetheless seems to be
>>>>>> the case.
>>>>>>
>>>>>> I think that you'll have to force the jobs to rematch to a new machine.
>>>>>> Perhaps 'condor_vacate_job' could be used to accomplish this?
>>>>>>
>>>>>> Hope this helps
>>>>>>
>>>>>> -Nick
>>>>>>
>>>>>> --
>>>>>>           <<< The matrix has you. >>>
>>>>>>  /`-_    Nicholas R. LeRoy               The Condor Project
>>>>>> {     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
>>>>>>  \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
>>>>>>  |_*_|   608-265-5761                    Department of Computer Sciences
>>>>>>
>>>>>
>>>>
>>>
>>
>
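
As noted in the thread, '???????' is what condor_q prints when an attribute is missing from the job ClassAd. A minimal sketch of how one might list the affected jobs follows. It assumes the host column comes from the RemoteHost attribute and that the input looks like `condor_q -long` output (one "Name = Value" per line, ads separated by blank lines). The sample ads, host name, and helper function names are made up for illustration.

```python
# Sketch: find jobs that claim to be running but carry no RemoteHost attribute.
# Input is text in the style of `condor_q -long`; the ads below are invented.

RUNNING = 2  # JobStatus value Condor uses for running jobs


def parse_ads(text):
    """Split long-format output into a list of {attribute: raw value} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current ad, if one is in progress.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name] = value.strip('"')
    if current:
        ads.append(current)
    return ads


def running_without_host(ads):
    """Return job ids (cluster.proc) that are running but have no RemoteHost."""
    return [
        f"{ad['ClusterId']}.{ad['ProcId']}"
        for ad in ads
        if ad.get("JobStatus") == str(RUNNING) and "RemoteHost" not in ad
    ]


# Hypothetical sample: 101 runs on a host, 102 runs with no host, 103 is idle.
SAMPLE = """\
ClusterId = 101
ProcId = 0
JobStatus = 2
RemoteHost = "slot1@node17.example.org"

ClusterId = 102
ProcId = 0
JobStatus = 2

ClusterId = 103
ProcId = 0
JobStatus = 1
"""

if __name__ == "__main__":
    print(running_without_host(parse_ads(SAMPLE)))
```

The job ids this returns are the candidates for the condor_vacate_job suggestion above. Whether that actually clears the stale state is not confirmed anywhere in this thread.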