
Re: [Condor-users] the infamous question mark problem



It seems restarting my collector did not help either.

Any other suggestions?

On Wed, Mar 17, 2010 at 11:43 PM, Mag Gam <magawake@xxxxxxxxx> wrote:
> Minor storage problem = NFS hiccup.
>
> Well, the most recent jobs show [?????????????????????????????].
>
> I am going to restart my collector to see if this fixes the problem.
>
> condor_vacate <jid> gives me:
> Can't find address for startd <jobid>
>
>
>
> On Wed, Mar 17, 2010 at 11:31 PM, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
>> On Wednesday 17 March 2010, Mag Gam wrote:
>>> Last week we had a minor storage problem in our pool. Since then, we
>>> see a lot of '???????' in the running host field when we do condor_q -run
>>> -direct schedd
>>>
>>> Is there a way to fix this? Some jobs show the proper hostname, but a
>>> lot of them show '???????'. Is there a way to free up our
>>> condor pool?
>>
>> Mag,
>>
>> I assume that you know this already, but '???????' is what condor_q displays
>> for ClassAd attributes that aren't in the ClassAd.  In your case, I'd *guess*
>> that the jobs got evicted from the machine for some reason (without
>> understanding your pool layout, it's difficult to speculate what a "minor
>> storage problem" could cause), but are still in the "run" state...  This
>> makes no sense and AFAIK should never happen, but it nonetheless seems to be
>> the case.
>>
>> I think that you'll have to force the jobs to rematch to a new machine.
>> Perhaps 'condor_vacate_job' could be used to accomplish this?
>>
>> Hope this helps
>>
>> -Nick
>>
>> --
>>           <<< The matrix has you. >>>
>>  /`-_    Nicholas R. LeRoy               The Condor Project
>> {     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
>>  \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
>>  |_*_|   608-265-5761                    Department of Computer Sciences
>>
>
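
For what it's worth, Nick's explanation quoted above (condor_q prints '???????' when the attribute is missing from the job ClassAd) suggests a way to list the affected jobs. Below is a minimal sketch, not a tested recipe. It assumes the host column is backed by the RemoteHost attribute, that the input is text in the "Name = value" form printed by condor_q -long with a blank line between ads, and that JobStatus 2 means running. The function names and the sample ads are made up for illustration.

```python
def parse_ads(long_output):
    """Split `condor_q -long`-style text into one dict per job ad.

    Assumes "Name = value" lines and a blank line between ads.
    Values are kept as raw strings; lines without " = " are ignored.
    """
    ads = []
    current = {}
    for line in long_output.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the ad collected so far.
            if current:
                ads.append(current)
                current = {}
            continue
        if " = " in line:
            name, value = line.split(" = ", 1)
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def running_without_host(ads):
    """Return cluster.proc ids of ads that are running but have no RemoteHost."""
    stuck = []
    for ad in ads:
        # JobStatus 2 is assumed to mean "running".
        if ad.get("JobStatus") == "2" and "RemoteHost" not in ad:
            stuck.append(f'{ad["ClusterId"]}.{ad["ProcId"]}')
    return stuck


# Hypothetical sample ads, not taken from a real pool.
SAMPLE = """\
ClusterId = 101
ProcId = 0
JobStatus = 2
RemoteHost = "slot1@node01.example.org"

ClusterId = 102
ProcId = 3
JobStatus = 2

ClusterId = 103
ProcId = 0
JobStatus = 1
"""

print(running_without_host(parse_ads(SAMPLE)))  # prints ['102.3']
```

The ids it returns are in the cluster.proc form that condor_vacate_job expects. Note that condor_vacate operates on machines rather than jobs, which may be why passing it a job id produced the "Can't find address for startd" error.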