
Re: [HTCondor-users] condor will not use preferred host at all



If I run ps -efal | grep condor on the 2 execute hosts, the only
difference is that chopin (the one I cannot get condor to use) has
this extra process:

condor_shared_port

That is not on liszt. Is that an issue?

On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>> On 2/19/2018 4:10 PM, Larry Martell wrote:
>>>
>>> As a test I removed liszt from the config and it still will not use
>>> chopin, even though a condor_status shows all slots as 'Unclaimed Idle'.
>>>
>>
>> What do you mean when you say "removed liszt from the config" ?
>
> I set NUM_SLOTS to 0
>
>> Do you mean you removed it from your HTCondor pool (i.e. did condor_off on
>> liszt) ?
>>
>> So now when you do a "condor_status", the only thing you see is slots on
>> chopin?
>
> Correct.
>
>> And yet your job remains idle and refuses to run on chopin?
>
> Correct.
>
>> Maybe your job does not match any slots on chopin
>
> What makes a match? Before the reboot the same jobs I am trying to run
> today were running on chopin and when that was full, ran on liszt.
>
>> -- perhaps because chopin
>> has less memory than liszt, for instance.
>
> The 2 machines are identical in every way - CPU, memory, etc. The
> reason they were rebooted was that a 10G point-to-point connection was
> installed between them.
>
>> See the manual section "Why is
>> the job not running?" at http://tinyurl.com/ycbut82r
>
> That tiny url does not work, but I've been looking at this page:
>
> http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
>
> and here is the output of condor_q -analyze:verbose when run on the master:
>
> -- Schedd: bach.elucid.local : <192.168.10.2:20734>
> No successful match recorded.
> Last failed match: Mon Feb 19 17:19:52 2018
>
> Reason for last match failure: no match found
>
> 21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
>       0 are rejected by your job's requirements
>       0 reject your job because of their own requirements
>       0 match and are already running your jobs
>       0 match but are serving other users
>       0 are available to run your job
>
>>> On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx>
>>> wrote:
>>>>
>>>> I have a master and 2 execute hosts (chopin and liszt) and I have one
>>>> host (chopin) preferred over the other with these settings:
>>>>
>>>> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
>>>>     (1000000 * (RemoteOwner =?= UNDEFINED)) + \
>>>>     (100 * Machine =?= "chopin")
>>>> NEGOTIATOR_DEPTH_FIRST = True
>>>>
>>>> The preferred host is chopin. This had been working fine until Friday,
>>>> when both execute hosts were rebooted. Since then condor will only run
>>>> jobs on liszt. Even if there are more jobs in the queue than slots on
>>>> liszt, it will not use chopin. A condor_status shows all the slots on
>>>> chopin as 'Unclaimed Idle'. I see all the proper daemons running and no
>>>> errors in the logs.
>>>>
>>>> How can I debug and/or fix this?
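
For reference, the intended arithmetic of the NEGOTIATOR_PRE_JOB_RANK setting quoted above can be sketched in plain Python. This is only an illustration, not HTCondor code: the function name pre_job_rank and the sample slot tuples are made up here, and the comparison against "chopin" is parenthesized explicitly, which the third term of the quoted expression is not. It also assumes the slot advertises its Machine value as the short name "chopin"; if a fully qualified name is advertised instead, an exact string comparison would not give the bonus.

```python
from typing import Optional


def pre_job_rank(machine: str, machine_rank: float,
                 remote_owner: Optional[str]) -> float:
    """Hypothetical model of the intended rank arithmetic (not HTCondor API).

    machine_rank stands in for My.Rank, remote_owner for RemoteOwner
    (None models UNDEFINED), and machine for the Machine attribute.
    """
    return (
        10_000_000 * machine_rank
        + 1_000_000 * int(remote_owner is None)   # prefer unclaimed slots
        + 100 * int(machine == "chopin")          # small bonus for chopin
    )


# Two idle, unclaimed slots with equal machine rank (sample values only).
slots = [
    ("liszt", 0.0, None),
    ("chopin", 0.0, None),
]

# Higher value is preferred, so sort in descending order.
ranked = sorted(slots, key=lambda s: pre_job_rank(*s), reverse=True)
for name, rank, owner in ranked:
    print(name, pre_job_rank(name, rank, owner))
```

Under these assumptions chopin sorts ahead of liszt by the 100-point bonus. A rank of this kind only orders slots that already match the job, so it would not by itself explain the "no match found" result in the condor_q analysis above.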