
Re: [HTCondor-users] Failure to Submit Jobs - Intensive IO on one of the Nodes Causing Collector Outage



Just confirmed the diagnosis and solved the problem by reinstalling the
OS on an SSD. There are no more failures to submit.
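
For anyone who wants to check whether a node is in the same state, here
is a minimal sketch of how the disk utilization could be measured. It
assumes a Linux machine that exposes /proc/diskstats; the function names
and the device name "sda" are made up for illustration and are not part
of HTCondor.

```python
import time


def io_ticks(diskstats_text, device):
    """Return the milliseconds `device` has spent doing I/O so far."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, device name, then the per-device counters.
        # The 10th counter (index 12) is the time spent doing I/O, in ms.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def utilization(before, after, elapsed_s, device):
    """Percent of the interval the device was busy, from two snapshots."""
    busy_ms = io_ticks(after, device) - io_ticks(before, device)
    return 100.0 * busy_ms / (elapsed_s * 1000.0)


def sample(device, interval_s=1.0, path="/proc/diskstats"):
    """Take two snapshots `interval_s` apart and return the utilization."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return utilization(before, after, interval_s, device)
```

Calling sample("sda") on the execute node while the jobs are running
should return a value close to 100 if the disk that holds the OS and the
condor files is saturated.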

On Wed, Feb 19, 2014 at 4:38 AM, Andrey Kuznetsov <akuznet1@xxxxxxxx> wrote:
> Apparently, the disk load on the processing machine is so high that
> condor has trouble communicating, which causes the submit failures.
> When jobs are submitted to run on the collector machine under the
> same 100% disk IO, submission never fails. That is probably because
> the OS and the condor logs and other files there are on a dedicated
> SSD instead of on the RAID5 array that is 100% utilized. I am not
> sure how much it matters that, in the case where no jobs fail to
> submit, the submission machine and the required machine are both the
> collector machine.
>
> On Tue, Feb 18, 2014 at 9:21 AM, Andrey Kuznetsov <akuznet1@xxxxxxxx> wrote:
>> I thought it was clear when I said one machine is a collector and another
>> machine is the data processor.
>>
>> The collector is idle with no load.
>>
>> --
>> Andrey Kuznetsov
>>
>> On Feb 18, 2014 7:31 AM, "Ben Cotton" <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Andrey,
>>>
>>> Does your collector also run jobs? It's not clear from your
>>> description. If that's the case, you may want to dedicate another
>>> machine for the scheduler and central manager roles.
>>>
>>>
>>> Thanks,
>>> BC
>>>
>>>
>>> --
>>> Ben Cotton
>>> main: 888.292.5320
>>>
>>> Cycle Computing
>>> Leader in Utility HPC Software
>>>
>>> http://www.cyclecomputing.com
>>> twitter: @cyclecomputing
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
> --
> Andrey Kuznetsov <akuznet1@xxxxxxxx>



-- 
Andrey Kuznetsov <akuznet1@xxxxxxxx>