Re: [HTCondor-users] Failure to Submit Jobs - Intensive IO on one of the Nodes Causing Collector Outage



Apparently the disk load on the processing machine is so high that
condor has trouble communicating, which results in the submit failure.
When jobs are submitted to run on the collector machine under the
same 100% disk IO, submission never fails. That is probably because
the OS and the condor logs and other files are on a dedicated SSD,
rather than on the RAID5 array that is 100% utilized. I am not sure
how much it matters that, in the case where no jobs fail to submit,
the submission machine and the required machine are both the
collector machine.
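
One rough way to check the theory above is to see whether the condor
log directory sits on the same filesystem as the data the jobs read
and write. The sketch below is only an illustration: the helper name
shares_device is made up here, and it compares filesystem device IDs,
so two partitions on the same physical disk would still show up as
different devices.

```python
import os


def shares_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths live on the same filesystem device.

    This compares st_dev from os.stat, which identifies the filesystem
    (not the physical disk) that holds each path.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

To use it, pass the directory reported by `condor_config_val LOG` and
the job data directory on the RAID5 array. A result of True on the
processing machine and False on the collector machine would be
consistent with the explanation above, though it would not prove it.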

On Tue, Feb 18, 2014 at 9:21 AM, Andrey Kuznetsov <akuznet1@xxxxxxxx> wrote:
> I thought it was clear when I said one machine is a collector and another
> machine is the data processor.
>
> The collector is idle with no load.
>
> --
> Andrey Kuznetsov
>
> On Feb 18, 2014 7:31 AM, "Ben Cotton" <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Andrey,
>>
>> Does your collector also run jobs? It's not clear from your
>> description. If that's the case, you may want to dedicate another
>> machine for the scheduler and central manager roles.
>>
>>
>> Thanks,
>> BC
>>
>>
>> --
>> Ben Cotton
>> main: 888.292.5320
>>
>> Cycle Computing
>> Leader in Utility HPC Software
>>
>> http://www.cyclecomputing.com
>> twitter: @cyclecomputing



-- 
Andrey Kuznetsov <akuznet1@xxxxxxxx>