Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Failure to Submit Jobs - Intensive IO on one of the Nodes Causing Collector Outage

Date: Wed, 19 Feb 2014 04:38:55 -0800
From: Andrey Kuznetsov <akuznet1@xxxxxxxx>
Subject: Re: [HTCondor-users] Failure to Submit Jobs - Intensive IO on one of the Nodes Causing Collector Outage

Apparently, the disk load on the processing machine is so high that
condor has trouble communicating, and the submit fails. When jobs are
submitted to run on the collector machine under the same 100% disk IO,
submission never fails. That is probably because the OS and the condor
logs and other files there are on a dedicated SSD, instead of on the
RAID5 array that is being 100% utilized. I am not sure how much it
matters that, in the case where no jobs fail to submit, the submission
machine and the required execution machine are both the collector
machine.
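
One way to test the "same disk" theory is to compare the device IDs of
the condor log directory and the data directory on each machine. This
is a minimal sketch, assuming a POSIX system; the function names and
the paths in the usage comment are placeholders, not the actual layout
on these machines.

```python
import os


def device_of(path):
    """Return the ID of the device that holds `path`."""
    return os.stat(path).st_dev


def shares_device(path_a, path_b):
    """Return True if both paths live on the same device."""
    return device_of(path_a) == device_of(path_b)


# Example use (placeholder paths, substitute the real ones):
#   shares_device("/var/log/condor", "/data")
# True would mean the condor logs compete with the job for the same disk.
```

If this returns True on the processing machine and False on the
collector, that would be consistent with the explanation above, though
it does not prove it.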

On Tue, Feb 18, 2014 at 9:21 AM, Andrey Kuznetsov <akuznet1@xxxxxxxx> wrote:
> I thought it was clear when I said one machine is a collector and another
> machine is the data processor.
>
> The collector is idle with no load.
>
> --
> Andrey Kuznetsov
>
> On Feb 18, 2014 7:31 AM, "Ben Cotton" <ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Andrey,
>>
>> Does your collector also run jobs? It's not clear from your
>> description. If that's the case, you may want to dedicate another
>> machine for the scheduler and central manager roles.
>>
>>
>> Thanks,
>> BC
>>
>>
>> --
>> Ben Cotton
>> main: 888.292.5320
>>
>> Cycle Computing
>> Leader in Utility HPC Software
>>
>> http://www.cyclecomputing.com
>> twitter: @cyclecomputing

-- 
Andrey Kuznetsov <akuznet1@xxxxxxxx>
