
[HTCondor-users] Failure to Submit Jobs - Intensive IO on one of the Nodes Causing Collector Outage


I am experiencing failures when submitting jobs in batches of 10.

I have two machines, each with 12 physical hyper-threaded cores, 24
condor slots, and 24GB of RAM. One machine acts as the condor pool's
collector and the other is used for data processing in this scenario.
Here's the scenario: I have a script that submits 110 jobs in batches
of 10. The first time the script runs, all 110 jobs are submitted
within 1-4 minutes (essentially no wait time) and all 24 slots on the
second machine fill up. Each job takes over an hour to execute.
When the script is executed again soon after, it tries to submit 110
more jobs in batches of 10 while 80+ jobs are still in the queue, but
this time condor_submit fails: it hangs for a very long time before
replying with:
Submitting job(s).........
ERROR: Failed to set Args="my args...." for job 2111.9 (110)
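For context, each batch of 10 is driven by a submit description along
these lines (a hypothetical sketch; the executable, arguments, and
file names are placeholders, not my actual submit file):

```
universe   = vanilla
executable = process_data.sh
arguments  = "my args...."
output     = job_$(Cluster).$(Process).out
error      = job_$(Cluster).$(Process).err
log        = job_$(Cluster).log
queue 10
```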

Luckily the script is set to resubmit to condor if a submit fails,
but it now takes up to an hour to submit all 110 jobs, sometimes with
up to 15 resubmit retries.
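The submit-and-retry logic is roughly the following sketch (the submit
file names, the back-off delay, and the retry limit are illustrative
placeholders, not the actual script; condor_submit is passed in as a
parameterized command, so the loop is only run where condor exists):

```shell
#!/bin/sh
# Retry a submit command up to MAX_RETRIES times on non-zero exit
# (condor_submit exits non-zero when a submission fails).
MAX_RETRIES=15

submit_with_retry() {
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$MAX_RETRIES" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep 1   # brief back-off before resubmitting
    done
}

# 110 jobs in 11 batches of 10: job_batch_01.sub .. job_batch_11.sub
# (hypothetical file names), each queueing 10 jobs.
if command -v condor_submit >/dev/null 2>&1; then
    for i in $(seq 1 11); do
        submit_with_retry condor_submit "job_batch_$(printf '%02d' "$i").sub"
    done
fi
```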

The jobs are extremely disk-intensive, keeping an 8-disk RAID5 array
on the processing machine at 100% utilization.

Also, while the submission script is running, condor_q -g fails: the
collector does not reply, printing something along the lines of a
"failed to retrieve ClassAd" message.

I'd like to understand what is causing this and how it can be fixed.
My initial impression is that it's a disk I/O issue: condor on the
second machine becomes unresponsive to the collector on the first
machine, which in turn causes the submission failures.

Note: the Scientific Linux 6.4 OS lives on the RAID5 array. It is
planned to be moved to an SSD soon. Will that help, and should the
move be done now to fix this issue?

Thank you

Andrey Kuznetsov <akuznet1@xxxxxxxx>