
Re: [HTCondor-users] Submit transaction timeout



I think if you switch to using late materialization, you will no longer have trouble submitting all of the jobs within the timeout window.   

 

To use late materialization, just add the max_idle or max_materialize command to your Submit object.  This tends to speed up submission by orders of magnitude when the submission contains a large number of jobs.

 

The condor_schedd cannot do other work while you hold the submit transaction open, so it is not a good idea to simply increase the timeout.

 

I should also note that this pattern of using schedd.transaction()

 

with schedd.transaction() as txn:
    submit.queue_with_itemdata(txn, 1, iter(itemdata_chunk))

 

is deprecated.

 

This pattern results in control going back and forth between C++ and Python while a transaction is open in the schedd, leaving the schedd locked for a long period of time.

 

You should use schedd.submit instead.

 

sub = htcondor.Submit(...)
sub['max_materialize'] = chunk_size

schedd.submit(sub, count=1, itemdata=iter(itemdata_chunk))

 

Since the transaction in this model is implicit rather than explicit, control does not return to Python while the connection to the schedd is open.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Peet Whittaker
Sent: Tuesday, September 19, 2023 11:03 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Submit transaction timeout

 

Hi,

 

When using the Python API to submit a large number of jobs (tens of thousands), we encounter the following error:

 

    RuntimeError: Failed to commit and disconnect from queue.

 

We use the following code to submit jobs in chunks:

 

    max_jobs_per_sub = htcondor.param['MAX_JOBS_PER_SUBMISSION']

    for itemdata_chunk in common.utils.iter_data_chunks(itemdata, max_jobs_per_sub):
        with schedd.transaction() as txn:
            submit.queue_with_itemdata(txn, 1, iter(itemdata_chunk))

 

If we use a smaller chunk size (say 5,000 rather than the default 20,000), we still encounter the error once a certain number of jobs have been submitted (usually around 30-50k).

 

Looking at the logs, and based on this message thread, it would seem that we’re hitting the schedd’s 20-second transaction timeout. Is there any way of increasing or avoiding this timeout?

 

The pool and central manager all run on Windows.

 

Kind regards,

 

Peet Whittaker

Discipline Lead for DevOps | Principal Software Developer

 

JBA Consulting, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD. Telephone: +441756699500

Visit our new website at www.jbaconsulting.com.

