
[HTCondor-users] Failed to connect to schedd



I am submitting jobs from Python in a loop that contains this:

    sub = htcondor.Submit(submit_dict)
    with schedd.transaction() as txn:
        id = sub.queue(txn)

I want to submit thousands of jobs, each with a different
submit_dict. The first 24 are submitted successfully, and then the
call to schedd.transaction() starts raising 'Failed to connect to
schedd'.
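
If opening one transaction per job is part of the problem, one thing
to try is queueing several jobs inside a single transaction, so that
far fewer connections to the schedd are opened. The following is only
a minimal sketch, assuming the bindings accept several queue() calls
within one transaction. submit_batch and its parameters are
hypothetical names; the schedd object and the htcondor.Submit class
are passed in as arguments.

```python
def submit_batch(schedd, submit_factory, submit_dicts):
    """Queue one job per dict, all inside a single transaction.

    schedd         -- object with a transaction() context manager
                      (assumed: an htcondor.Schedd instance)
    submit_factory -- callable building a submit object from a dict
                      (assumed: htcondor.Submit)
    submit_dicts   -- iterable of submit description dicts
    """
    cluster_ids = []
    # One transaction, and so one connection, for the whole batch.
    with schedd.transaction() as txn:
        for submit_dict in submit_dicts:
            sub = submit_factory(submit_dict)
            cluster_ids.append(sub.queue(txn))
    return cluster_ids
```

With the real bindings this would be called as
submit_batch(schedd, htcondor.Submit, list_of_dicts), possibly in
chunks of a few hundred dicts rather than all at once.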

I get the error twice, then 12 jobs submit, then I get the error
once, then 6 jobs submit. It continues like this: a few errors,
then a few successful submits.
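
Since the failures come and go, a retry wrapper with a growing delay
would at least show whether the schedd recovers when given time. This
is a minimal sketch: with_retries and its parameters are hypothetical
names, and catching Exception by default is an assumption, because the
exact exception type raised by the bindings depends on the version.

```python
import time


def with_retries(action, attempts=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    The delay doubles after every failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Log each failure so the error pattern can be inspected later.
            print(f"attempt {attempt} failed ({exc}); "
                  f"retrying in {delay:.1f}s")
            sleep(delay)
```

The body of the submit loop would go into a small function (the
transaction plus the queue call) that is passed in as action, so each
job is retried on its own.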

This is my MAX_JOBS_RUNNING setting on the master:

    condor_config_val MAX_JOBS_RUNNING
    MIN({23933, 10000})

And this is it on both execute hosts:

    condor_config_val MAX_JOBS_RUNNING
    MIN({128651, 10000})

condor_status shows 352 slots available.

I don't see any errors in the submit log. Does anyone know how I can
fix this or debug it further?