Re: [HTCondor-users] reached max workers



On Tue, Feb 6, 2018 at 3:15 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> On 2/6/2018 12:01 PM, Michael Pelletier wrote:
>> Hm, perhaps this is something different then. The query workers are used for condor_q and condor_status, so you shouldn't see them invoked during a condor_submit as I understand it.
>>
>
> Michael is correct ... as usual! :)
>
> The config knobs COLLECTOR_QUERY_WORKERS and SCHEDD_QUERY_WORKERS just control the maximum number of times the collector and schedd are allowed to fork when servicing a condor_status or condor_q query, respectively.  I suggest leaving them at their defaults.
>
>> When submitting jobs from Python I sometimes get an error connecting to the schedd.
>
> What is the error you see in Python?

The code that interacts directly with condor did not have any exception
handlers, but the code that called it did use a try/except block, and the
error it caught was "Failed to connect to schedd". When I looked in the
schedd log I saw this message:

ForkWork: not forking because reached max workers 8

The log entry appeared at the same time the script printed its error. I have
since added try/except blocks around all the condor calls so I can see which
command gets the error.

> Most common reason I have seen errors connecting to a schedd from Python is if a Python process attempts to open multiple connections to the same schedd (perhaps from multiple threads). Be aware that, at least for now, each Python process may only have one schedd transaction open at a time.  If you attempt to open a second schedd connection from the same process, it will fail.  If you attempt to open a second connection to a schedd from a different process, that second process will wait until the first process closes the connection.  As such, it is a good idea for your Python program to do minimal processing while the schedd connection is open, so that the connection may be closed as soon as possible.

My code does start multiple threads, and each thread submits hundreds of
jobs and then waits for them to complete. This is a simplified version
of the code:

        import htcondor

        # condor_host and the list of jobs come from the surrounding code
        coll = htcondor.Collector(condor_host)
        schedd_ad = coll.locate(htcondor.DaemonTypes.Schedd)
        schedd = htcondor.Schedd(schedd_ad)

        # Build one Submit object per job before opening the transaction
        submits = []
        for job in jobs:
            submit_dict = {
                # ... job-specific attributes elided ...
            }
            submits.append(htcondor.Submit(submit_dict))

        # Queue everything in a single transaction and remember the cluster ids
        ids = []
        with schedd.transaction() as txn:
            for submit in submits:
                ids.append(submit.queue(txn))

        # Poll the schedd until every cluster reports JobStatus == 4 (Completed)
        completed_jobs = []
        while len(ids) != len(completed_jobs):
            for id in ids:
                if id in completed_jobs:
                    continue
                ads = schedd.xquery(requirements="ClusterId == %d" % id)
                for ad in ads:
                    if ad['ClusterId'] == id and ad['JobStatus'] == 4:
                        completed_jobs.append(id)

So this same thing is going on in multiple threads in parallel. Would
that be the problem?
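
If that is the cause, would serializing the schedd access across the threads
avoid it? Something like the sketch below is what I have in mind (submit_lock
would be a single threading.Lock shared by all the worker threads, and
submit_batch / submit_dicts are just placeholder names standing in for my
real code):

    import threading

    import htcondor

    submit_lock = threading.Lock()  # one lock shared by every worker thread

    def submit_batch(schedd, submit_dicts):
        # Build all the Submit objects before touching the schedd
        submits = [htcondor.Submit(d) for d in submit_dicts]
        ids = []
        # Only one thread opens a transaction at a time, and the transaction
        # stays open just long enough to queue the jobs
        with submit_lock:
            with schedd.transaction() as txn:
                for submit in submits:
                    ids.append(submit.queue(txn))
        return ids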


> Another possibility is that your schedd is overloaded for some reason.  Each schedd classad will have an attribute "RecentDaemonCoreDutyCycle" that serves as a load metric of sorts.  If this value is greater than 0.98 (98%), that could be the problem.  To view this value for all your submit machines in your pool, do
>
>  condor_status -schedd -af name recentdaemoncoredutycycle

Does that show the value at the time I run the command, or is it a maximum
or cumulative value? When I run it now I get:

0.005028691028207022
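
Either way, I will keep an eye on it while the submits are running. From
Python I was planning to check it roughly like this (just a sketch; I am
assuming the collector query below returns the schedd ads with those two
attributes):

    import htcondor

    coll = htcondor.Collector(condor_host)  # same condor_host as in my code above
    # Ask the collector for just the schedd name and the duty-cycle metric
    ads = coll.query(htcondor.AdTypes.Schedd,
                     "true",
                     ["Name", "RecentDaemonCoreDutyCycle"])
    for ad in ads:
        print(ad.get("Name"), ad.get("RecentDaemonCoreDutyCycle"))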

>
>> Looking in the log I see this:
>>
>> ForkWork: not forking because reached max workers 8
>
> Apparently your schedd is receiving a lot of simultaneous query (or condor_q) calls, and/or a lot of very large queries.
>
> Note that doing a query of a schedd that has a lot (many thousands) of submitted jobs and asking for all of the attributes is expensive.  I.e. if you just need a few job classad attributes like owner, do this:
>
>    condor_q -all-users -af owner
>
> instead of
>
>    condor_q -all-users -l | grep -i owner
>
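
That is a good point for my polling loop above as well; the xquery call there
pulls full classads even though I only look at ClusterId and JobStatus. If I
understand the bindings correctly, I can pass a projection so only those
attributes come back, i.e. replace the xquery line above with something like:

    ads = schedd.xquery(requirements="ClusterId == %d" % id,
                        projection=["ClusterId", "JobStatus"])
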
>>
>> Thanks. The doc says the default for SCHEDD_QUERY_WORKERS is 3, but I am not setting it and the error says the max is 8.
>>
>
> Are you looking at the version of the manual that corresponds to the version of HTCondor you are running? Be warned that Google searches often end up pointing
> at ancient versions of the Manual.

Yes, that was it. Google pointed me at
htcondor/manual/v7.8/3_3Configuration.html, but I am running 8.6.8.

> Current versions of the manual have the correct value; it looks like the documentation on this
> value was updated 3+ years ago - see
>
> http://research.cs.wisc.edu/htcondor/manual/current/3_5Configuration_Macros.html#25800
>
> Also I suggest always checking values with condor_config_val -v, like so:
>
> % condor_config_val -v schedd_query_workers
> SCHEDD_QUERY_WORKERS = 8
>  # at: <Default>
>  # raw: SCHEDD_QUERY_WORKERS = 8
>
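
Will do. I think the same value can also be read from the Python bindings via
htcondor.param, which should help when I am debugging from inside the submit
script:

    import htcondor

    # htcondor.param behaves like a dict of the resolved configuration;
    # the value should come back as a string, e.g. "8"
    print(htcondor.param["SCHEDD_QUERY_WORKERS"])
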
>> In any case, if I have 264 compute nodes would I set that (and
>> COLLECTOR_QUERY_WORKERS) to 264 so I could use them all simultaneously?
>>
>
> Nope.  Again, I suggest you remove all references to these knobs and condor_reconfig.

I did not make any changes to those.

> Hope the above helps,

Yes, it does. Thank you so much!