Re: [HTCondor-users] have people seen scalability issues with condor submission using the python bindings?



With current releases of HTCondor (v8.7.9+), it is easy to submit millions of jobs in a fraction of a second from either Python or the command line, provided the jobs are submitted as one big cluster containing lots of jobs. To submit really large batches of jobs, we want to invoke condor_submit (or the Python Submit.queue() method) just once to queue up a big bag of jobs, and include a 'late materialization' attribute in the job description like
  materialize_max_idle = X
where X is the maximum number of idle (non-running) jobs to have "materialized" in the queue at any one time.  So if materialize_max_idle=50, the schedd will create 50 instances of your job in the queue immediately. If 30 of them find matching slots and start running, only 20 idle instances remain, so the schedd will immediately create 30 more to bring the idle count back up to 50.
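
To make the top-up arithmetic concrete, here is a toy model in plain Python. It is only a sketch of the bookkeeping described above, not how the schedd is actually implemented; the function name materialize and its arguments are made up for illustration.

```python
def materialize(idle, next_proc, total_jobs, max_idle):
    """Toy model: top the idle count back up to max_idle.

    idle       -- number of idle jobs currently in the queue
    next_proc  -- id of the next job that has not been created yet
    total_jobs -- the N from 'queue N'
    Returns (new_idle, new_next_proc).
    """
    room = max_idle - idle              # idle jobs we are allowed to add
    remaining = total_jobs - next_proc  # jobs not yet materialized
    to_create = max(0, min(room, remaining))
    return idle + to_create, next_proc + to_create


total_jobs, max_idle = 500000, 50

# At submit time: 50 idle jobs (procs 0..49) are created right away.
idle, next_proc = materialize(0, 0, total_jobs, max_idle)

# 30 of them match slots and start running, leaving 20 idle.
running = 30
idle -= running

# The next pass creates 30 more (procs 50..79) to get back to 50 idle.
idle, next_proc = materialize(idle, next_proc, total_jobs, max_idle)
print(idle, running, next_proc)  # prints: 50 30 80
```

The min() against the remaining count is why the idle number eventually drains to zero once all 500,000 jobs have been created.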

To try it out, download HTCondor v8.7.9+ and then in the condor_config[.local] file on your schedd machine add:
  
  # Enable schedd late materialization feature.  Soon HTCondor
  # will have this feature enabled by default, but as of
  # HTCondor v8.7.9 we still need to opt in.
  SCHEDD_ALLOW_LATE_MATERIALIZE = True   

Next, here is an example of submitting 500,000 jobs from Python.  I just tried this on my Windows laptop, and submitting 500k jobs took a fraction of a second.

   import htcondor
   schedd = htcondor.Schedd()
   # Create a Submit object, initializing it with a job
   # description that has the exact same format as condor_submit tool.
   sub = htcondor.Submit('''
     executable = /bin/sleep
     materialize_max_idle = 50
     arguments = $(Process)
     queue 500000
   ''')
   # Queue the whole cluster in a single transaction.
   with schedd.transaction() as txn:
       jobClusterId = sub.queue(txn)

After running the above, jobClusterId contains the cluster id of the submission.  In my case it was 21.  If I do "condor_q 21", I see

   -- Schedd: localhost : <192.168.80.154:9618?... @ 08/21/18 00:53:24
   OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
   tannenba ID: 21       8/21 00:49     81     16     50 500000 21.81-146

I can remove all 500,000 jobs super fast as well via just 'condor_rm 21' (or from Python via a constraint of ClusterId==21).
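
For the Python route, a minimal sketch might look like the following. Schedd.act() and htcondor.JobAction.Remove come from the Python bindings; the helper names cluster_constraint and remove_cluster are just made up for this example, and I am assuming you already have a Schedd object in hand.

```python
def cluster_constraint(cluster_id):
    """Build a ClassAd constraint that matches every job in one cluster."""
    return "ClusterId == %d" % int(cluster_id)


def remove_cluster(schedd, cluster_id):
    """Remove all jobs in the given cluster with a single schedd call.

    schedd is expected to be an htcondor.Schedd instance.
    """
    # Imported here so cluster_constraint() can be used (and tested)
    # on a machine without the HTCondor bindings installed.
    import htcondor
    return schedd.act(htcondor.JobAction.Remove, cluster_constraint(cluster_id))
```

With the submission above, that would be called as remove_cluster(schedd, jobClusterId). Because it is one call with one constraint, the client does not have to loop over individual job ids.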

Hope the above helps,
Todd



On 8/21/2018 11:00 AM, Bob Ball wrote:
> I have not tried python, and not that version of Condor, but I have seen 
> in the past where too many jobs submitted at once has overwhelmed Condor 
> job submission.
> 
> bob
> 
> On 8/21/2018 11:38 AM, Jose Caballero wrote:
>> 2018-08-21 11:28 GMT-04:00 Jose Caballero <jcaballero.hep@xxxxxxxxx>:
>>> Hi,
>>>
>>> I am observing what I believe are some scale problems trying to submit
>>> using the python bindings.
>>> Version of condor is 8.6.12
>>> My application has multiple threads, and when they all try to submit
>>> almost at the same time, using the same Schedd, around 40% of them
>>> succeed and ~60% fail.
>>> I know I should write the code smarter, maybe some thread locking, or
>>> similar trick.
>>> But, in any case, I am just wondering if people have observed a
>>> similar behavior. And, in that case, how they fixed it.
>>>
>>> Cheers,
>>> Jose
>> I think I forgot to include the error message :)
>>
>>     with self.schedd.transaction() as txn:
>> RuntimeError: Failed to connect to schedd.
>>
>> where self.schedd is an instance of htcondor.Schedd()
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx 
>> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>
> 


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685