
Re: [HTCondor-users] thoughts on HTCondor python bindings submit improvements



Internally to the schedd, jobs are stored as one very large classad called the cluster ad, which is shared by all jobs in the cluster and has a "jobid" of (ClusterId, -1). Each job is one tiny classad with a pointer to the cluster ad, usually containing only one or two attributes (e.g. JobStatus).
The keys for these job ads are (ClusterId, ProcId), where ProcIds start at 0 and count up. You can see this if you look at the job_queue.log.
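
To make the layout concrete, here is a rough toy model in Python. It is only a sketch: plain dicts stand in for classads, and the helper names (make_queue, lookup) and the attribute values are made up for illustration. It is not how the schedd actually stores anything.

```python
# Toy model only: plain dicts stand in for classads.
CLUSTER_PROC = -1  # the cluster ad is keyed as (ClusterId, -1)

def make_queue(cluster_id, cluster_attrs, num_jobs):
    """Build one shared cluster ad plus one tiny ad per job."""
    queue = {(cluster_id, CLUSTER_PROC): dict(cluster_attrs)}
    for proc_id in range(num_jobs):  # ProcIds start at 0 and count up
        queue[(cluster_id, proc_id)] = {"JobStatus": 1}  # placeholder value
    return queue

def lookup(queue, cluster_id, proc_id, attr):
    """Check the job's own ad first, then fall back to the cluster ad."""
    job_ad = queue[(cluster_id, proc_id)]
    if attr in job_ad:
        return job_ad[attr]
    return queue[(cluster_id, CLUSTER_PROC)][attr]

# Illustrative attribute values, not taken from a real job.
queue = make_queue(42, {"Cmd": "/bin/sleep", "Args": "60", "Owner": "alice"}, 3)
print(lookup(queue, 42, 2, "Cmd"))        # found in the shared cluster ad
print(lookup(queue, 42, 2, "JobStatus"))  # found in the job's own ad
```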

What condor_submit sends to the schedd on submission is *explicitly* this. So instead of sending 70 or 80 attributes per job to the schedd, it sends 70 or 80 attributes once plus 2 attributes per job. This is a HUGE difference from the perspective of the schedd.
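
As a back-of-the-envelope check, the arithmetic looks like this. The function name and the 1000-job cluster size are just assumptions for the example; 80 and 2 are the rough figures from above.

```python
def attrs_sent(num_jobs, cluster_attrs=80, per_job_attrs=2):
    """Count attributes sent with and without a shared cluster ad."""
    flat = cluster_attrs * num_jobs                    # every job carries everything
    shared = cluster_attrs + per_job_attrs * num_jobs  # cluster ad once + tiny job ads
    return flat, shared

flat, shared = attrs_sent(1000)
print(flat, shared)  # 80000 2080
```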

Inside the bindings,  

    with transaction() as txn:
        for d in list_of_dicts:
            sub.queue(d)

would mean that the C++ code of the sub.queue() method gets called many times but has no way of knowing which call is the last. The first call can be identified, and in HTCondor 8.7.8 we use this information to send the cluster ad ONLY on the first call, and to send only a tiny job ad for each subsequent call to sub.queue().
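
A toy sketch of that asymmetry, in pure Python. ToySubmit is a hypothetical stand-in, not the real htcondor.Submit class, and it only records what it would send rather than talking to a schedd.

```python
class ToySubmit:
    """Hypothetical stand-in for the submit object; records what it would send."""

    def __init__(self, cluster_attrs):
        self.cluster_attrs = cluster_attrs
        self.sent = []
        self.next_proc = 0

    def queue(self, job_attrs):
        if self.next_proc == 0:
            # The first call is easy to detect, so the big ad goes out once.
            self.sent.append(("cluster_ad", dict(self.cluster_attrs)))
        self.sent.append(("job_ad", self.next_proc, dict(job_attrs)))
        self.next_proc += 1
        # Nothing visible here says whether another call will follow.

sub = ToySubmit({"Cmd": "/bin/sleep"})
for d in [{"Args": "10"}, {"Args": "20"}, {"Args": "30"}]:
    sub.queue(d)

print([msg[0] for msg in sub.sent])
```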

With a new feature called "late materialization", condor_submit sends the cluster ad plus the submit file to the schedd; no job ads are sent at all. The schedd then uses the submit file to create job classads as needed. (In this case the submit file always ends in a "QUEUE <n> from <itemsfile>" statement.)

So to do late materialization submits from the python bindings using the loop above, each sub.queue() call would just append a line to the <itemsfile> of the QUEUE FROM statement, and then the *last* sub.queue() call would transmit the submit file and <itemsfile> to the schedd.
But of course, there is no way to identify the last call when using the loop above. Python knows, but it never tells our C++ code.

On the other hand, if the sub.queue() method is called only once and is passed an iterator over list_of_dicts, then the C++ code *does* see the iterator stop, and at that point it can send the final <itemsfile> to the schedd.
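
Roughly, the single-call shape would behave like the sketch below. Again this is a toy: ToyLateSubmit, its queue(items) signature, and the comma-separated item line format are assumptions for illustration, not the actual bindings API or wire format.

```python
class ToyLateSubmit:
    """Hypothetical stand-in; collects item lines and 'sends' them once."""

    def __init__(self, submit_text):
        self.submit_text = submit_text
        self.sent = []

    def queue(self, items):
        lines = []
        for item in items:  # the loop ends when the iterator is exhausted
            lines.append(", ".join(str(v) for v in item.values()))
        # Reaching this point means the callee itself saw the iterator stop,
        # so it knows the items list is complete and can send it once.
        self.sent.append((self.submit_text, lines))
        return len(lines)

list_of_dicts = [{"Args": "10"}, {"Args": "20"}, {"Args": "30"}]
sub = ToyLateSubmit("executable = /bin/sleep")
count = sub.queue(iter(list_of_dicts))
print(count, len(sub.sent))
```

The point is only that the code consuming the iterator is the code that observes its end, which the per-item loop in the caller never gives us.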

-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Dimitri Maziuk
Sent: Friday, April 27, 2018 12:20 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] thoughts on HTCondor python bindings submit improvements

On 04/27/2018 10:13 AM, John M Knoeller wrote:
...
> We could stop there, but I also thought it would be nice to submit using a native python iterator also, and it seemed to me that it is clearer to have the user always pass the iterator of the htcondor.Submit class to the queue statement explicitly rather than implicitly.

I think if your submit file parser returns a list of dicts, each
describing one job complete with executable and universe and all, then
the user could sub.queue(txn, dict) in a loop. Or sub.queue(txn,
**dict), whichever way you like it.

I'm not sure I get the bindings code argument, but then I haven't seen
it:  I assume there is a reason you can't do your "last job" magic in
the StopIteration catch block?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu