I’m working on converting some jobs from the Condor vanilla universe to the grid universe so that they execute on GE nodes. Some of these jobs are DAGs with thousands of jobs, and I’m seeing the load on the Condor scheduler go through the roof when they are in the queue, because it is constantly running qstat commands to get info about the jobs. This is pretty understandable when you have that many jobs in the queue.
What I don’t understand is the logic behind the calls to the sge_helper script. From what I can see in our environment, it is only ever called with ONE job ID at a time, yet the script contains all this logic to support being called with a batch of job IDs across multiple GE environments. If it were called with multiple job IDs, I could see some efficiency being gained by calling qstat only once and turning that whole XML tree of info into usable data a single time. Instead, the helper gets called over and over, once per job per GridManager job probe interval, resulting in thousands and thousands of qstat calls.
I see that sge_helper is only called from sge_status, but sge_status is also only ever called with one job ID as an argument, and I don’t see any logic in sge_status to support being called with multiple job IDs. So why is sge_helper written this way?
I think I want to rewrite sge_helper to make it more streamlined, but I’m afraid I’m missing a scenario or making a poor assumption, and that I may break things entirely.
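To make the idea concrete, here is a rough sketch of the "one qstat call, parse once" approach. It is written in Python purely for illustration and is not how sge_helper is currently implemented. The function names and the sample XML are made up, and I am assuming that `qstat -xml` reports each job as a `job_list` element containing `JB_job_number` and `state` children, which should be checked against the real output on your GE version.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_qstat_xml(xml_text):
    """Return {job_id: state_code} for every <job_list> element found.

    Assumes the layout described above (JB_job_number and state children).
    """
    root = ET.fromstring(xml_text)
    states = {}
    for job in root.iter("job_list"):
        job_id = job.findtext("JB_job_number")
        if job_id is not None:
            states[job_id.strip()] = (job.findtext("state") or "").strip()
    return states


def query_all_jobs():
    """Run qstat a single time and parse the whole result (hypothetical helper)."""
    result = subprocess.run(
        ["qstat", "-xml"], capture_output=True, text=True, check=True
    )
    return parse_qstat_xml(result.stdout)


# Hand-written sample standing in for real qstat output (an assumption).
SAMPLE = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <state>r</state>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <state>qw</state>
    </job_list>
  </job_info>
</job_info>"""

if __name__ == "__main__":
    # Look up many job IDs from one parsed result instead of one qstat per job.
    print(parse_qstat_xml(SAMPLE))
```

A helper along these lines could answer every per-job lookup within one probe interval from a single parsed result, but I have not verified whether that fits the way the GridManager invokes sge_status.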
Does anyone have any guidance on this?