
[HTCondor-users] Error submitting a large amount of jobs at the same time, multiple times.



I have a program that automatically generates a submit file, submits it, and processes the results.  Running a single instance works fine and the output is as expected.  I am now trying to add support for running 3+ instances of this program at once, and I am running into errors when the instances submit their submit files.
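To make the setup concrete, here is a minimal sketch of the generate-and-submit step.  It is not my actual program: the function names, the file names (worker.exe, run.log, out.N.txt, err.N.txt), and the per-job arguments are all hypothetical placeholders.  It only illustrates the pattern of one "queue" statement per job followed by a call to condor_submit.

```python
import subprocess
from pathlib import Path


def build_submit_text(executable, n_jobs):
    """Return submit-file text with one 'queue' statement per job.

    All file names and arguments here are hypothetical placeholders.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = run.log",
        "",
    ]
    for i in range(n_jobs):
        lines += [
            f"arguments = {i}",
            f"output = out.{i}.txt",
            f"error = err.{i}.txt",
            "queue",
            "",
        ]
    return "\n".join(lines)


def write_and_submit(path, executable, n_jobs, dry_run=True):
    """Write the submit file, then hand it to condor_submit.

    With dry_run=True the command is only returned, not executed,
    so the sketch can be tried on a machine without HTCondor.
    """
    Path(path).write_text(build_submit_text(executable, n_jobs))
    cmd = ["condor_submit", str(path)]
    if dry_run:
        return cmd
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Each instance of the program does the equivalent of write_and_submit("job.sub", "worker.exe", n_jobs, dry_run=False) with its own submit file, then waits for the results.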

Details are as follows.  Each submit file contains 99 "queue" statements.  Even when I submit only one of these files, it takes a very long time (on the order of 5 minutes) to go from the "Submitting jobs" notification to the final output of "Submitting jobs............ 99 job(s) submitted to cluster x".  When I submit 3 separate submit files, each with 99 "queue" statements, the first submit window says:

"Submitting jobs............. 
99 job(s) submitted to cluster 3271.
Can't send RESCHEDULE command to condor scheduler"

Despite that message, the first submission seems to run successfully.  The second submission also runs successfully (finishing after the first one, as expected), but the third fails outright and says:

"Submitting job(s)
ERROR: Failed to create cluster"

Since it takes so long to finish the "Submitting jobs......" output, the three separate instances are all in the "submitting" stage at exactly the same time.  I do not understand why I'm having such problems submitting these jobs.  Does anyone have any ideas?  Should it take that long to submit?

My pool manager is also configured to submit and run jobs.  I know that isn't ideal, but I don't believe it should be a major issue here, since the machine has 8 cores and 16GB of RAM and runs 64-bit Windows 7.

Thank you in advance for any thoughts, suggestions, or ideas!