
Re: [HTCondor-users] Error submitting a large number of jobs at the same time, multiple times.



I made a simple change, putting the log file on a local drive, and it submitted in a few seconds.  Clearly the log file residing on a network drive is the issue.  I can work around it in the program for now, but I would still like to know if there's any way to fix this delay, since in many of our applications it's convenient to put this log file on a network drive.  I'm also curious whether setting "InitialDir" to a directory on a network drive has similar issues.
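
Roughly, the change amounts to pointing only the user log at a local disk in the submit description file while everything else stays on the share.  A minimal sketch (all paths here are illustrative, not our real ones):

   # sketch only - paths are hypothetical
   universe   = vanilla
   executable = run_job.exe
   initialdir = \\xx.xx.29.28\randd\Test\tiger1
   # only the user log moved to a local drive
   log        = C:\condor_logs\job_$(Cluster).log
   output     = out_$(Process).txt
   error      = err_$(Process).txt
   queue 99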

Thanks everyone!
Mike


On Fri, Dec 14, 2012 at 10:21 AM, Michael Aschenbeck <m.g.aschenbeck@xxxxxxxxx> wrote:
Thanks so much for your reply!

You are definitely right, the long submit time is the root of the problem, so I have focused on that.

I looked into your suggestions yesterday and this morning.  Unfortunately, my pool is not busy and I have more or less been the only one submitting jobs, so this problem is isolated to this single submission.  Further, I do not copy to spool.  I tried skipping filechecks, but the result was exactly the same long submit time.

One thing that I DID do was set SCHEDD_DEBUG = D_FULLDEBUG.  I think this showed me what the problem is, but I'm not sure what is causing it.  The following consecutive lines appear 99 times (once for each job):

12/14/12 09:44:29 FileLock::obtain(1) - @1355503469.370000 lock on \\xx.xx.29.28\randd\Test\tiger1\log_201212149748.txt now WRITE
12/14/12 09:44:29 FileLock::obtain(2) - @1355503469.371000 lock on \\xx.xx.29.28\randd\Test\tiger1\log_201212149748.txt now UNLOCKED
12/14/12 09:44:29 New job: 3283.91
12/14/12 09:44:29 Writing record to user logfile=\\xx.xx.29.28\randd\Test\tiger1\log_201212149748.txt owner=mike056
12/14/12 09:44:29 init_user_ids: want user 'mike056@GEOEYE', current is '(null)@(null)'
12/14/12 09:44:29 init_user_ids: Already have handle for mike056@GEOEYE, so returning.
12/14/12 09:44:29 TokenCache contents: 
mike056@GEOEYE
12/14/12 09:44:29 FileLock object is updating timestamp on: \\xx.xx.29.28\randd\Test\tiger1\log_201212149748.txt
12/14/12 09:44:34 FileLock::updateLockTime(): utime() failed 22(Invalid argument) on lock file \\xx.xx.29.28\randd\Test\tiger1\log_201212149748.txt. Not updating timestamp.

From the last two lines, it looks like it spends about 5 seconds trying to "update the timestamp" on the lock file and can't.  Since this happens 99 times, that comes out to roughly 495 seconds (99 x ~5 s), i.e. the 8-9 minutes of delay in question!  However, the weird thing is that, judging from the first two lines, it has no problem with permissions on the log file: the lock is obtained and released just fine.

The log file _is_ on a network drive; however, it is a fast drive (1 gigabit connection) and I should have full read/write permissions.  Any ideas on this?


On Thu, Dec 13, 2012 at 3:43 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 12/13/2012 4:03 PM, Michael Aschenbeck wrote:
I have a program that automatically makes a submit file, submits it, and processes the results.  Executing it once works fine and the output is as expected.  Recently I've been trying to add support for running 3+ instances of this program at once, and I'm running into errors when it submits the submit files.

Details are as follows.  There are 99 "queue" statements in each submit file.  Even when only submitting one of these files, it takes a very long time (on the order of 5 minutes) to go from a "submitting jobs" notification to the final output of "submitting jobs............ 99 jobs submitted to cluster x".

Before even going into the 3+ instances at once, I'd suggest getting to the bottom of why submitting 99 jobs from one instance of condor_submit takes 5 minutes.  On my Windows 7 laptop it takes a couple of seconds.

When you say it takes 5+ minutes, does it take this long submitting into an empty queue?  If it is taking this long submitting to a schedd that already has jobs queued, how many of those jobs are running, and for how long do they typically run?  Do your jobs run only for a couple of seconds?  One guess for the slowness: if you have many jobs completing every second, either the condor_schedd process or the disk file system holding the job queue could be swamped.  If this is the case, the best thing to do would be to partition your work so that you have slightly fewer jobs that run slightly longer, or you could try tweaks like setting the following in your condor_config:
  CONDOR_FSYNC = False
( see http://research.cs.wisc.edu/htcondor/manual/v7.9/3_3Configuration.html#16687 )
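
To sketch the "slightly fewer jobs that run slightly longer" idea, the submit file would take a shape something like this (the wrapper executable and chunk count below are made up, just to show the shape):

   # hypothetical: ~10 longer-running jobs, each processing a chunk of the work,
   # rather than ~100 jobs that each finish in seconds
   executable = process_chunk.bat
   arguments  = $(Process)
   queue 10
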
For really busy production submit points (we are talking thousands of simultaneously running jobs from one submit machine), some folks go with an SSD to hold the SPOOL directory, or at least the contents of the job queue.  See the condor_config knob "JOB_QUEUE_LOG" at

http://research.cs.wisc.edu/htcondor/manual/v7.9/3_3Configuration.html#16742
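
In condor_config that boils down to something like the following (the path is just an example; point it at wherever the SSD is mounted):

   # example only: keep the job queue transaction log on a fast local SSD
   JOB_QUEUE_LOG = D:\ssd\condor\spool\job_queue.log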

Another guess: perhaps you are telling HTCondor to spool your executable or input files, perhaps via the "-remote" or "-spool" option to condor_submit?  Spooling input files to a submit machine running on Windows can currently be slow, because input spooling on Windows blocks the scheduler from doing other work (like accepting new job submissions); note that on Linux things are much faster because the input spooling occurs in a child process.  Could you share your submit file, any command-line flags being passed to condor_submit, and the output of condor_version?
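
For reference, input spooling comes into play with invocations along these lines (the file and schedd names here are placeholders):

   condor_submit -spool my_jobs.sub
   condor_submit -remote some-schedd-name my_jobs.sub

A plain "condor_submit my_jobs.sub", with the executable and input files readable from the submit machine, avoids the spooling path.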

Also take a peek at the following condor_submit settings to speed up job submission (see http://research.cs.wisc.edu/htcondor/manual/v7.9/condor_submit.html) :

   skip_filechecks = True
   copy_to_spool = False  (this is the default on current versions of HTCondor, but if you are running an older release it could be an issue)
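
In the submit description file those two just go somewhere before the queue statement, e.g. (a sketch, names made up):

   # hypothetical minimal submit file with the two speed-related settings
   executable      = myprog.exe
   skip_filechecks = True
   copy_to_spool   = False
   queue 99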

hope the above helps,
Todd