
Re: [HTCondor-users] Delaying job starts for a cluster of jobs



> On Oct 6, 2017, at 11:14 AM, Mathieu Bahin <mathieu.bahin@xxxxxxxxxxxxxxx> wrote:
> 
> One of our users would like to run 100 jobs, ten at a time (limited via
> concurrency_limits), but when the cluster is submitted, the first 10
> jobs start running at the same time, and for 1 minute they are all
> loading heavy data and freeze the cluster.
> So we would like to start them only one per minute. Of course, since the
> first ten don't finish at exactly the same time, we don't experience
> this problem with the 90 remaining jobs.
> 
> I've read about "next_job_start_delay" but I can't make it work (and by
> the way, I can't find "MAX_NEXT_JOB_START_DELAY" in our config; is that
> a problem?). I've also read that "next_job_start_delay" is not used
> anymore.
> For now, the user managed to do it using the deferral mechanism, but
> it's not very elegant!

Setting next_job_start_delay in the submit description file should do what you need. One deceptive detail is that all 10 jobs will probably enter the running status in condor_q at the same time. HTCondor marks the jobs as running as soon as it allocates resources to them (which isn't affected by next_job_start_delay). But the actual start of execution will be delayed (visible via the execution event in the job event log).
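
For example, a submit description file along these lines should stagger the actual execution starts (a sketch only; the executable, limit name, and log file are placeholders, and it assumes the pool configuration defines HEAVYDATA_LIMIT = 10 for the concurrency limit):

    # Submit description file sketch; names and values are illustrative.
    executable           = load_heavy_data.sh
    # Allow at most 10 of these jobs to run at once (assumes
    # HEAVYDATA_LIMIT = 10 in the pool configuration).
    concurrency_limits   = heavydata
    # Delay each successive job's start of execution by 60 seconds.
    next_job_start_delay = 60
    log                  = job.log
    queue 100

With this, condor_q may show up to 10 jobs as running right away, but the execution events in job.log should be spaced a minute apart.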

The HTCondor manual does say "This command is no longer useful" about next_job_start_delay, but it's still supported. That statement applies to cases where next_job_start_delay is used to limit the number of jobs that transfer data from the submit machine at the same time. There are better ways to control that. If your jobs are loading heavy data from the submit machine via HTCondor's file transfer mechanism, you should look at the configuration parameters MAX_CONCURRENT_DOWNLOADS, MAX_CONCURRENT_UPLOADS, and FILE_TRANSFER_DISK_LOAD_THROTTLE.
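
For example (again a sketch; the values below are illustrative, not recommendations), the submit machine's configuration could include:

    # Submit-machine configuration sketch; values are illustrative.
    # Cap simultaneous transfers of job input files from this machine.
    MAX_CONCURRENT_UPLOADS = 10
    # Cap simultaneous transfers of job output files back to this machine.
    MAX_CONCURRENT_DOWNLOADS = 10
    # Hold off starting new transfers when the disk load on the
    # submit machine exceeds this threshold.
    FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0

These throttle HTCondor's file transfers directly, rather than delaying job starts.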

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project