
Re: [HTCondor-users] Delaying job starts for a cluster of jobs



Hi Jaime,

Thanks for the quick answer. I was confused by the jobs showing up
immediately in the "R" state, but it works. Well, actually it works most
of the time, though not perfectly every time (maybe because of the time
between negotiation cycles).
Anyway, it solves my problem.

Cheers,
Mathieu

On 09/10/17 19:00, htcondor-users-request@xxxxxxxxxxx wrote:
>
> Today's Topics:
>
>    1. Re: Delaying job starts for a cluster of jobs (Jaime Frey)
>    2. Re: condor_history JSON malformed with multiple files (Jaime Frey)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 09 Oct 2017 16:06:16 +0000
> From: Jaime Frey <jfrey@xxxxxxxxxxx>
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Delaying job starts for a cluster of
> 	jobs
> Message-ID: <CC946B3B-900B-4C7D-9835-067A2BFFBF44@xxxxxxxxxxx>
> Content-Type: text/plain; charset=utf-8
>
>> On Oct 6, 2017, at 11:14 AM, Mathieu Bahin <mathieu.bahin@xxxxxxxxxxxxxxx> wrote:
>>
>> One of our users would like to run 100 jobs, ten at a time (limited
>> via concurrency_limits), but when the cluster is submitted, the first
>> 10 jobs all start running at the same time and, for about a minute,
>> they load heavy data and freeze the cluster.
>> So we would like to start them only one per minute. Of course, since
>> the first ten don't finish at exactly the same time, we don't
>> experience this problem for the remaining 90.
>>
>> I've read about "next_job_start_delay" but I can't make it work (and
>> by the way, I can't find "MAX_NEXT_JOB_START_DELAY" in our config; is
>> that a problem?). I've also read that "next_job_start_delay" is not
>> used anymore.
>> For now, the user managed to do it using the deferral mechanism, but
>> it's not very elegant!
> Setting next_job_start_delay in the submit description file should do what you need. One deceptive detail is that all 10 jobs will probably enter the running status in condor_q at the same time. HTCondor marks the jobs as running as soon as it allocates resources to them (which isn't affected by next_job_start_delay). But the actual start of execution will be delayed (visible via the execution event in the job event log).
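>
> For example, a submit description along these lines (the executable and limit names here are only illustrative, not from the original report) should keep at most ten jobs running while staggering the actual execution starts by 60 seconds:
>
>     # the heavy_io limit value (e.g. HEAVY_IO_LIMIT = 10) is assumed to be defined in the pool configuration
>     concurrency_limits = heavy_io
>     # wait 60 seconds between starting execution of successive jobs
>     next_job_start_delay = 60
>     executable = load_and_run.sh
>     queue 100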
>
> The HTCondor manual does say "This command is no longer useful" about next_job_start_delay, but it's still supported. That statement applies to cases where next_job_start_delay is used to limit the number of jobs that transfer data from the submit machine at the same time. There are better ways to control that. If your jobs are loading heavy data from the submit machine via HTCondor's file transfer mechanism, you should look at the configuration parameters MAX_CONCURRENT_DOWNLOADS, MAX_CONCURRENT_UPLOADS, and FILE_TRANSFER_DISK_LOAD_THROTTLE.
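>
> A minimal configuration sketch for those schedd-side throttles, with illustrative values rather than recommendations:
>
>     # cap simultaneous transfers of input files from the submit machine to execute machines
>     MAX_CONCURRENT_UPLOADS = 10
>     # cap simultaneous transfers of output files back to the submit machine
>     MAX_CONCURRENT_DOWNLOADS = 10
>     # dynamically reduce transfer concurrency to keep transfer-induced disk load near this level
>     FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0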
>
> Thanks and regards,
> Jaime Frey
> UW-Madison HTCondor Project
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 09 Oct 2017 16:55:29 +0000
> From: Jaime Frey <jfrey@xxxxxxxxxxx>
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] condor_history JSON malformed with
> 	multiple files
> Message-ID: <D4410AAA-BCAD-4DA3-8A5F-78B1AA4FFEB7@xxxxxxxxxxx>
> Content-Type: text/plain; charset="utf-8"
>
> On Oct 5, 2017, at 8:55 AM, Fischer, Max (SCC) <max.fischer@xxxxxxx> wrote:
>
> I am seeing some spurious errors when dumping condor_history as JSON. After about 4000 of 26000 jobs, a list close/open is inserted [1].
> So the JSON is formatted as
> ..., {...}][, {...}, ...
> when it should be
> ..., {...}, {...}, ...
> with {...} being individual job data mappings.
> The attributes do not seem to matter: whether I request all attributes, one, or none, it breaks at the same number of jobs [2].
>
> It seems the error is that condor_history adds the list open/close delimiters for every file it reads [3], and we have multiple history files. If the files are read by the schedd via `-name $(hostname)`, the problem does not occur [4].
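>
> A quick way to check for this (the file name is illustrative; any JSON parser will do):
>
>     condor_history -json > history.json
>     python -m json.tool history.json > /dev/null && echo "valid JSON"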
>
> That's definitely not right. The problem appears to be exactly as you describe: ads from each history file are printed in their own list, with nothing between the list begin/end delimiters. The same problem happens with the -xml option. We will work on a fix for the next release.
> Thank you for the detailed report.
>
> Thanks and regards,
> Jaime Frey
> UW-Madison HTCondor Project
>
>
> ------------------------------
>
> End of HTCondor-users Digest, Vol 47, Issue 8
> *********************************************

-- 
---------------------------------------------------------------------------------------
| Mathieu Bahin
| IE CNRS
|
| Institut de Biologie de l'Ecole Normale Supérieure (IBENS)
| Biocomp team
| 46 rue d'Ulm
| 75230 PARIS CEDEX 05
| 01.44.32.23.56
---------------------------------------------------------------------------------------