
[HTCondor-users] parallel jobs and partitionable slots - jobs run very slowly



Hi!

I have a problem with partitionable slots and parallel jobs.
The jobs are so large that no two of them can run at the same time. They also have different requirements, so dynamic slots created for one job are rarely suitable for another.

To speed up the creation of dynamic slots, I set the following parameters:
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
CONSUMPTION_POLICY = TRUE

When the number of running and queued jobs is low, everything is fine: HTCondor creates dynamic slots, runs the jobs, and deletes the dynamic slots once they have been used.

But as the number of jobs in the queue grows, jobs start more and more slowly. There are long gaps (10-30 minutes or more) between one job finishing and the next one starting.
At some point HTCondor may stop running jobs completely, for hours. I see dynamic slots being created and claimed by different jobs, and then released after 10 minutes of inactivity.
No job can get the number of slots it needs to run.
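
The effect described above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not HTCondor's actual matchmaking logic: the pool is reduced to a single count of identical resource units, and the function names (allocate_round_robin, allocate_in_order, runnable) and the numbers are hypothetical. It shows how spreading claims over several large jobs can leave every job short, while serving the first job completely lets it start.

```python
def allocate_round_robin(total, needs):
    """Hand out one unit at a time, cycling over all jobs that still need more.

    Toy stand-in for "claims are spread over several jobs at once".
    """
    held = {job: 0 for job in needs}
    free = total
    while free > 0:
        progressed = False
        for job, need in needs.items():
            if free > 0 and held[job] < need:
                held[job] += 1
                free -= 1
                progressed = True
        if not progressed:  # every job is fully served
            break
    return held


def allocate_in_order(total, needs):
    """Serve jobs strictly in queue order; stop at the first job that does not fit."""
    held = {job: 0 for job in needs}
    free = total
    for job, need in needs.items():
        if need > free:
            break  # later jobs must wait, so they cannot grab partial claims
        held[job] = need
        free -= need
    return held


def runnable(held, needs):
    """A parallel job can start only when it holds everything it asked for."""
    return [job for job in needs if held[job] == needs[job]]


if __name__ == "__main__":
    pool = 10                    # hypothetical pool size
    queue = {"A": 8, "B": 7}     # hypothetical jobs; no two fit together
    spread = allocate_round_robin(pool, queue)
    serial = allocate_in_order(pool, queue)
    print(spread, runnable(spread, queue))  # {'A': 5, 'B': 5} []
    print(serial, runnable(serial, queue))  # {'A': 8, 'B': 0} ['A']
```

In the first case both jobs hold part of the pool and neither can start, which matches the idle claims that are later released. In the second case job A starts immediately.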

Is there any solution for this?
Is it possible to tell HTCondor not to match multiple jobs at a time, but instead to match all the slots for the first job, run it, and only then process the next job?

Thanks.


Best regards,
Stanislav V. Markevich