
Re: [HTCondor-users] Parallel Scheduling - Handling of claims when jobs are on hold or are removed before starting



On 08/13/2017 05:06 AM, Felix Wolfheimer wrote:
I recently noticed the following behavior when using the parallel universe. Suppose a parallel universe job has begun claiming resources but has not yet started running: for example, the job requests 5 machines/slots, only 4 are free and get claimed, and the job waits for a fifth slot to become available. If the job is then removed from the queue or put on hold (condor_rm, condor_hold), the claims on the four machines/slots remain indefinitely. In my case I waited several hours and the claims were still there, blocking resources for the non-existent job. The only way to get rid of them was to send a condor_reconfig command to the affected startds.

Thank you for your very descriptive bug report. We've now fixed this in 8.6, but not in time to make the upcoming release. As you point out, one workaround is to reconfig the affected startds; the other is to run very short parallel jobs to consume the claimed slots (perhaps even a one-core job).

-greg
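
The reconfig workaround described in this thread could be scripted roughly as follows. This is only a sketch under stated assumptions: the host names are hypothetical placeholders, the helper names `build_reconfig_commands` and `reconfig_startds` are invented for illustration, and the list of affected machines has to be supplied by hand. It builds one `condor_reconfig -daemon startd <host>` command per machine and prints the commands by default instead of running them.

```python
import subprocess
from typing import List


def build_reconfig_commands(hosts: List[str]) -> List[List[str]]:
    """Build one condor_reconfig command per host (hypothetical helper).

    Blank entries are skipped and duplicate hosts are dropped, keeping
    the order in which hosts first appear.
    """
    seen = set()
    commands = []
    for host in hosts:
        host = host.strip()
        if not host or host in seen:
            continue
        seen.add(host)
        # Target only the startd on that machine, as in the workaround above.
        commands.append(["condor_reconfig", "-daemon", "startd", host])
    return commands


def reconfig_startds(hosts: List[str], dry_run: bool = True) -> List[List[str]]:
    """Print the commands (dry run) or execute them one by one."""
    commands = build_reconfig_commands(hosts)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # Raises CalledProcessError if condor_reconfig exits non-zero.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical machines still holding claims for the removed job.
    reconfig_startds(["node01.example.org", "node02.example.org"])
```

Running the script as-is only prints the commands; passing `dry_run=False` would execute them, which assumes `condor_reconfig` is on the PATH and the caller has administrator access to the affected machines.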