[HTCondor-users] Parallel Scheduling - Handling of claims when jobs are on hold or are removed before starting


Date: Sun, 13 Aug 2017 12:06:33 +0200

From: Felix Wolfheimer <f.wolfheimer@xxxxxxxxxxxxxx>

Subject: [HTCondor-users] Parallel Scheduling - Handling of claims when jobs are on hold or are removed before starting

I recently noticed the following behavior when using the parallel universe. Suppose a parallel universe job has begun claiming resources but has not yet started, e.g., the job requests 5 machines/slots but only 4 are free, so those 4 get claimed and the job waits until a fifth slot becomes available. If the job is then removed from the queue or put on hold (condor_rm, condor_hold), the claims on the four machines/slots remain indefinitely (in my case I waited several hours and the claims were still there, blocking resources for the non-existent job). The only way to get rid of them was to send a condor_reconfig command to the affected startds.

The claims are released correctly when the parallel job has already started, i.e., when a shadow already exists for it. This is reproducible at least in HTCondor 8.4.7, which I'm currently using.
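
For anyone who wants to script the workaround, here is a rough Python sketch, not a tested fix. It assumes the leftover claims show up as Claimed/Idle slots in condor_status; I have not verified that this is how they appear in every version. The function names and the dry_run flag are made up for illustration. Note that a parallel job that is legitimately still waiting for its last slot also holds Claimed/Idle slots, so check the machine list before reconfiguring anything.

```python
import subprocess

# Assumption: stale claims left by a removed/held parallel job appear as
# slots in state Claimed with activity Idle.
CLAIMED_IDLE = 'State == "Claimed" && Activity == "Idle"'


def status_command(constraint=CLAIMED_IDLE):
    """Build a condor_status call that prints one machine name per matching slot."""
    return ["condor_status", "-constraint", constraint, "-af", "Machine"]


def parse_machines(output):
    """Turn condor_status output into a sorted list of unique machine names."""
    return sorted({line.strip() for line in output.splitlines() if line.strip()})


def reconfig_commands(machines):
    """Build one condor_reconfig call per machine, aimed at its startd."""
    return [["condor_reconfig", "-daemon", "startd", m] for m in machines]


def release_stale_claims(dry_run=True):
    """Look up Claimed/Idle machines and reconfigure their startds.

    With dry_run=True the commands are only returned, not executed.
    """
    result = subprocess.run(
        status_command(), capture_output=True, text=True, check=True
    )
    commands = reconfig_commands(parse_machines(result.stdout))
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling release_stale_claims() with the default dry_run=True only returns the condor_reconfig commands it would run, so the list can be checked against jobs that are still legitimately waiting.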
