
[HTCondor-users] Parallel Scheduling - Handling of claims when jobs are on hold or are removed before starting



I recently noticed the following behavior when using the parallel universe. Suppose a parallel universe job has begun claiming resources but has not yet started, e.g., the job requests 5 machines/slots, only 4 are free and get claimed, and the job waits for a fifth slot to become available. If the job is then removed from the queue or put on hold (condor_rm, condor_hold), the claims on the four machines/slots remain indefinitely (in my case I waited several hours and the claims were still there, blocking resources for the non-existent job). The only way to get rid of them was to send a condor_reconfig command to the affected startds.

The claims are released correctly when the parallel job has already started, i.e., when a shadow already exists for it. This is reproducible at least in condor 8.4.7, which I'm currently using.
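
For anyone who wants to check whether their pool is affected, here is a minimal Python sketch for spotting slots that are claimed but idle. It is only a heuristic, not an official diagnostic: a slot can legitimately be Claimed/Idle for a short time, so a hit only matters if it persists after the job is gone. The function names (find_idle_claims, query_pool, report) are made up for illustration, and the sketch assumes condor_status is on the PATH and that its autoformat output (-af) gives one whitespace-separated line per slot.

```python
import subprocess


def find_idle_claims(status_text):
    """Return (slot_name, remote_owner) pairs for slots that are Claimed but Idle.

    Expects one slot per line in the form: Name State Activity [RemoteOwner]
    """
    stuck = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, state, activity = fields[:3]
        owner = fields[3] if len(fields) > 3 else "undefined"
        if state == "Claimed" and activity == "Idle":
            stuck.append((name, owner))
    return stuck


def query_pool():
    """Ask the collector for slot states (assumes condor_status is on PATH)."""
    result = subprocess.run(
        ["condor_status", "-af", "Name", "State", "Activity", "RemoteOwner"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report():
    """Print candidate leftover claims and the workaround command for each host."""
    for name, owner in find_idle_claims(query_pool()):
        host = name.split("@")[-1]  # "slot1@node01" -> "node01"
        print(f"{name} claimed by {owner}; workaround: condor_reconfig {host}")
```

Calling report() on a machine with the HTCondor command-line tools installed lists each candidate slot together with the condor_reconfig command that cleared the claim in my case. The parsing is kept separate from the condor_status call so it can be tried on saved output first.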