[HTCondor-users] Parallel schedd starts two jobs on the same slot.



Hello,

We have an HTCondor v8.9.11 cluster that starts parallel tasks on dynamic slots via the dedicated scheduler.

Sometimes the schedd crashes when it tries to release the claim for an already deleted match record. I traced this to the DedicatedScheduler::createAllocations function and found that the schedd sometimes uses the match record of an already running job as a slot for a new job. This happens because the match record state is changed from M_ACTIVE to M_CLAIMED here: https://github.com/htcondor/htcondor/blob/master/src/condor_schedd.V6/schedd.cpp#L7795 . If I forbid changing the M_ACTIVE state, the schedd does not crash. But it seems to me that I am hiding the real source of the problem rather than fixing it. Can anyone advise where else I could look to trace this bug?
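
To make the workaround concrete, here is a minimal, self-contained C++ sketch of the idea. It is not HTCondor code: MatchState, MatchRecord, markClaimed and isFreeForAllocation are hypothetical, simplified stand-ins for the schedd's match record and its M_CLAIMED / M_ACTIVE states, and the assumption that only claimed records are treated as free slots is mine, for illustration only.

```cpp
#include <string>

// Hypothetical, simplified stand-in for the match record states.
// Claimed ~ M_CLAIMED (slot claimed, no job running on it),
// Active  ~ M_ACTIVE  (a job is currently running on the slot).
enum class MatchState { Unclaimed, Claimed, Active };

// Hypothetical, simplified stand-in for a match record.
struct MatchRecord {
    std::string slot_name;
    MatchState state = MatchState::Unclaimed;
};

// The workaround: refuse to move a record that is already Active
// back to Claimed. Returns true if the transition was applied.
bool markClaimed(MatchRecord& rec) {
    if (rec.state == MatchState::Active) {
        // A job is running on this slot; leave the state untouched.
        return false;
    }
    rec.state = MatchState::Claimed;
    return true;
}

// Assumption for this sketch: only Claimed records count as free
// slots when allocating resources for a new parallel job.
bool isFreeForAllocation(const MatchRecord& rec) {
    return rec.state == MatchState::Claimed;
}
```

With this guard, a record that is Active stays Active when markClaimed is called on it, so it is never offered as a free slot to a second job. It does not explain why the transition is requested for an active record in the first place, which is the part I am asking about.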

----------
Sergey Komissarov
Senior Software Developer
DATADVANCE