
Re: [HTCondor-users] Parallel schedd starts two jobs on the same slot.



Hello,

I traced the life cycle of the problematic match record inside the schedd; this is what happened to the match_rec object:

2021-11-26T16:32:08.814273712Z condor_schedd[128]: MATCH CREATED 0x557f1ca91ff0
2021-11-26T16:32:08.814277098Z condor_schedd[128]: Scheduler::enqueueStartdContact MATCH 0x557f1ca91ff0
2021-11-26T16:32:08.816390685Z condor_schedd[128]: #1    condor_schedd: Scheduler::enqueueStartdContact(ContactStartdArgs*)
2021-11-26T16:32:08.816616732Z condor_schedd[128]: #2    condor_schedd: DedicatedScheddNegotiate::scheduler_handleMatch(PROC_ID, char const*, char const*, classad::ClassAd&, char const*)
2021-11-26T16:32:08.816622643Z condor_schedd[128]: #3    condor_schedd: ScheddNegotiate::messageReceived(DCMessenger*, Sock*)
2021-11-26T16:32:08.816626069Z condor_schedd[128]: #4    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMsg::callMessageReceived(DCMessenger*, Sock*)
2021-11-26T16:32:08.816636429Z condor_schedd[128]: #5    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMessenger::readMsg(classy_counted_ptr<DCMsg>, Sock*)
2021-11-26T16:32:08.816640667Z condor_schedd[128]: #6    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMessenger::receiveMsgCallback(Stream*)
2021-11-26T16:32:08.816652269Z condor_schedd[128]: #7    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DaemonCore::CallSocketHandler_worker(int, bool, Stream*)
2021-11-26T16:32:08.816670393Z condor_schedd[128]: #8    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DaemonCore::CallSocketHandler_worker_demarshall(void*)
2021-11-26T16:32:08.816674661Z condor_schedd[128]: #9    /usr/sbin/../lib/libcondor_utils_8_9_11.so: CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)
2021-11-26T16:32:08.816677827Z condor_schedd[128]: #10    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DaemonCore::CallSocketHandler(int&, bool)
2021-11-26T16:32:08.828848666Z condor_schedd[128]: match_rec::setStatus MATCH 0x557f1ca91ff0 M_UNCLAIMED -> M_STARTD_CONTACT_LIMBO
2021-11-26T16:32:08.829251446Z condor_schedd[128]: #1    condor_schedd: match_rec::setStatus(int)
2021-11-26T16:32:08.829265503Z condor_schedd[128]: #2    condor_schedd: Scheduler::contactStartd(ContactStartdArgs*)
2021-11-26T16:32:08.829269200Z condor_schedd[128]: #3    condor_schedd: Scheduler::checkContactQueue()
2021-11-26T16:32:08.829272175Z condor_schedd[128]: #4    /usr/sbin/../lib/libcondor_utils_8_9_11.so: TimerManager::Timeout(int*, double*)
2021-11-26T16:32:08.961606335Z condor_schedd[128]: match_rec::setStatus 0x557f1ca91ff0 M_STARTD_CONTACT_LIMBO -> M_CLAIMED
2021-11-26T16:32:08.976796901Z condor_schedd[128]: DedicatedScheduler::createAllocations PROC 11.7 MATCH 0x557f1ca91ff0 SLOT slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2021-11-26T16:32:09.247252838Z condor_schedd[128]: match_rec::setStatus MATCH 0x557f1ca91ff0 M_CLAIMED -> M_ACTIVE
2021-11-26T16:32:09.248605159Z condor_schedd[128]: Adding submitter DedicatedScheduler@parallel_schedd@submit.pseven-htcondor to the submitter map for default pool.
2021-11-26T16:32:09.256647951Z condor_schedd[128]: Negotiation ended - 14 jobs matched
2021-11-26T16:32:09.256875280Z condor_schedd[128]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 14 matched, 40 rejected

There is nothing in the log between these lines; it seems that match_rec::setStatus is called right after the previous callback finishes.
I cannot figure out who is responsible for calling this callback. It seems to be a message from the negotiator, but I have no proof. Note that Scheduler::enqueueStartdContact is not called again for this match.
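
For reference, the stack traces above come from instrumentation I added around match_rec::setStatus(); roughly the helper below (a sketch of my local patch, not stock HTCondor code; the real version also demangles the symbol names):

#include <execinfo.h>   // backtrace(), backtrace_symbols()
#include <cstdlib>      // free()

// Write the current call stack into the schedd log so it shows who
// requested the state transition.  Frame 0 (this helper) is skipped.
// dprintf()/D_ALWAYS are HTCondor's usual logging facility.
static void log_caller_stack()
{
    void *frames[16];
    int depth = backtrace( frames, 16 );
    char **syms = backtrace_symbols( frames, depth );
    if( syms == NULL ) {
        return;
    }
    for( int i = 1; i < depth; i++ ) {
        dprintf( D_ALWAYS, "#%d    %s\n", i, syms[i] );
    }
    free( syms );
}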

2021-11-26T16:32:09.257273140Z condor_schedd[128]: match_rec::setStatus MATCH 0x557f1ca91ff0 M_ACTIVE -> M_CLAIMED
2021-11-26T16:32:09.261622555Z condor_schedd[128]: #1    condor_schedd: match_rec::setStatus(int)
2021-11-26T16:32:09.261730078Z condor_schedd[128]: #2    condor_schedd: Scheduler::claimedStartd(DCMsgCallback*)
2021-11-26T16:32:09.261816792Z condor_schedd[128]: #3    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMsg::doCallback()
2021-11-26T16:32:09.263667723Z condor_schedd[128]: #4    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMsg::callMessageReceived(DCMessenger*, Sock*)
2021-11-26T16:32:09.267708717Z condor_schedd[128]: #5    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMessenger::readMsg(classy_counted_ptr<DCMsg>, Sock*)
2021-11-26T16:32:09.267724757Z condor_schedd[128]: #6    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DCMessenger::receiveMsgCallback(Stream*)
2021-11-26T16:32:09.267729035Z condor_schedd[128]: #7    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DaemonCore::CallSocketHandler_worker(int, bool, Stream*)
2021-11-26T16:32:09.267732391Z condor_schedd[128]: #8    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DaemonCore::CallSocketHandler_worker_demarshall(void*)
2021-11-26T16:32:09.267812062Z condor_schedd[128]: #9    /usr/sbin/../lib/libcondor_utils_8_9_11.so: CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)
2021-11-26T16:32:09.267819956Z condor_schedd[128]: #10    /usr/sbin/../lib/libcondor_utils_8_9_11.so: DaemonCore::CallSocketHandler(int&, bool)

A match in the M_CLAIMED state is available for new jobs, so later the schedd assigns it to another job:

2021-11-26T16:32:22.231323105Z condor_schedd[128]: DedicatedScheduler::createAllocations PROC 10.3 MATCH 0x557f1ca91ff0 SLOT slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2021-11-26T16:32:22.565267148Z condor_schedd[128]: match_rec::setStatus MATCH 0x557f1ca91ff0 M_CLAIMED -> M_ACTIVE

The execute node refuses to start another job on the busy slot.

2021-11-26T16:32:24.133959205Z condor_schedd[128]: DedicatedScheduler::shutdownMpiJob PROC 10.3 KILL MATCH 0x557f1ca91ff0
2021-11-26T16:32:24.133966438Z condor_schedd[128]: DedicatedScheduler::DelMrec PROC 10.3 DESTROY MATCH 0x557f1ca91ff0 SLOT slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

And now the first job using this match finishes, and the schedd tries to deactivate the claim on the already-deleted match object:

2021-11-26T16:35:07.155269602Z condor_shadow[660]: RemoteResource::killStarter(): Could not send command to startd
2021-11-26T16:35:07.155310289Z condor_schedd[128]: Got RELEASE_CLAIM from <10.42.0.211:42265>
2021-11-26T16:35:07.155316030Z condor_schedd[128]: DedicatedScheduler::DelMrec PROC 11.0 DESTROY MATCH 0x557f1c9eafb0 SLOT slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2021-11-26T16:35:07.184664624Z condor_shadow[660]: Job 11.0 terminated: exited with status 0
2021-11-26T16:35:07.184680945Z condor_shadow[660]: Reporting job exit reason 100 and attempting to fetch new job.
2021-11-26T16:35:07.184684972Z condor_shadow[660]: **** condor_shadow (condor_SHADOW) pid 660 EXITING WITH STATUS 100
2021-11-26T16:35:07.184688038Z condor_schedd[128]: In DedicatedScheduler::reaper pid 660 has status 25600
2021-11-26T16:35:07.184690954Z condor_schedd[128]: Shadow pid 660 exited with status 100
2021-11-26T16:35:07.187010088Z condor_schedd[128]: DedicatedScheduler::shutdownMpiJob PROC 11.7 KILL MATCH 0x557f1ca91ff0
2021-11-26T16:35:07.247719964Z condor_master[86]: DefaultReaper unexpectedly called on pid 128, status 139.
2021-11-26T16:35:07.247751083Z condor_master[86]: The SCHEDD (pid 128) died due to signal 11 (Segmentation fault)

Core was generated by `condor_schedd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f348cf5455b in kill () at ../sysdeps/unix/syscall-template.S:78
78      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f348cf5455b in kill () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f348dc97e1f in unix_sig_coredump (s_info=<optimized out>, signum=11) at /HTCondor-src/src/condor_daemon_core.V6/daemon_core_main.cpp:1334
#2  unix_sig_coredump (signum=11, s_info=<optimized out>) at /HTCondor-src/src/condor_daemon_core.V6/daemon_core_main.cpp:1254
#3  <signal handler called>
#4  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:96
#5  0x00007f348cfb0503 in __GI___strdup (s=s@entry=0x7177646e7a2d3962 <error: Cannot access memory at address 0x7177646e7a2d3962>) at strdup.c:41
#6  0x00007f348dc282b4 in ReliSock::connect (non_blocking_flag=false, port=0, host=0x7177646e7a2d3962 <error: Cannot access memory at address 0x7177646e7a2d3962>, 
    this=0x7ffec8dc2870) at /HTCondor-src/src/condor_io/reli_sock.cpp:277
#7  ReliSock::connect (this=this@entry=0x7ffec8dc2870, host=0x7177646e7a2d3962 <error: Cannot access memory at address 0x7177646e7a2d3962>, port=port@entry=0, 
    non_blocking_flag=non_blocking_flag@entry=false) at /HTCondor-src/src/condor_io/reli_sock.cpp:264
#8  0x0000557f1b07ac59 in DedicatedScheduler::deactivateClaim (this=0x557f1b1a2d80 <dedicated_scheduler>, m_rec=0x557f1ca91ff0)
    at /HTCondor-src/src/condor_schedd.V6/dedicated_scheduler.cpp:985
#9  0x0000557f1b07b339 in DedicatedScheduler::shutdownMpiJob (this=0x557f1b1a2d80 <dedicated_scheduler>, srec=0x557f1ca586e0, kill=false)
    at /HTCondor-src/src/condor_schedd.V6/dedicated_scheduler.cpp:3380
#10 0x0000557f1b07b5e1 in DedicatedScheduler::reaper (this=0x557f1b1a2d80 <dedicated_scheduler>, pid=660, status=<optimized out>)
    at /HTCondor-src/src/condor_schedd.V6/dedicated_scheduler.cpp:1212
#11 0x00007f348dc80cf9 in DaemonCore::CallReaper (this=0x557f1c505bb0, reaper_id=<optimized out>, whatexited=<optimized out>, pid=660, exit_status=25600)
    at /HTCondor-src/src/condor_daemon_core.V6/daemon_core.cpp:9763
#12 0x00007f348dc83c09 in DaemonCore::HandleProcessExit (this=0x557f1c505bb0, pid=<optimized out>, exit_status=25600) at /HTCondor-src/src/condor_daemon_core.V6/daemon_core.cpp:9862
#13 0x00007f348dc83d1e in DaemonCore::HandleDC_SERVICEWAITPIDS (this=0x557f1c505bb0) at /HTCondor-src/src/condor_daemon_core.V6/daemon_core.cpp:9424
#14 0x00007f348dc8421d in DaemonCore::Driver (this=0x557f1c505bb0) at /HTCondor-src/src/condor_daemon_core.V6/daemon_core.cpp:3444
#15 0x00007f348dc9c286 in dc_main (argc=<optimized out>, argv=<optimized out>) at /HTCondor-src/src/condor_daemon_core.V6/daemon_core_main.cpp:4352
#16 0x00007f348cf350b3 in __libc_start_main (main=0x557f1b064f60 <main(int, char**)>, argc=1, argv=0x7ffec8dc3758, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffec8dc3748) at ../csu/libc-start.c:308
#17 0x0000557f1b06505e in _start () at /usr/include/c++/9/ext/new_allocator.h:89
(gdb) f 8
#8  0x0000557f1b07ac59 in DedicatedScheduler::deactivateClaim (this=0x557f1b1a2d80 <dedicated_scheduler>, m_rec=0x557f1ca91ff0)
    at /HTCondor-src/src/condor_schedd.V6/dedicated_scheduler.cpp:985
985             if(sock.connect(m_rec->peer, 0) ) {

What could be the reason for the match state changing back to M_CLAIMED? I have full logs from this run and I can add traces to any condor_* binary, but at the moment I have no idea where to start.
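
One more trace I am considering (a rough debugging aid, not existing HTCondor code) is to keep a registry of live match_rec pointers, so that a stale pointer like the one in frame #8 above is caught at the call site instead of deep inside ReliSock::connect():

#include <set>

// Debugging sketch: registry of live match_rec pointers.
static std::set<const void*> live_mrecs;

// Call where "MATCH CREATED" is logged (match_rec construction).
static void note_mrec_created( const void *m )   { live_mrecs.insert( m ); }

// Call from DedicatedScheduler::DelMrec() / match_rec destruction.
static void note_mrec_destroyed( const void *m ) { live_mrecs.erase( m ); }

// Check at the top of deactivateClaim() and shutdownMpiJob() before the
// pointer is dereferenced; log loudly if it is already gone.
static bool mrec_is_live( const void *m )
{
    return live_mrecs.count( m ) != 0;
}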

----------
Sergey Komissarov
Senior Software Developer
DATADVANCE

This message may contain confidential information
constituting a trade secret of DATADVANCE. Any distribution,
use or copying of the information contained in this
message is ineligible except under the internal
regulations of DATADVANCE and may entail liability in
accordance with the current legislation of the Russian
Federation. If you have received this message by mistake
please immediately inform me of it. Thank you!

----- Original Message -----
From: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "sergey komissarov" <sergey.komissarov@xxxxxxxxxxxxxx>
Sent: Monday, November 22, 2021 6:58:40 PM
Subject: [HTCondor-users] Parallel schedd starts two jobs on the same slot.

Hello,

We have an HTCondor v8.9.11 cluster which starts parallel tasks on dynamic slots via the dedicated scheduler.

Sometimes the schedd crashes when it tries to release a claim for an already-deleted match record. I managed to trace this to the DedicatedScheduler::createAllocations function and found that sometimes the schedd uses the match record of an already-running job as the slot for a new job. This happens because the match record state is changed from M_ACTIVE to M_CLAIMED here: https://github.com/htcondor/htcondor/blob/master/src/condor_schedd.V6/schedd.cpp#L7795 . If I forbid the change out of the M_ACTIVE state, the schedd does not crash, but it seems to me that this hides the real source of the problem instead of fixing it. Can anyone advise where else I can look to trace this bug?
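
For completeness, the change I tried was essentially a guard at the top of match_rec::setStatus(), along these lines (a rough sketch of the experiment, not the exact patch and not a proposed fix):

// Refuse to downgrade an active match back to M_CLAIMED, so that
// DedicatedScheduler::createAllocations() will not hand it to a second job.
void match_rec::setStatus( int new_status )
{
    if( status == M_ACTIVE && new_status == M_CLAIMED ) {
        dprintf( D_ALWAYS,
                 "setStatus: ignoring M_ACTIVE -> M_CLAIMED for match %p\n",
                 (void*)this );
        return;
    }
    status = new_status;
    // ... rest of the original function unchanged ...
}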

----------
Sergey Komissarov
Senior Software Developer
DATADVANCE
