
Re: [HTCondor-users] condor_schedd fails after some time



This appears to be a logic error in the condor_schedd. It's attempting to create two data structures for a single parallel job in a table that should only have one entry per job. To complicate matters, I see there's a bug in one of the log messages that we could use to figure out what's going wrong.

My quick inspection of the code didn't turn up any obvious ways to trigger the double-entry problem.
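
To make the failure mode concrete, here is a minimal sketch of what a double entry looks like. It is written in Python for brevity and is not the actual HTCondor code: the AllocationTable class, the create_allocations helper, and the 0 / -1 return convention are assumptions made for illustration, chosen only to mirror the asserted condition (insert(cluster, alloc) == 0). The job IDs are the two reported in the log quoted below, which share cluster 6.

```python
class AllocationTable:
    """Hypothetical stand-in for a table keyed by cluster ID, one entry per job."""

    def __init__(self):
        self._entries = {}

    def insert(self, cluster, alloc):
        """Return 0 on success, -1 if the cluster already has an entry (assumed convention)."""
        if cluster in self._entries:
            return -1
        self._entries[cluster] = alloc
        return 0


def create_allocations(table, job_ids):
    """Insert one allocation per job ID; return the job IDs whose insert was rejected."""
    rejected = []
    for job_id in job_ids:
        cluster, proc = (int(part) for part in job_id.split("."))
        if table.insert(cluster, {"first_proc": proc}) != 0:
            # In the schedd, this is the point where the assertion would fire.
            rejected.append(job_id)
    return rejected


table = AllocationTable()
rejected = create_allocations(table, ["6.0", "6.53"])
print(rejected)
```

In this sketch the second insert is rejected because cluster 6 already has an entry. If the real table behaves similarly, an assertion that the insert returned 0 would abort the daemon at that point, which would be consistent with the log.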

This is happening while the condor_schedd is attempting to reconnect to running parallel jobs after a restart. Are you seeing this happen more than once?

 - Jaime

On Oct 17, 2021, at 2:18 PM, Dmitry A. Golubkov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Dear all,

I have a problem with my cluster: condor_schedd fails after some time with the following error in the log:

2021-10-17T13:52:30.814107888Z condor_schedd[12521]: DedicatedScheduler creating Allocations for reconnected job (6.0)
2021-10-17T13:52:30.896151617Z condor_schedd[12521]: DedicatedScheduler creating Allocations for reconnected job (6.53)
2021-10-17T13:52:30.896566762Z condor_schedd[12521]: ERROR "Assertion ERROR on (allocations->insert( cluster, alloc ) == 0)" at line 2929 in file /var/lib/condor/execute/slot1/dir_26614/userdir/.tmpdakAr8/condor-8.9.11/src/condor_schedd.V6/dedicated_scheduler.cpp
2021-10-17T13:52:30.898919572Z condor_schedd[12521]: Cron: Killing all jobs
2021-10-17T13:52:30.898943994Z condor_schedd[12521]: CronJobList: Deleting all jobs
2021-10-17T13:52:30.975443327Z condor_schedd[12521]: Cron: Killing all jobs
2021-10-17T13:52:30.975483659Z condor_schedd[12521]: CronJobList: Deleting all jobs
2021-10-17T13:52:30.975494422Z condor_master[1048]: DefaultReaper unexpectedly called on pid 12521, status 1024.
2021-10-17T13:52:30.975498252Z condor_master[1048]: The SCHEDD (pid 12521) exited with status 4


Any ideas about what is causing the problem?


Dmitry A. Golubkov
DATADVANCE
Mob. +7 910 4400124
dmitry.golubkov@xxxxxxxxxxxxxx