
Re: [HTCondor-users] condor_schedd fails after some time



Dear Jaime,

And after being restarted by condor_master, condor_schedd now fails immediately on startup:

condor_schedd[3098]: DedicatedScheduler creating Allocations for reconnected job (19.0)
condor_schedd[3098]: Dedicated Scheduler:: couldn't find machine slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx to reconnect to
condor_schedd[3098]: Dedicated Scheduler:: couldn't find machine slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx to reconnect to
condor_schedd[3098]: DedicatedScheduler creating Allocations for reconnected job (19.2)
condor_schedd[3098]: Allocation for job 19.0, nprocs: 3
condor_schedd[3098]: Allocation for job 19.0, nprocs: 3
condor_schedd[3098]: DBG: DedicatedScheduler::spawnJobs call.
condor_schedd[3098]: ERROR "spawnJobs(): allocation node has no matches!" at line 2111 in file /htcondor/src/condor_schedd.V6/dedicated_scheduler.cpp
condor_schedd[3098]: Cron: Killing all jobs
condor_schedd[3098]: CronJobList: Deleting all jobs
condor_schedd[3098]: Cron: Killing all jobs
condor_schedd[3098]: CronJobList: Deleting all jobs
condor_master[823]: DefaultReaper unexpectedly called on pid 3098, status 1024.

Best regards,
Dmitry.


From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "Jaime Frey" <jfrey@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, October 27, 2021 12:28:01 AM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

Dear Jaime,
Today I got one more ASSERT:
condor_schedd[1406]: Dedicated Scheduler:: couldn't find machine slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx to reconnect to
condor_schedd[1406]: SECMAN: removing lingering non-negotiated security session <10.244.3.115:45651?addrs=10.244.3.115-45651&alias=pseven-htcondorexecute-deploy-78459764d7-2jtth.pseven-htcondor&noUDP&sock=startd_822_b758>#1635282438#102 because it conflicts with new request
condor_schedd[1406]: ERROR "Assertion ERROR on (all_matches->insert(host, mrec) == 0)" at line 4396 in file /htcondor/src/condor_schedd.V6/dedicated_scheduler.cpp
condor_schedd[1406]: Cron: Killing all jobs
condor_schedd[1406]: CronJobList: Deleting all jobs
condor_schedd[1406]: Cron: Killing all jobs
condor_schedd[1406]: CronJobList: Deleting all jobs
condor_master[823]: DefaultReaper unexpectedly called on pid 1406, status 1024.
condor_master[823]: The SCHEDD (pid 1406) exited with status 4

;( Can I remove this ASSERT to make it work, or is that a bad idea?
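For context, the failing check appears to be a table insert that returns nonzero when the key is already present. A minimal sketch of that failure mode in Python (illustrative only; this is not HTCondor's actual HashTable code, and the slot name is a placeholder):

```python
# Sketch of the pattern behind the ASSERT: an insert that reports
# failure (-1) when the key already exists, which the dedicated
# scheduler then treats as a fatal condition.
def insert_once(table, key, value):
    """Mimic a table insert returning 0 on success, -1 on duplicate key."""
    if key in table:
        return -1
    table[key] = value
    return 0

matches = {}
first = insert_once(matches, "slot1_3@execute-host", "mrec-a")
second = insert_once(matches, "slot1_3@execute-host", "mrec-b")  # duplicate key
```

Removing the ASSERT only hides the duplicate entry; the table is still in an inconsistent state afterwards.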

Dmitry.



From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "Jaime Frey" <jfrey@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Saturday, October 23, 2021 12:16:08 PM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

Dear Jaime,
In the log I see the following:
condor_schedd[129]: WARNING: claim id not found for new dynamic slot slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -- ignoring this resource
condor_schedd[129]: WARNING: claim id not found for new dynamic slot slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -- ignoring this resource

Dmitry.


From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "Jaime Frey" <jfrey@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Saturday, October 23, 2021 12:02:10 PM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

Dear Jaime,
Another interesting observation: I am testing a job that contains three sub-jobs. In my test, one sub-job was scheduled to the first executor and the other two to the second executor. I killed the first executor, and it disappeared from the condor_status list after about half an hour (why? how can this timeout be configured?), but the two slots claimed on the second executor remained claimed, and they are still claimed now. I have waited about 8 hours.

P.S.: I have access to the htcondor source and can apply any patches to make tests, JFYI.

Best regards,
Dmitry.


From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "Jaime Frey" <jfrey@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Saturday, October 23, 2021 4:41:49 AM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

Dear Jaime,
I have removed the ASSERT from the code and set STARTD_CONTACT_TIMEOUT = 10, but the killed executor stays in the condor_status list for a long time (more than ten minutes) and all jobs get stuck. Jobs keep moving from RUN to IDLE and from IDLE to RUN with no success. The nonexistent executor stays claimed and has a slot in the Busy state. New jobs also get stuck after the executor is killed. Any ideas how to configure or fix HTCondor so it detects nonexistent executors? And yes, after removing the ASSERT (I know this is not a real solution), the schedd keeps running after the executor is killed.
Best regards,
Dmitry.


From: "Jaime Frey" <jfrey@xxxxxxxxxxx>
To: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Cc: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Saturday, October 23, 2021 12:03:23 AM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

The full log was very helpful for me to determine what's going wrong. The Assertion ERROR is a logic bug in the schedd, and I haven't found the source yet. But it's not the initial failure. The failure was triggered by the death of one of your execute machines, as you describe.

When the execute machine was killed, the condor_shadow for each job running there exited, indicating it can't talk with the condor_starter on the execute machine. The schedd then tried to gracefully deactivate the claim of each node of the job. For job nodes that were running on the killed execute machine, the schedd's network connection blocked until it timed out after 45 seconds. For a parallel job with a lot of nodes, this can take a while. This graceful shutdown of claims is done serially, during which time the schedd doesn't do any of its other tasks, including sending messages to the condor_master that it's alive and functional. It appears you have set NOT_RESPONDING_TIMEOUT=1200 in your configuration, which means that the condor_master will kill the condor_schedd if it doesn't send any alive messages for 20 minutes. This happened three times in the logs that you provided.
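To make the arithmetic concrete (the 45-second per-claim figure and the 1200-second timeout are from the description above; the rest is simple division):

```python
# How many serially-blocked claim deactivations it takes to exceed
# NOT_RESPONDING_TIMEOUT=1200 when each dead-host connection blocks 45 s.
timeout_per_claim = 45         # seconds each blocked connection waits
not_responding_timeout = 1200  # seconds before condor_master kills the schedd
claims_to_exceed = not_responding_timeout // timeout_per_claim + 1
```

So a parallel job with a few dozen nodes on a dead machine is already enough to starve the alive messages.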

When the condor_schedd restarts, it attempts to reconnect to the condor_starters of any running jobs. After the third time the condor_master killed the condor_schedd, this reconnection process stumbled over the Assertion ERROR, and it did so repeatedly.

To avoid the initial killing of the condor_schedd, I suggest the following changes to your configuration:
* Remove the setting of NOT_RESPONDING_TIMEOUT. This will restore the default timeout of one hour, which will give the schedd more time to do its graceful shutdown of claims after an execute machine goes away.
* Set STARTD_CONTACT_TIMEOUT=10. This will shorten the timeout on network connections when an execute machine has been killed, so the schedd's graceful shutdown of claims can proceed more quickly.
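Put together, the suggested configuration change would look like this (a sketch; only the two knobs discussed above):

```
# Remove any explicit NOT_RESPONDING_TIMEOUT so the default (1 hour) applies:
#   NOT_RESPONDING_TIMEOUT = 1200   <- delete this line

# Shorten connection timeouts to dead execute machines:
STARTD_CONTACT_TIMEOUT = 10
```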

If network connections to a killed execute machine failed immediately, instead of hanging indefinitely, this problem would go away without any HTCondor configuration changes.
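The hanging-connection behavior can be illustrated outside HTCondor with a plain TCP connect in Python (illustrative only; the address is a placeholder, and 9618 just happens to be HTCondor's usual port). Without an application-level timeout, a connect to a dead or blackholed host blocks until the OS gives up, which is the delay that STARTD_CONTACT_TIMEOUT bounds for the schedd:

```python
import socket

def try_connect(host, port, timeout_s):
    """Attempt a TCP connect, giving up after timeout_s seconds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout_s)  # without this, the OS default (often 1-2 minutes) applies
    try:
        s.connect((host, port))
        return "connected"
    except OSError:  # covers timeouts and unreachable-host errors alike
        return "failed"
    finally:
        s.close()
```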

I will spend a little more time looking at the Assertion ERROR, since that is a bug in HTCondor that we should fix.

 - Jaime

On Oct 21, 2021, at 12:39 PM, Dmitry A. Golubkov <dmitry.golubkov@xxxxxxxxxxxxxx> wrote:

Dear Jaime,

From my point of view (I am not an expert) the Assertion ERROR is the crashing reason.

Dmitry.


From: "Jaime Frey" <jfrey@xxxxxxxxxxx>
To: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Cc: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, October 21, 2021 7:13:12 PM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

For this error to happen, the condor_schedd must have previously restarted while some parallel jobs were running. Killing one of the execute instances wouldn't be enough by itself to trigger this failure. Can you search your SchedLog for any signs of the schedd exiting or crashing without this error ('ERROR "Assertion ERROR on (allocations->insert( cluster, alloc ) == 0)"')?

 - Jaime

On Oct 18, 2021, at 2:35 PM, Dmitry A. Golubkov <dmitry.golubkov@xxxxxxxxxxxxxx> wrote:

Dear Jaime,

> Are you seeing this happen more than once?

It happens every time after an htcondor_execute instance fails. In my test configuration, I am using one htcondor_submit instance and several htcondor_execute instances. I run some jobs and wait until they are running, then I (or my orchestrator) kill one of the htcondor_execute instances; after some time (1-5 min), condor_schedd fails with the error below.
JFYI: Condor version is 8.9.11-1.2

Best regards,
Dmitry.


From: "Jaime Frey" <jfrey@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Sent: Monday, October 18, 2021 9:28:26 PM
Subject: Re: [HTCondor-users] condor_schedd fails after some time

This appears to be a logic error in the condor_schedd. It's attempting to create two data structures for a single parallel job in a table that should only have one entry per job. To complicate matters, I see there's a bug in one of the log messages that we could use to figure out what's going wrong.

My quick inspection of the code didn't turn up any obvious ways to trigger the double-entry problem.

This is happening while the condor_schedd is attempting to reconnect to running parallel jobs after a restart. Are you seeing this happen more than once?

 - Jaime

On Oct 17, 2021, at 2:18 PM, Dmitry A. Golubkov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Dear all,

I have a problem with my cluster: condor_schedd fails after some time with the following error in the log:

2021-10-17T13:52:30.814107888Z condor_schedd[12521]: DedicatedScheduler creating Allocations for reconnected job (6.0)
2021-10-17T13:52:30.896151617Z condor_schedd[12521]: DedicatedScheduler creating Allocations for reconnected job (6.53)
2021-10-17T13:52:30.896566762Z condor_schedd[12521]: ERROR "Assertion ERROR on (allocations->insert( cluster, alloc ) == 0)" at line 2929 in file /var/lib/condor/execute/slot1/dir_26614/userdir/.tmpdakAr8/condor-8.9.11/src/condor_schedd.V6/dedicated_scheduler.cpp
2021-10-17T13:52:30.898919572Z condor_schedd[12521]: Cron: Killing all jobs
2021-10-17T13:52:30.898943994Z condor_schedd[12521]: CronJobList: Deleting all jobs
2021-10-17T13:52:30.975443327Z condor_schedd[12521]: Cron: Killing all jobs
2021-10-17T13:52:30.975483659Z condor_schedd[12521]: CronJobList: Deleting all jobs
2021-10-17T13:52:30.975494422Z condor_master[1048]: DefaultReaper unexpectedly called on pid 12521, status 1024.
2021-10-17T13:52:30.975498252Z condor_master[1048]: The SCHEDD (pid 12521) exited with status 4


Any ideas about the cause of the problem?


Dmitry A. Golubkov
DATADVANCE
Mob. +7 910 4400124
dmitry.golubkov@xxxxxxxxxxxxxx
This message may contain confidential information
constituting a trade secret of DATADVANCE. Any distribution,
use or copying of the information contained in this
message is ineligible except under the internal
regulations of DATADVANCE and may entail liability in
accordance with the current legislation of the Russian
Federation. If you have received this message by mistake
please immediately inform me of it. Thank you!
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/






