
[HTCondor-users] Local Universe Jobs Cause the Schedd Daemon Process to Crash



Hi All,

 

We’ve been experiencing an issue with our Schedd daemon. When a local universe job is submitted, the Schedd daemon process crashes and restarts. Reviewing the logs at the D_ALL debug level, we see the following entries:
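(For anyone trying to reproduce this: the extra detail below comes from running the schedd at a verbose debug level. A config snippet along these lines should produce similar output; the log size limit is just an example, not what we necessarily run:)

    # Verbose schedd logging (illustrative values).
    # SCHEDD_DEBUG and MAX_SCHEDD_LOG are standard HTCondor knobs.
    SCHEDD_DEBUG = D_ALL:2
    MAX_SCHEDD_LOG = 100 Mb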

 

 

SchedLog (Trimmed for readability):

06/12/20 09:03:36 (fd:20) (pid:29417) (cid:73921) (D_AUDIT) Submitting new job 95516.0
06/12/20 09:03:36 (fd:20) (pid:29417) (D_ALWAYS:2) New job: 95516.0
06/12/20 09:03:36 (fd:21) (pid:29417) (D_ALWAYS:2) New job: 95516.0, Duplicate Keys: 2, Total Keys: 3
06/12/20 09:03:36 (fd:18) (pid:29417) (D_ALWAYS:2) Found idle local universe job 95516.0
06/12/20 09:03:36 (fd:18) (pid:29417) (D_ALWAYS:2) Queueing job 95516.0 in runnable job queue
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) Job prep for 95516.0 will not block, calling aboutToSpawnJobHandler() directly
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) aboutToSpawnJobHandler() completed for job 95516.0, attempting to spawn job handler
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) Starting local universe job 95516.0
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) About to spawn /opt/condor/sbin/condor_starter condor_starter -f -job-cluster 95516 -job-proc 0 -header (95516.0)' ' -job-input-ad - -schedd-addr <10.110.130.149:32804>
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) Cleared dirty attributes for job 95516.0
06/12/20 09:03:37 (fd:22) (pid:29417) (D_ALWAYS) Starting add_shadow_birthdate(95516.0)
06/12/20 09:03:37 (fd:24) (pid:13181) (D_DAEMONCORE) Create_Process: Arg: condor_starter -f -job-cluster 95516 -job-proc 0 -header (95516.0)' ' -job-input-ad - -schedd-addr <10.110.130.149:32804>
06/12/20 09:03:37 (fd:24) (pid:13181) (D_PROCFAMILY) About to register family for PID 13181 with the ProcD
06/12/20 09:03:37 (fd:24) (pid:13181) (D_ALWAYS) Result of "register_subfamily" operation from ProcD: ERROR: A family with the given root PID is already registered
06/12/20 09:03:37 (fd:24) (pid:13181) (D_ALWAYS) Create_Process: error registering family for pid 13181
06/12/20 09:03:37 (fd:25) (pid:29417) (D_ALWAYS) Create_Process(/opt/condor/sbin/condor_starter): child failed because it failed to register itself with the ProcD
06/12/20 09:03:37 (fd:22) (pid:29417) (D_ALWAYS|D_FAILURE) spawnJobHandlerRaw: CreateProcess(/opt/condor/sbin/condor_starter, condor_starter -f -job-cluster 95516 -job-proc 0 -header (95516.0)' ' -job-input-ad - -schedd-addr <10.110.130.149:32804>) failed
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS|D_FAILURE) Can't spawn local starter for job 95516.0
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) Prioritized runnable job list will be rebuilt, because ClassAd attribute JobStatus=1 changed
06/12/20 09:03:37 (fd:18) (pid:29417) (D_ALWAYS:2) Marked job 95516.0 as IDLE
...
06/12/20 09:03:45 (fd:18) (pid:29417) (D_COMMAND) Calling Timer handler 39569 (start_job)
06/12/20 09:03:45 (fd:18) (pid:29417) (D_ALWAYS:2) Job prep for 95516.0 will not block, calling aboutToSpawnJobHandler() directly
06/12/20 09:03:45 (fd:18) (pid:29417) (D_ALWAYS:2) aboutToSpawnJobHandler() completed for job 95516.0, attempting to spawn job handler
06/12/20 09:03:45 (fd:18) (pid:29417) (D_ALWAYS:2) Starting local universe job 95516.0
06/12/20 09:03:45 (fd:18) (pid:29417) (D_ALWAYS:2) About to spawn /opt/condor/sbin/condor_starter condor_starter -f -job-cluster 95516 -job-proc 0 -header (95516.0)' ' -job-input-ad - -schedd-addr <10.110.130.149:32804>
06/12/20 09:03:45 (fd:18) (pid:29417) (D_ALWAYS:2) Cleared dirty attributes for job 95516.0
06/12/20 09:03:45 (fd:18) (pid:29417) (D_DAEMONCORE) Entering Create_Pipe()
06/12/20 09:03:45 (fd:18) (pid:29417) (D_DAEMONCORE) Entering Create_Named_Pipe()
06/12/20 09:03:45 (fd:24) (pid:29417) (D_DAEMONCORE) Create_Pipe() success read_handle=65536 write_handle=65537
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS|D_FAILURE) ERROR "Assertion ERROR on (shadowsByProcID->insert(new_rec->job_id, new_rec) == 0)" at line 10864 in file /project/condor-8.8.9/src/condor_schedd.V6/schedd.cpp
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS:2) ScheddCronJobMgr: Bye
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS) Cron: Killing all jobs
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS) CronJobList: Deleting all jobs
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS:2) CronJobMgr: bye
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS) Cron: Killing all jobs
06/12/20 09:03:45 (fd:24) (pid:29417) (D_ALWAYS) CronJobList: Deleting all jobs
06/12/20 09:03:45 (fd:24) (pid:29417) (D_DAEMONCORE) Cancel_Socket: cancelled socket 4 <<NULL>> 0x221c260

MasterLog:

06/12/20 09:03:45 (fd:12) (pid:4712) (D_COMMAND) DaemonCore: pid 29417 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
06/12/20 09:03:45 (fd:12) (pid:4712) (D_ALWAYS) DefaultReaper unexpectedly called on pid 29417, status 1024.
06/12/20 09:03:45 (fd:12) (pid:4712) (D_ALWAYS|D_FAILURE) The SCHEDD (pid 29417) exited with status 4
06/12/20 09:03:45 (fd:12) (pid:4712) (D_PROCFAMILY) About to kill family with root process 29417 using the ProcD
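(Side note: since the first thing to go wrong is the ProcD's "register_subfamily" error, the ProcD's own log might show more. PROCD_LOG is the standard knob for that; the path here is just an example:)

    # Illustrative: give the ProcD its own log file so its side of the
    # register_subfamily failure is visible.
    PROCD_LOG = $(LOG)/ProcLog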

 

 

Even a local universe job that simply prints "Hello world" will cause the Schedd daemon process to crash. Once the Schedd daemon recovers, the local universe job that was submitted completes successfully. The issue appears to be specific to the local universe; jobs in all other universes run without issue. Our reading of the log above is that the first spawn attempt fails because the starter cannot register its process family with the ProcD ("A family with the given root PID is already registered"), and when the schedd retries eight seconds later it dies on the shadowsByProcID assertion at schedd.cpp line 10864, presumably because a record for job 95516.0 was left behind by the failed first attempt.
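For concreteness, a submit description along these lines is all it takes (file names are illustrative):

    # Minimal local universe "Hello world" reproducer (illustrative names).
    universe    = local
    executable  = /bin/echo
    arguments   = "Hello world"
    output      = hello.out
    error       = hello.err
    log         = hello.log
    queue

submitted with condor_submit hello.sub.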

 

We first encountered this issue with HTCondor 8.8.5 and upgraded our cluster to 8.8.9, but the issue persisted.
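For reference, the running versions can be confirmed with something like:

    # Version of the local HTCondor binaries
    condor_version

    # CondorVersion advertised by each schedd in the pool
    condor_status -schedd -autoformat Name CondorVersion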

 

We are hoping someone can shed some light on why we are seeing these crashes.

 

Thanks.

 

-Kevin Heinold



