
Re: [HTCondor-users] Schedd possibly spinning on a job



The schedd state is stored in $(SPOOL)/job_queue.log – shutting down the schedd and editing this file by hand to excise the problem job looks like it would be a bit tricky and error-prone, however.
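Before editing anything, a read-only look at how many records in that file mention the job may help gauge how risky it is to remove. Below is a minimal sketch, not a supported tool. It assumes each record in job_queue.log is a single whitespace-separated line with an opcode first and the job key (for example 6092282.0) second; that layout is an assumption here and should be checked against the actual file. The function name is made up for illustration, and it is best run against a copy of the file with the schedd stopped.

```python
def find_job_records(lines, job_key):
    """Return (line_number, text) for every record whose second field is job_key.

    ASSUMPTION: records are one per line, fields are whitespace-separated,
    and the job key is the second field. Nothing is modified.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split(None, 2)  # opcode, key, rest (if present)
        if len(fields) >= 2 and fields[1] == job_key:
            hits.append((lineno, line.rstrip("\n")))
    return hits


def find_job_records_in_file(path, job_key):
    """Same scan, reading from a file path (ideally a copy of the log)."""
    with open(path, errors="replace") as fh:
        return find_job_records(fh, job_key)
```

Calling find_job_records_in_file on a copy of the log with "6092282.0" lists the matching lines and their positions without changing anything.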

 

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Larne Pekowsky via HTCondor-users
Sent: Thursday, May 16, 2019 2:35 PM
To: 'John M Knoeller' <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: [External] Re: [HTCondor-users] Schedd possibly spinning on a job

 

Hi tj,

 

I didn’t know about condor_sos, thanks!  Even with -timeoutmult 10 it didn’t work though.  Whatever the schedd is doing it isn’t listening to anyone.

 

Cheers,

 

- Larne

 

 

From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Thursday, May 16, 2019 2:11 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: RE: Schedd possibly spinning on a job

 

Did you try using condor_sos before the condor_rm command?

 

D_ALL will definitely make the problem worse by the way.  It’s insanely chatty.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Larne Pekowsky via HTCondor-users
Sent: Thursday, May 16, 2019 12:36 PM
To: 'htcondor-users@xxxxxxxxxxx' <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: [HTCondor-users] Schedd possibly spinning on a job

 

Hi all,

 

Our schedd has been pegged at 100% CPU for several hours and immediately returns to that state on restart.  At D_FULLDEBUG the log floods with the message

 

   05/16/19 12:50:58 satisfyJobs: finding resources for 6092282.0

 

so it almost looks like the schedd is stuck in a loop on this job.  I’d like to remove it to see if that fixes the problem, but of course with the schedd running at 100% condor_rm can’t get through.  Any suggestions?  Also, is there any way to get more detailed information on what’s happening?  D_ALL didn’t seem to have anything useful.
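To check whether it really is only this one job, the messages can be tallied per job ID. This is a minimal sketch that assumes every relevant log line has the same shape as the one quoted above. The function names are made up for illustration, and the log path is whatever your schedd log happens to be.

```python
import re
from collections import Counter

# Matches lines shaped like the quoted message; other log formats are ignored.
SATISFY_RE = re.compile(r"satisfyJobs: finding resources for (\d+\.\d+)")


def count_satisfy_jobs(lines):
    """Count how many times each job ID appears in satisfyJobs messages."""
    counts = Counter()
    for line in lines:
        match = SATISFY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_satisfy_jobs_in_file(path):
    """Same tally, reading the schedd log from a file path."""
    with open(path, errors="replace") as fh:
        return count_satisfy_jobs(fh)
```

If one job ID accounts for nearly all of the matches, that supports the idea that the schedd is looping on that job rather than cycling through the whole queue.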

 

Thanks,

 

- Larne