
Re: [HTCondor-users] Schedd possibly spinning on a job



Thanks for the followup!

How large is large in your case?  This is likely still something we'll want to fix since nobody wants their schedd taken down.


Cheers,
-zach


On 5/16/19, 2:44 PM, "HTCondor-users on behalf of Larne Pekowsky via HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx on behalf of htcondor-users@xxxxxxxxxxx> wrote:

    Hi all,
     
    Just to close this out in case anyone is curious, the problem originated because this is a parallel universe job and the user inadvertently set the machine count to a very large number.
     
    Cheers,
     
                                                                                    - Larne
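A mistake like this could in principle be caught before submission. The following is only a rough sketch of such a check, not anything HTCondor provides: the function name and the limit of 64 are hypothetical, and it only looks at literal integer values of machine_count in the submit description text (macros are skipped rather than expanded).

```python
import re

# Matches lines such as "machine_count = 8"; submit commands are case-insensitive.
MACHINE_COUNT_RE = re.compile(r"^\s*machine_count\s*=\s*(\S+)", re.IGNORECASE)


def oversized_machine_counts(submit_text, limit):
    """Return every literal machine_count value in submit_text above limit.

    Hypothetical helper for illustration; the limit is chosen by the caller.
    """
    too_large = []
    for line in submit_text.splitlines():
        match = MACHINE_COUNT_RE.match(line)
        if not match:
            continue
        try:
            count = int(match.group(1))
        except ValueError:
            continue  # macro or expression; this sketch does not expand it
        if count > limit:
            too_large.append(count)
    return too_large


if __name__ == "__main__":
    example = "universe = parallel\nmachine_count = 100000\nqueue\n"
    print(oversized_machine_counts(example, 64))
```

A wrapper around condor_submit could refuse to submit when the returned list is non-empty; a sensible limit depends on the size of the pool.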
     
     
    From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
    On Behalf Of Larne Pekowsky via HTCondor-users
    Sent: Thursday, May 16, 2019 3:02 PM
    To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
    Cc: Larne Pekowsky <lppekows@xxxxxxx>
    Subject: Re: [HTCondor-users] Schedd possibly spinning on a job
    
    
     
    Hi Michael,
     
    Thanks!  With nothing else to lose I backed up the file and did
     
    grep -v 6092282 ~/job_queue.log > job_queue.log
     
    then restarted and that fixed it.  Now we just need to figure out what it is about this job that caused this...
     
    Cheers,
     
                                                                                                    - Larne
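The grep above drops every line that contains the digits 6092282 anywhere, which worked here but would also remove, say, entries for a job 16092282.0 or an attribute value that happens to contain that number. A narrower filter might look like the sketch below. It rests on an assumption that is not confirmed in this thread: that each entry in job_queue.log is a line whose second whitespace-separated field is a key of the form cluster.proc, with the cluster-level ad possibly written with a leading zero and proc -1. The function names are made up for illustration. As in the message above, it should only be run against a backup copy with the schedd shut down.

```python
import re

# Assumed key shape: "<cluster>.<proc>", where the cluster-level ad may appear
# with a leading zero and proc -1 (e.g. "06092282.-1"). Not verified here.
KEY_RE = re.compile(r"^0*(\d+)\.(-?\d+)$")


def drop_cluster(lines, cluster_id):
    """Return the log lines whose key field does not refer to cluster_id."""
    kept = []
    for line in lines:
        fields = line.split(None, 2)
        # Lines with fewer than two fields carry no key and are always kept.
        match = KEY_RE.match(fields[1]) if len(fields) >= 2 else None
        if match and int(match.group(1)) == cluster_id:
            continue  # entry belongs to the problem job; leave it out
        kept.append(line)
    return kept


def rewrite_log(backup_path, output_path, cluster_id):
    """Read the backed-up log and write a filtered copy to output_path."""
    with open(backup_path, "r") as src:
        kept = drop_cluster(src.readlines(), cluster_id)
    with open(output_path, "w") as dst:
        dst.writelines(kept)
```

For the case in this thread the call would be something like rewrite_log(backup, "job_queue.log", 6092282), where backup is the path of the saved copy.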
     
    From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
    On Behalf Of Michael Pelletier
    Sent: Thursday, May 16, 2019 2:43 PM
    To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
    Subject: Re: [HTCondor-users] Schedd possibly spinning on a job
    
    
     
    The schedd state is stored in $(SPOOL)/job_queue.log. Shutting down the schedd and editing this file by hand to excise the problem job looks like it would be a bit tricky and error-prone, however.
     
    Michael V. Pelletier
    Information Technology
    Digital Transformation & Innovation
    Integrated Defense Systems
    Raytheon Company
    
     
    From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
    On Behalf Of Larne Pekowsky via HTCondor-users
    Sent: Thursday, May 16, 2019 2:35 PM
    To: 'John M Knoeller' <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
    Cc: Larne Pekowsky <lppekows@xxxxxxx>
    Subject: [External] Re: [HTCondor-users] Schedd possibly spinning on a job
    
    
     
    Hi tj,
     
    I didn't know about condor_sos, thanks!  Even with -timeoutmult 10 it didn't work though.  Whatever the schedd is doing it isn't listening to anyone.
     
    Cheers,
     
                                                                                    - Larne
     
     
    From: John M Knoeller <johnkn@xxxxxxxxxxx>
    
    Sent: Thursday, May 16, 2019 2:11 PM
    To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
    Cc: Larne Pekowsky <lppekows@xxxxxxx>
    Subject: RE: Schedd possibly spinning on a job
    
    
     
    Did you try using condor_sos before the condor_rm command?
     
    D_ALL will definitely make the problem worse by the way.  It's insanely chatty.
     
    -tj
     
    From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
    On Behalf Of Larne Pekowsky via HTCondor-users
    Sent: Thursday, May 16, 2019 12:36 PM
    To: 'htcondor-users@xxxxxxxxxxx' <htcondor-users@xxxxxxxxxxx>
    Cc: Larne Pekowsky <lppekows@xxxxxxx>
    Subject: [HTCondor-users] Schedd possibly spinning on a job
    
    
     
    Hi all,
     
    Our schedd has been pegged at 100% cpu for several hours and immediately returns to that state on restart.  At D_FULLDEBUG the log floods with the message
     
       05/16/19 12:50:58 satisfyJobs: finding resources for 6092282.0
     
    so it almost looks like the schedd is stuck in a loop on this job.  I'd like to remove it to see if that fixes the problem, but of course with the schedd running at 100% condor_rm can't get through.  Any suggestions?  Also, is there any way to get more detailed information on what's happening?  D_ALL didn't seem to have anything useful.
     
    Thanks,
     
                                                                                                    - Larne