Re: [HTCondor-users] Error from SCHEDD on Windows 10
- Date: Mon, 29 Jul 2019 16:33:56 -0700
- From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
Just to follow up on that: if I have USE POLICY : DESKTOP in the
condor_config, would the simplest way to make all jobs that would have
been suspended go straight to vacate/kill be to put
WANT_SUSPEND=FALSE in the condor_config.local file?
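For reference, a minimal sketch of the override being asked about. It
assumes condor_config.local is read after condor_config, so that a later
assignment replaces the value set by the policy template. Whether this
alone sends jobs straight to vacate/kill depends on how the DESKTOP
policy defines PREEMPT, which is not confirmed in this thread.

```
# condor_config (already present, as described above)
USE POLICY : DESKTOP

# condor_config.local (assumed to be read after condor_config)
# Proposed override: never suspend jobs. Untested here; the resulting
# vacate/kill behavior depends on the PREEMPT expression that the
# DESKTOP policy sets.
WANT_SUSPEND = FALSE
```

After editing the file, running condor_reconfig on the machine and then
condor_config_val WANT_SUSPEND should show which value is in effect.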
On Mon, Jul 29, 2019 at 4:16 PM Andrew Cunningham
> Hi John,
> Sadly, I restarted condor on the machine in question so I can't go
> back and check the state of the process. But it was 'running' and
> using 0% CPU, so it appeared to be in the suspended state.
> It's not a problem for me if the jobs just get killed/evicted and
> rescheduled. I have been using the
> USE POLICY : DESKTOP
> which obviously is not implementing that "POLICY" you are suggesting.
> On Mon, Jul 29, 2019 at 2:49 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> > Is the child process actually stuck in the suspended state? The Sysinternals Process Explorer can show you if it is.
> > We mostly discourage the use of SUSPEND in HTCondor these days because if the process does get around
> > to vacating, that generates a bunch of new activity and that tends to annoy users. It's better to just kill the
> > job right away when we detect user activity. Would that work for you?
> > -tj
> > -----Original Message-----
> > From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> > Sent: Monday, July 29, 2019 4:38 PM
> > To: John M Knoeller <johnkn@xxxxxxxxxxx>
> > Cc: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> > Hi John,
> > Thanks for the input. I have a fast disk and nothing much was
> > happening on the machine at the time.
> > Regarding the so-called zombie processes: I managed to get to one of
> > the machines, and the problem seems to happen in the following
> > situation:
> > - The main executable of the job launches a sub-process to perform a
> > task (vanilla universe)
> > - The job is suspended due to user activity on the machine. This would
> > require suspending the main process and the sub-process
> > - The job was continued (according to the StarterLog)
> > - However it appears that the sub-process did not "continue". So at
> > this point Condor sees the job in the running state, but it will never
> > finish as the main process is just waiting on the sub-process to
> > complete.
> > Obviously condor has to keep track of sub-processes so it can "continue" them.
> > On Mon, Jul 29, 2019 at 2:01 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> > >
> > > Given that you have at most 200 jobs running at a time, I would not expect the lock used for the ShadowLog to result in long-term starvation
> > > of a Shadow, but that seems to be what is happening.
> > > The drive your log files are being written to might be in the process of failing, or perhaps you have some other process that is keeping the disk very busy?
> > >
> > > You could try moving your log directory to an SSD to speed things up.
> > >
> > > -tj
> > >
> > > -----Original Message-----
> > > From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> > > Sent: Monday, July 29, 2019 3:55 PM
> > > To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> > > Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> > > Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> > >
> > > "The message about "waiting for a lock" is indeed a problem. What
> > > do you have SHADOW_DEBUG and/or ALL_DEBUG set to ?
> > >
> > > is 153.79 one of the zombie jobs? or is that a message about a job
> > > that is/was actually running?"
> > >
> > > All options for CONDOR are at the default. No special DEBUG options
> > > on. And yes, the zombie jobs included 153.79.