Re: [HTCondor-users] Error from SCHEDD on Windows 10
- Date: Mon, 29 Jul 2019 16:16:00 -0700
- From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
Sadly, I restarted condor on the machine in question so I can't go
back and check the state of the process. But it was 'running' and
using 0% CPU, so it appeared to be in the suspended state.
It's not a problem for me if the jobs just get killed/evicted and
rescheduled. I have been using the
USE POLICY : DESKTOP
which obviously does not implement the policy you are suggesting.
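
To illustrate the sequence described in my earlier message quoted below
(job suspended, then "continued", but the sub-process never resumes),
here is a minimal Python model. It is only a sketch: the names Proc,
suspend_tree, continue_root_only and continue_tree are hypothetical and
are not HTCondor code. It assumes the fault is that the continue step
reaches only the main process, which is my guess from the StarterLog and
not something I have confirmed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proc:
    """Hypothetical stand-in for an OS process (not an HTCondor type)."""
    name: str
    suspended: bool = False
    children: List["Proc"] = field(default_factory=list)


def walk(proc):
    """Yield proc and every descendant, parent first."""
    yield proc
    for child in proc.children:
        yield from walk(child)


def suspend_tree(root):
    """Suspend the main process and all of its sub-processes."""
    for p in walk(root):
        p.suspended = True


def continue_root_only(root):
    """Assumed faulty resume: only the main process is continued."""
    root.suspended = False


def continue_tree(root):
    """Resume that tracks sub-processes and continues each of them."""
    for p in walk(root):
        p.suspended = False


def still_suspended(root):
    """Names of processes in the tree that are still suspended."""
    return [p.name for p in walk(root) if p.suspended]


job = Proc("main", children=[Proc("sub-process")])

suspend_tree(job)
continue_root_only(job)
stuck = still_suspended(job)
print(stuck)  # ['sub-process']: main runs but waits on it forever

continue_tree(job)
print(still_suspended(job))  # []
```

In the first case the job looks "running" from the outside while the
main process idles at 0% CPU, which matches what I saw.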
On Mon, Jul 29, 2019 at 2:49 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> Is the child process actually stuck in the suspended state? The Sysinternals Process Explorer can show you if it is.
> We mostly discourage the use of SUSPEND in HTCondor these days because if the process does get around
> to vacating, that generates a bunch of new activity and that tends to annoy users. It's better to just kill the
> job right away when we detect user activity. Would that work for you?
> -----Original Message-----
> From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> Sent: Monday, July 29, 2019 4:38 PM
> To: John M Knoeller <johnkn@xxxxxxxxxxx>
> Cc: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> Hi John,
> Thanks for the input. I have a fast disk and nothing much was
> happening on the machine at the time.
> Regarding the so-called zombie processes: I managed to get to one of
> the machines and the problem seems to be happening in the following
> - The main executable of the job launches a sub-process to perform a
> task (vanilla universe)
> - The job is suspended due to user activity on the machine. This would
> require suspending the main process and the sub-process
> - The job was continued (according to the StarterLog)
> - However it appears that the sub-process did not "continue". So at
> this point Condor sees the job in the running state, but it will never
> finish as the main process is just waiting on the sub-process to
> Obviously condor has to keep track of sub-processes so it can "continue" them.
> On Mon, Jul 29, 2019 at 2:01 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> > Given that you have at most 200 jobs running at a time, I would not expect the lock used for the ShadowLog to result in long term starvation
> > of a Shadow, but that seems to be what is happening.
> > The drive your log files are being written to might be in the process of failing, or perhaps you have some other process that is keeping the disk very busy?
> > You could try moving your log directory to an SSD to speed things up.
> > -tj
> > -----Original Message-----
> > From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> > Sent: Monday, July 29, 2019 3:55 PM
> > To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> > Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> > "The message about "waiting for a lock" is indeed a problem. What
> > do you have SHADOW_DEBUG and/or ALL_DEBUG set to?
> > is 153.79 one of the zombie jobs? or is that a message about a job
> > that is/was actually running?"
> > All options for CONDOR are at the default. No special DEBUG options
> > on. And yes, the zombie jobs included 153.79.