
Re: [HTCondor-users] Error from SCHEDD on Windows 10

Hi John,
Sadly, I restarted condor on the machine in question so I can't go
back and check the state of the process. But it was 'running' and
using 0% CPU, so it appeared to be in the suspended state.
It's not a problem for me if the jobs just get killed/evicted and
rescheduled. I have been using the
which obviously is not implementing that "POLICY" you are suggesting.
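
For reference, a rough sketch of what a "kill rather than suspend" policy
could look like in the local config file on the execute machines. The macro
and attribute names (WANT_SUSPEND, PREEMPT, WANT_VACATE, KeyboardIdle) are
standard HTCondor startd policy knobs, but they are not quoted from this
thread, and the 60-second threshold is an arbitrary placeholder, so treat
this as an assumption to be checked against the manual rather than a tested
configuration.

```
# Sketch only; the threshold below is a placeholder, not a recommendation.

# Never put a running job into the Suspended state.
WANT_SUSPEND = False

# Preempt the job as soon as keyboard/mouse activity is seen
# (KeyboardIdle is the number of seconds since the last console input).
PREEMPT = (KeyboardIdle < 60)

# On preemption, skip the graceful vacate step and kill the job immediately.
WANT_VACATE = False
```

A condor_reconfig on the execute machine would be needed for the change to
take effect.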


On Mon, Jul 29, 2019 at 2:49 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> Is the child process actually stuck in a suspended state?  The Sysinternals Process Explorer can show you if it is.
> We mostly discourage the use of SUSPEND in HTCondor these days because if the process does get around
> to vacating, that generates a bunch of new activity and that tends to annoy users.   It's better to just kill the
> job right away when we detect user activity.   Would that work for you?
> -tj
> -----Original Message-----
> From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> Sent: Monday, July 29, 2019 4:38 PM
> To: John M Knoeller <johnkn@xxxxxxxxxxx>
> Cc: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> Hi John,
> Thanks for the input. I have a fast disk and nothing much was
> happening on the machine at the time.
> Regarding the so-called zombie processes: I managed to get to one of
> the machines and the problem seems to be happening in the following
> situation:
> - The main executable of the job launches a sub-process to perform a
> task (vanilla universe)
> - The job is suspended due to user activity on the machine. This would
> require suspending the main process and the sub-process
> - The job was continued (according to the StarterLog)
> - However it appears that the sub-process did not "continue". So at
> this point Condor sees the job in the running state, but it will never
> finish as the main process is just waiting on the sub-process to
> complete.
> Obviously condor has to keep track of sub-processes so it can "continue" them.
> On Mon, Jul 29, 2019 at 2:01 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> >
> > Given that you have at most 200 jobs running at a time, I would not expect the lock used for the ShadowLog to result in long-term starvation
> > of a Shadow, but that seems to be what is happening.
> > The drive your log files are being written to might be in the process of failing, or perhaps you have some other process that is keeping the disk very busy?
> >
> > You could try moving your log directory to an SSD to speed things up.
> >
> > -tj
> >
> > -----Original Message-----
> > From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> > Sent: Monday, July 29, 2019 3:55 PM
> > To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> > Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> >
> > "The message about "waiting for a lock"  is indeed a problem.   What
> > do you have SHADOW_DEBUG and/or ALL_DEBUG set to ?
> >
> > is 153.79 one of the zombie jobs?  or is that a message about a job
> > that is/was actually running?"
> >
> > All options for CONDOR are at the default. No special DEBUG options
> > on. And yes, the zombie jobs included 153.79.