Re: [HTCondor-users] Error from SCHEDD on Windows 10
- Date: Mon, 29 Jul 2019 14:38:00 -0700
- From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
Thanks for the input. I have a fast disk and nothing much was
happening on the machine at the time.
Regarding the so-called zombie processes: I managed to get to one of
the machines, and the problem seems to happen as follows:
- The main executable of the job launches a sub-process to perform a
task (vanilla universe)
- The job is suspended due to user activity on the machine. This would
require suspending the main process and the sub-process
- The job was continued (according to the StarterLog)
- However it appears that the sub-process did not "continue". So at
this point Condor sees the job in the running state, but it will never
finish, because the main process is just waiting on the sub-process to finish.
Obviously condor has to keep track of sub-processes so it can "continue" them.
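
To make the suspected sequence concrete, here is a small toy model in Python. It is only a sketch of the scenario described above, not HTCondor's actual process-tracking code: the `Proc` class and the `suspend_tree`, `continue_root_only`, `continue_tree` and `stuck` helpers are hypothetical names invented for this illustration, and no real processes are touched.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Toy stand-in for one process in a job's process tree (hypothetical)."""
    name: str
    suspended: bool = False
    children: list = field(default_factory=list)


def walk(proc):
    """Yield proc and then every descendant, depth first."""
    yield proc
    for child in proc.children:
        yield from walk(child)


def suspend_tree(root):
    """Suspend the main process and all of its sub-processes."""
    for p in walk(root):
        p.suspended = True


def continue_root_only(root):
    """Assumed faulty behaviour: only the main process is resumed."""
    root.suspended = False


def continue_tree(root):
    """Expected behaviour: every tracked process is resumed."""
    for p in walk(root):
        p.suspended = False


def stuck(root):
    """Return the names of processes that are still suspended."""
    return [p.name for p in walk(root) if p.suspended]


job = Proc("main", children=[Proc("sub")])
suspend_tree(job)
continue_root_only(job)
print(stuck(job))   # the sub-process is still suspended, so main waits forever
continue_tree(job)
print(stuck(job))   # nothing left suspended
```

If the sub-process is left out of the "continue" step, the main process is runnable again but blocks on a child that never wakes up, which would match what I see on the machine.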
On Mon, Jul 29, 2019 at 2:01 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> Given that you have at most 200 jobs running at a time, I would not expect the lock used for the ShadowLog to result in long-term starvation
> of a Shadow, but that seems to be what is happening.
> The drive your log files are being written to might be in the process of failing, or perhaps you have some other process that is keeping the disk very busy?
> You could try moving your log directory to an SSD to speed things up.
> -----Original Message-----
> From: Andrew Cunningham <condor@xxxxxxxxxxxxxxxx>
> Sent: Monday, July 29, 2019 3:55 PM
> To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Error from SCHEDD on Windows 10
> "The message about "waiting for a lock" is indeed a problem. What
> do you have SHADOW_DEBUG and/or ALL_DEBUG set to?
> Is 153.79 one of the zombie jobs? Or is that a message about a job
> that is/was actually running?"
> All options for CONDOR are at the default. No special DEBUG options
> on. And yes, the zombie jobs included 153.79.