Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] About out of sync between schedd and collector

Date: Mon, 05 Dec 2016 22:16:26 +0000
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] About out of sync between schedd and collector

On Dec 5, 2016, at 2:48 AM, jiangxw@xxxxxxxxxxxxxxx wrote:

In our cluster, some jobs occasionally disappear from the schedd (they cannot be found with condor_q) while still occupying slots (the slots can be found with condor_status).

On the schedd, the shadows of these jobs have disappeared. On the startd machines occupied by these jobs, the starters are still running correctly.

When the job program finishes, the condor_starter is not released, and condor_status still shows the slot as Busy.

So we have to find these machines and restart them manually.

Is there a way to recover the shadow when it disappears but the starter is still running correctly?

Looking forward to your replies.

A few questions to help determine what's going wrong in your pool:

Can these jobs be found using condor_history? If so, what status do they have?

If you search for these jobs' IDs in the condor_shadow daemon log, do you see error messages?

If you search the daemon logs for these stuck condor_starters, do you see messages like this:

Lost connection to shadow, waiting 2400 secs for reconnect

Are these condor_starters stuck for longer than 40 minutes (or the value of the JobLeaseDuration attribute in the job ad)?
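As a rough aid for the log search above, here is a minimal Python sketch that scans log lines for the "Lost connection to shadow" message and extracts the reconnect wait in seconds. The function name `find_lost_shadow` is hypothetical, and the sketch assumes only the message text quoted above, not any particular timestamp or prefix layout in the log.

```python
import re

# Matches the starter log message quoted above. Only the message text is
# assumed; whatever timestamp or prefix precedes it on the line is ignored.
LOST_SHADOW = re.compile(
    r"Lost connection to shadow, waiting (\d+) secs for reconnect"
)


def find_lost_shadow(lines):
    """Return (line_number, wait_seconds) for each line with the message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LOST_SHADOW.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits
```

To use it, open a starter log and pass the file object (or a list of lines) to `find_lost_shadow`. A wait of 2400 seconds corresponds to the 40 minutes mentioned above.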

Thanks and regards,

Jaime Frey

UW-Madison HTCondor Project
