
Re: [HTCondor-users] About out of sync between schedd and collector



On Dec 5, 2016, at 2:48 AM, jiangxw@xxxxxxxxxxxxxxx wrote:

    In our cluster, some jobs occasionally disappear from the schedd (they cannot be found with condor_q)
    while still occupying slots (the slots can be found with condor_status).
    On the schedd side, the shadows of these jobs have disappeared. On the startd machines occupied by these jobs, the starters are running correctly.
    When the job's program finishes, the condor_starter is not released, and condor_status still shows the slot as Busy.
    So we have to find these machines and restart them manually.
    Is there a way to recover the shadow when it disappears but the starter is still running correctly?
    Looking forward to your replies.

A few questions to help determine what's going wrong in your pool:

Can these jobs be found using condor_history? If so, what status do they have?
If you search for these jobs' IDs in the condor_shadow daemon log, do you see error messages?
If you search the daemon logs for these stuck condor_starters, do you see messages like this:
  Lost connection to shadow, waiting 2400 secs for reconnect

Are these condor_starters stuck for longer than 40 minutes (or the value of the JobLeaseDuration attribute in the job ad)?
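
As a rough aid for the last two questions, here is a minimal Python sketch that scans starter log lines for the message quoted above and pulls out the reconnect wait time. It assumes the message text matches that example exactly; the function names, the timestamp in the sample line, and the log path in the usage note are hypothetical, and the 2400-second default simply mirrors the 40 minutes mentioned above.

```python
import re

# Assumed message format, copied from the example log line quoted above.
LOST_RE = re.compile(
    r"Lost connection to shadow, waiting (\d+) secs for reconnect"
)


def find_lost_shadow_waits(log_lines):
    """Return (line_number, wait_seconds) for every matching log line."""
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        match = LOST_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits


def exceeds_lease(stuck_seconds, lease_seconds=2400):
    """True if a starter has been stuck longer than the lease.

    lease_seconds defaults to 2400 (40 minutes); pass the job's
    JobLeaseDuration value if it is set differently.
    """
    return stuck_seconds > lease_seconds


# Hypothetical sample line; the timestamp prefix is illustrative only.
sample = [
    "12/05/16 02:10:01 Lost connection to shadow, waiting 2400 secs for reconnect",
    "12/05/16 02:10:02 some unrelated message",
]
hits = find_lost_shadow_waits(sample)
```

To run it against a real log, pass an open file instead of the sample list, for example `find_lost_shadow_waits(open(path))`, where `path` is wherever the starter log for that slot lives on your execute machine.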

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project