Re: [HTCondor-users] Do Starters stop responding if they are queued for i/o?

Hi Todd,

Great, thanks. The problem was triggered by an underlying disk issue that was causing transfers to take a long time. Between fixing that and bumping up the keep-alive timeout, I think we're fine for now.


> On Apr 10, 2019, at 3:02 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> On 4/10/2019 10:18 AM, Duncan Brown wrote:
>> Hi Todd,
>> I have some evidence (it's not conclusive) that if a Starter ends up queued to transfer data back to a shadow for a long period of time (i.e. in q> state because of low limits and high load on incoming condor file i/o), then the Starter can stop responding to the "are you alive" queries from the Startd and gets hard killed. The job then gets rescheduled. Here are the relevant parts of the user log, StartLog and StarterLog. The user job exits with a checkpoint at 04/03/19 18:32:31, and is waiting to transfer its checkpoint back to the busy schedd machine.
>> Any ideas? Is this possible?
>> Cheers,
>> Duncan.
> Hi Duncan,
> Thanks for the carefully researched bug report, as usual!
> Looking quickly over the code, this looks possible to me.
> The Starter fetches the input sandbox in the background, allowing it to 
> continue to send keep-alive messages to the Startd.  All is good.  But 
> for some odd reason, when the Starter sends back the output sandbox 
> after the job completes, it does this in the foreground, meaning if it 
> takes more than ~60 min (by default) the Startd will indeed kill it.
> As an immediate workaround you could increase the "I think you are dead" 
> timeout.
> Meanwhile I will create a ticket and see whether there is any deliberate 
> reason why the input sandbox is moved in the background but the 
> output sandbox is moved in the foreground.
> best
> Todd
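
The foreground/background difference Todd describes can be illustrated with a toy model. This is only a sketch of the idea, not HTCondor code: the function name, its parameters, and the 1200-second keep-alive interval are all hypothetical, and the 3600-second timeout simply mirrors the "~60 min (by default)" mentioned above.

```python
def starter_survives(transfer_secs, timeout_secs, keepalive_every, background):
    """Return True if the simulated Starter outlives its output transfer.

    Hypothetical model: every name and number here is an assumption made
    for illustration, not an HTCondor API or configuration value.
    """
    last_keepalive = 0  # time at which the Startd last heard from the Starter
    for now in range(1, transfer_secs + 1):
        # A background transfer leaves the main loop free to send keep-alives;
        # a foreground transfer blocks it until the transfer has finished.
        if background and now % keepalive_every == 0:
            last_keepalive = now
        # The Startd gives up once the silence exceeds its timeout.
        if now - last_keepalive > timeout_secs:
            return False  # Starter is hard-killed; the job gets rescheduled
    return True


# A 4000 s transfer against a 3600 s timeout:
print(starter_survives(4000, 3600, 1200, background=False))  # False
print(starter_survives(4000, 3600, 1200, background=True))   # True
```

In this model, raising the timeout above the transfer time (for example `starter_survives(4000, 7200, 1200, background=False)`) also lets the foreground case survive, which corresponds to the immediate workaround suggested above.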


Duncan Brown                              Room 263-1, Physics Department
Charles Brightman Professor of Physics     Syracuse University, NY 13244
http://dabrown.expressions.syr.edu                   Phone: 315 443 5993