
Re: [HTCondor-users] Do Starters stop responding if they are queued for i/o?



On 4/10/2019 10:18 AM, Duncan Brown wrote:
> Hi Todd,
> 
> I have some evidence (it's not conclusive) that if a Starter ends up queued to transfer data back to a shadow for a long period of time (i.e. in q> state because of low limits and high load on incoming condor file i/o), then the Starter can stop responding to the "are you alive" queries from the Startd and gets hard killed. The job then gets rescheduled. Here are the relevant parts of the user log, StartLog and StarterLog. The user job exits with a checkpoint at 04/03/19 18:32:31, and is waiting to transfer its checkpoint back to the busy schedd machine.
> 
> Any ideas? Is this possible?
> 
> Cheers,
> Duncan.

Hi Duncan,

Thanks for the carefully researched bug report, as usual!

Looking quickly over the code, this looks possible to me.

The Starter fetches the input sandbox in the background, allowing it to 
continue to send keep-alive messages to the Startd.  All is good.  But 
for some odd reason, when the Starter sends back the output sandbox 
after the job completes, it does this in the foreground, so no 
keep-alives go out while the transfer is pending.  If that takes more 
than ~60 min (by default), the Startd will indeed kill the Starter.
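
To make the difference concrete, here is a toy Python sketch.  It is 
not HTCondor code: run_transfer and its parameters are made up for 
illustration, and a sleep stands in for a slow sandbox transfer.  It 
only shows why a foreground transfer starves the keep-alive loop while 
a background one does not.

```python
import threading
import time


def run_transfer(transfer_secs, keepalive_interval, background):
    """Return the times (seconds since start) at which keep-alives were
    sent while a simulated transfer was in progress.

    Hypothetical helper for illustration only; not an HTCondor API.
    """
    sent = []
    start = time.monotonic()
    done = threading.Event()

    def transfer():
        time.sleep(transfer_secs)  # stand-in for a slow or queued transfer
        done.set()

    if background:
        worker = threading.Thread(target=transfer)
        worker.start()
        # The main loop stays free, so keep-alives continue until the
        # transfer signals completion.
        while not done.wait(keepalive_interval):
            sent.append(time.monotonic() - start)
        worker.join()
    else:
        # The transfer blocks the main loop; no keep-alive can be sent
        # until it returns.
        transfer()
    return sent


if __name__ == "__main__":
    print("background:", len(run_transfer(0.2, 0.05, background=True)))
    print("foreground:", len(run_transfer(0.2, 0.05, background=False)))
```

In the foreground case the list of keep-alives comes back empty, which 
is the situation the Startd would interpret as a hung Starter once its 
timeout expires.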

As an immediate workaround you could increase the "I think you are dead" 
timeout.
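
As a sketch of that workaround, and assuming the timeout involved is 
the general NOT_RESPONDING_TIMEOUT knob (in seconds, default 3600), 
which I have not confirmed is the one in play here, something like the 
following in the configuration on the execute machines would raise it 
to four hours.  The value 14400 is only an example.

```
# Assumption: NOT_RESPONDING_TIMEOUT governs how long the Startd waits
# before killing a Starter that has stopped sending keep-alives.
# Value is in seconds; 14400 (4 hours) is an illustrative choice.
NOT_RESPONDING_TIMEOUT = 14400
```

A condor_reconfig (or a restart of the daemons) would be needed for 
the change to be picked up.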

Meanwhile I will create a ticket and see if there is any deliberate 
reason why the input sandbox is moved in the background but the 
output sandbox is moved in the foreground.

best
Todd