
Re: [HTCondor-users] Strange Condor Behavior - Possible Bug



"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx> wrote on 09/30/2015 05:03:48 PM:

>  

> It appears to me that the job is hung on transferring output for an hour
> after running the job to completion.  Then after an hour the condor daemon
> copying the data is determined to be hung and is killed.  However we see
> the output file transferred to the schedd.  Similar behavior is observed
> on all the jobs that don't "finish".  The behavior only seems to appear
> in longer running jobs as all of the jobs are setup in the same way.

>  
> Thanks.
>  
> --
> Will Deck

As you're describing this, it sounds familiar. We've had situations where a large number of multi-gigabyte runs creates a very deep output transfer queue, so we can wind up with hundreds of slots waiting to transfer. In those same situations some jobs croak and wind up with multiple starts. Eventually everything grinds its way through to completion, but it takes much longer than it really should given what we've spent on network interfaces. I haven't had a chance to dig into it very deeply, since the ebb and flow of the work means that type of job is on the back burner for now. I'll peruse your logs in more detail and plan to take a closer look with one of the job submitters to see what we can come up with.

        -Michael Pelletier.