Subject: Re: [HTCondor-users] Strange Condor Behavior - Possible Bug
"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx> wrote on 09/30/2015 05:03:48 PM:
> It appears to me that the job is hung on transferring output for an hour
> after running the job to completion. Then after an hour the condor daemon
> copying the data is determined to be hung and is killed. However we see
> the output file transferred to the schedd. Similar behavior is observed
> on all the jobs that don't "finish". The behavior only seems to appear
> in longer running jobs as all of the jobs are setup in the same way.
>
> Thanks.
>
> --
> Will Deck

As you're describing this, it sounds familiar. We've had situations where
a large number of multi-gigabyte runs produces a very deep output transfer
queue, so we can wind up with hundreds of slots waiting to transfer, and
some jobs croak and wind up with multiple starts. Eventually everything
grinds its way through to completion, but it takes much longer than it
really should given what we've spent on network interfaces.

I haven't had a chance to dig into it very deeply, since the ebb and flow
of the work means that type of job is on the back burner for now. I'll
peruse your logs in more detail and plan to take a closer look with one of
the job submitters to see what we can come up with.