Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage

Date: Thu, 14 Mar 2024 09:28:22 -0500
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage

On 3/10/24 11:38, Benoit Roland wrote:

Dear all,
I would like to understand why a particular user's job is staying in the RUN state after the job's output has been transferred to the remote storage.



Hi Ben:

The short (but not useful) answer is that the job stays in the "R"un state until the shadow exits, even when the job has called exit(2), and file transfer has completed. Usually, the shadow exits pretty quickly after file transfer is done, but if something goes wrong, it might hang on longer.

I found the following error message appearing at a constant rate in the ShadowLog of the job, well after the output has been retrieved on the remote storage:
ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de: Repeated attempts to transfer output failed for unknown reasons" at line 585 in file /tmp/__build/build-3k7WTP/BUILD/condor-23.5.0/src/condor_shadow.V6.1/pseudo_ops.cpp

Do you have access to the StarterLog.slotXXX on the EP? I think that will point to the problem.
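
If the StarterLog is large, something like the following might help narrow it down. This is only a rough sketch in Python: the search strings are taken from the ShadowLog error quoted above, and it is an assumption that the starter logs similar wording, so adjust the patterns to whatever the log actually contains. The function names are made up for illustration.

```python
import sys
from collections import Counter

# Substrings borrowed from the ShadowLog error quoted above.
# Assumption: the StarterLog mentions the failure in similar words.
DEFAULT_PATTERNS = ("transfer output", "Repeated attempts")


def matching_lines(lines, patterns=DEFAULT_PATTERNS):
    """Return the log lines that contain any pattern (case-insensitive)."""
    lowered = [p.lower() for p in patterns]
    return [ln.rstrip("\n") for ln in lines
            if any(p in ln.lower() for p in lowered)]


def count_by_pattern(lines, patterns=DEFAULT_PATTERNS):
    """Count how many lines contain each pattern (case-insensitive)."""
    counts = Counter()
    for ln in lines:
        low = ln.lower()
        for p in patterns:
            if p.lower() in low:
                counts[p] += 1
    return counts


if __name__ == "__main__":
    # Usage: python scan_log.py /path/to/StarterLog.slotXXX
    with open(sys.argv[1], errors="replace") as fh:
        for hit in matching_lines(fh):
            print(hit)
```

The lines it prints, together with the few lines just before each one in the log, are usually the part worth posting back to the list.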



-greg
