
Re: [HTCondor-users] condor job stays in RUN state after output transfer to remote storage



On 3/10/24 11:38, Benoit Roland wrote:
Dear all,

I would like to understand why a particular user's job is staying in the RUN state after the job's output has been transferred to the remote storage.


Hi Ben:

The short (but not very useful) answer is that the job stays in the "R"un state until the shadow exits, even after the job has called exit(2) and file transfer has completed. Usually the shadow exits shortly after file transfer is done, but if something goes wrong, it can linger.



I found the following error message appearing at a constant rate in the ShadowLog of the job, well after the output has been retrieved on the remote storage:

ERROR "Error from slot1_1@gridka-2ed723aef7@c01-011-108.gridka.de: Repeated attempts to transfer output failed for unknown reasons" at line 585 in file /tmp/__build/build-3k7WTP/BUILD/condor-23.5.0/src/condor_shadow.V6.1/pseudo_ops.cpp


Do you have access to the StarterLog.slotXXX on the EP? I think that will point to the problem.
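
To compare the two logs, it can help to pull out just the lines carrying that error and see how they are spaced in time. Below is a minimal sketch in Python, not an HTCondor tool. It assumes each log line starts with a "MM/DD/YY HH:MM:SS" timestamp, which depends on your logging configuration, and the function names find_transfer_errors and errors_per_hour are made up for illustration.

```python
import re
from collections import Counter

# Assumption: lines look like "MM/DD/YY HH:MM:SS <message>".
# Adjust the pattern if your daemon logs use a different timestamp format.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2})\s+(.*)$")

# Substring of the error quoted earlier in this thread.
NEEDLE = "Repeated attempts to transfer output failed"


def find_transfer_errors(lines, needle=NEEDLE):
    """Return (date, time, message) tuples for timestamped lines containing needle."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and needle in match.group(3):
            hits.append(match.groups())
    return hits


def errors_per_hour(hits):
    """Count matching lines per (date, hour) to see whether the rate is steady."""
    return Counter((date, time[:2]) for date, time, _ in hits)
```

Passing an open file handle for the ShadowLog (or the StarterLog.slotXXX) to find_transfer_errors, then the result to errors_per_hour, gives a per-hour count that can be lined up between the two logs.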


-greg