[HTCondor-users] Remote Submit Fails To Spool Job Files



Hi All,

My current setup is a remote machine submitting jobs to a central manager, from which the jobs are sent to worker nodes. Recently the remote machine was upgraded from condor 8.0.6 to condor 8.1.5. With version 8.1.5, the jobs submitted by the remote machine show up on the central manager as held for a few seconds, for example:

113423.0 apf 5/23 20:42 Spooling input data files
113423.1 apf 5/23 20:42 Spooling input data files
113423.2 apf 5/23 20:42 Spooling input data files
113423.3 apf 5/23 20:42 Spooling input data files
113423.4 apf 5/23 20:42 Spooling input data files
113423.5 apf 5/23 20:42 Spooling input data files

After a few seconds, the jobs are removed. I can see the corresponding error messages on the remote submitter:

DCSchedd::spoolJobFiles:7002:File transfer failed for target job 113423.0: Failed to receive GoAhead message from <central manager's IP>.

The central manager is running condor version 8.0.3. Is there a configuration variable hidden somewhere that may be causing this issue? Or is this something that upgrading the central manager to a later stable condor version would likely solve?

Best Regards,
-Frank



--
----------
Frank Berghaus
University of Victoria
Research Associate
Physics & Astronomy
UVic Phone: +1 (250) 721-7741
UVic Office: Elliot 212