
[Condor-users] condor_transfer_data problem on major version switch

Hi all,

we have recently begun testing the remote features in our glidein/condor pool so that people from our institute can use condor from any authorised device (laptops, heterogeneous work pools, etc.) without having to worry about permanent condor infrastructure there. The idea is to provide a drastically cut-down condor installation via a shared disk, containing only the commands needed to interface with the remote daemons. As we are still in the testing phase, however, we are currently using a full condor suite (i.e. all of bin, sbin, libraries, etc.).
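For illustration, the cut-down client we have in mind boils down to a condor_config of roughly this shape (hostnames and paths are placeholders, not a tested configuration):

    RELEASE_DIR = /shared/condor/current
    LOCAL_DIR   = /tmp/condor-$(USERNAME)
    # Point the tools at the remote infrastructure.
    CONDOR_HOST = central-manager.example.institute
    # Tools only - no daemons are ever started on the user's device.
    DAEMON_LIST =

plus putting the bin/ directory on the user's PATH.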

Now, while submitting (condor_submit -remote <remote schedd> <jdl>) and managing jobs (condor_rm, condor_q, ...) work fine, we hit a strange bug in the file transfer when our resources/glideins run 7.6.X (tested with 7.6.10 and 7.6.7) while the user-side condor package is 7.8.X. When transferring output back from our dedicated schedd, condor_transfer_data requests the files "_condor_stderr" and "_condor_stdout", which do not exist, and exits with an error [1]. As a result, only the first job's data is fetched (the process exits afterwards), and the job stays alive in both the queue and the spool, slowly polluting our schedd node with leftovers unless cleaned up manually.

As far as I understand, these files are stand-ins on the remote schedd/workers for the actual Out and Err files (i.e. "_condor_stderr" would be remapped to "path/to/$(Cluster).$(Process).err" after the files are transferred to the user). It appears, however, that both the worker->schedd AND the schedd->user transfer attempt this remapping, so the second iteration fails: on the schedd, the files are already stored as "/spool/<cluster.process folder>/$(Cluster).$(Process).err".

Bottom line: condor_transfer_data worked ONLY when both the user AND the glideins/workers ran the same (major) version (tested with 7.6.10 and 7.8.4). Seeing how all other condor functions we use worked flawlessly even across major versions, we are not certain whether the version mismatch is the actual cause or there is another reason; the condor changelog does not mention any change to the transfer_data process.
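For reference, the user-side sequence looks roughly like this; the submit file is an ordinary vanilla-universe JDL along these lines (names chosen to match the log in [1], otherwise just an example):

    # job.jdl
    universe   = vanilla
    executable = pin.py
    output     = $(Cluster).$(Process).pin.py.stdout
    error      = $(Cluster).$(Process).pin.py.stderr
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

    $ condor_submit -remote <remote schedd> job.jdl      # spools input to the schedd
    $ condor_transfer_data -name <remote schedd> 391.0   # fetch output; fails as in [1]
    $ condor_rm -name <remote schedd> 391.0              # manual cleanup of queue/spool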

Our setup makes it very likely that we will have workers/resources running on different condor major versions, so it would be very helpful to know whether we also have to prepare remote submit packages matching every version in use, or whether we have some leeway there, especially with a smooth workflow for the users in mind.
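Should matching packages turn out to be necessary, we would probably hide the choice behind a small wrapper on the shared disk, roughly like this (untested sketch; it assumes the schedd ad advertises CondorVersion, that a default release is on the PATH for the query itself, and that we keep one release directory per version):

    #!/bin/sh
    # condor-remote: run a condor tool from the release matching the remote schedd.
    SCHEDD="$1"; shift
    # CondorVersion looks like '$CondorVersion: 7.8.4 Oct 29 2012 ... $';
    # the second field is the version number.
    VER=$(condor_status -schedd -constraint "Name == \"$SCHEDD\"" \
              -format '%s\n' CondorVersion | awk '{print $2}')
    export PATH="/shared/condor-$VER/bin:$PATH"
    exec "$@"

e.g. condor-remote <remote schedd> condor_transfer_data -name <remote schedd> 391.0.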

Best regards,
Max

[1] $ condor_transfer_data -name <remote schedd> 391.0
DCSchedd::receiveJobSandbox:7003:File transfer failed for target job 391.0: SCHEDD at 129.13.133.37 failed to send file(s) to <129.13.133.12:60262>: error reading from /data/srv/condor/current/condor_local/spool/391/0/cluster391.proc0.subproc0/391.0.pin.py.stderr: (errno 2) No such file or directory; TOOL failed to receive file(s) from <129.13.133.37:9615>
AUTHENTICATE:1004:Failed to authenticate using FS
ERROR: Failed to spool job files.