
[Condor-users] condor_transfer_data problem on major version switch

Hi all,

we have recently begun testing the remote features in our glidein/condor pool so that people from our institute can use condor from any authorised device (laptops, heterogeneous work pools, etc.) without having to worry about permanent condor infrastructure there. The idea is to provide a drastically cut-down condor installation via a shared disk, containing only the commands needed to interface with the remote daemons. As we are still in the testing phase, however, we are currently using a full condor suite (i.e. all of bin, sbin, libraries, etc.).
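For illustration, the cut-down client we have in mind boils down to a condor_config of roughly this shape (hostnames and paths are placeholders, not a tested configuration):

    RELEASE_DIR = /shared/condor/current
    LOCAL_DIR   = /tmp/condor-$(USERNAME)
    # Point the tools at the remote infrastructure.
    CONDOR_HOST = central-manager.example.institute
    # Tools only - no daemons are ever started on the user's device.
    DAEMON_LIST =

plus putting the bin/ directory on the user's PATH.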

Now, while submitting (condor_submit -remote <remote schedd> <jdl>) and managing jobs (condor_rm, condor_q, ...) work fine, we hit a strange bug in the file transfer when our resources/glideins run 7.6.X (tested with 7.6.10 and 7.6.7) while the user-side condor package is 7.8.X. When transferring output back from our dedicated schedd, condor_transfer_data requests the files "_condor_stderr" and "_condor_stdout", which do not exist, and exits with an error [1]. As a result, only the first job's data is fetched (the process exits afterwards), and the job stays alive in both the queue and the spool, slowly polluting our schedd node with leftovers unless cleaned up manually.

As far as I understand, these files are stand-ins on the remote schedd/workers for the actual Out and Err files (i.e. "_condor_stderr" would be remapped to "path/to/$(Cluster).$(Process).err" after the files are transferred to the user). It appears, however, that both the worker->schedd AND the schedd->user transfer attempt this remapping, so the second iteration fails: on the schedd, the files are already stored as "/spool/<cluster.process folder>/$(Cluster).$(Process).err".

Bottom line: condor_transfer_data worked ONLY when both the user AND the glideins/workers ran the same (major) version (tested with 7.6.10 and 7.8.4). Seeing how all other condor functions we use worked flawlessly even across major versions, we are not certain whether the version mismatch is the actual cause or there is another reason; the condor changelog does not mention any change to the transfer_data process.
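For reference, the user-side sequence looks roughly like this; the submit file is an ordinary vanilla-universe JDL along these lines (names chosen to match the log in [1], otherwise just an example):

    # job.jdl
    universe   = vanilla
    executable = pin.py
    output     = $(Cluster).$(Process).pin.py.stdout
    error      = $(Cluster).$(Process).pin.py.stderr
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

    $ condor_submit -remote <remote schedd> job.jdl      # spools input to the schedd
    $ condor_transfer_data -name <remote schedd> 391.0   # fetch output; fails as in [1]
    $ condor_rm -name <remote schedd> 391.0              # manual cleanup of queue/spool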

Our setup makes it very likely that we will have workers/resources running on different condor major versions, so it would be very helpful to know whether we also have to prepare remote submit packages matching every version in use, or whether we have some leeway there, especially with a smooth workflow for the users in mind.
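Should matching packages turn out to be necessary, we would probably hide the choice behind a small wrapper on the shared disk, roughly like this (untested sketch; it assumes the schedd ad advertises CondorVersion, that a default release is on the PATH for the query itself, and that we keep one release directory per version):

    #!/bin/sh
    # condor-remote: run a condor tool from the release matching the remote schedd.
    SCHEDD="$1"; shift
    # CondorVersion looks like '$CondorVersion: 7.8.4 Oct 29 2012 ... $';
    # the second field is the version number.
    VER=$(condor_status -schedd -constraint "Name == \"$SCHEDD\"" \
              -format '%s\n' CondorVersion | awk '{print $2}')
    export PATH="/shared/condor-$VER/bin:$PATH"
    exec "$@"

e.g. condor-remote <remote schedd> condor_transfer_data -name <remote schedd> 391.0.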

Best regards,
Max

[1] $ condor_transfer_data -name <remote schedd> 391.0
DCSchedd::receiveJobSandbox:7003:File transfer failed for target job 391.0: SCHEDD at 129.13.133.37 failed to send file(s) to <129.13.133.12:60262>: error reading from /data/srv/condor/current/condor_local/spool/391/0/cluster391.proc0.subproc0/391.0.pin.py.stderr: (errno 2) No such file or directory; TOOL failed to receive file(s) from <129.13.133.37:9615>
AUTHENTICATE:1004:Failed to authenticate using FS
ERROR: Failed to spool job files.