
[HTCondor-users] EncryptExecuteDirectory issues on Windows execute nodes without run_as_owner



Partly an FYI, but also a question.

 

We have recently implemented encryption of the execute directory on Windows nodes by

having the Windows submit nodes set it for all submitted jobs:

 

# set all jobs to encrypt the execute directory on execute nodes.

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) Encrypt

JOB_TRANSFORM_Encrypt @=end

     #REQUIREMENTS universe =?= vanilla

    SET EncryptExecuteDirectory = true

     # optionally also force match to nodes that can encrypt.  (not all Linux nodes can encrypt)

     #SET Requirements = ( $(MY.Requirements) ) && TARGET.HasEncryptExecuteDirectory

@end

 

# Do not allow users to edit the value of EncryptExecuteDirectory after submission

# via tools like condor_qedit or chirp.

IMMUTABLE_JOB_ATTRS = $(IMMUTABLE_JOB_ATTRS) EncryptExecuteDirectory

 

We (well, more accurately, "me" 😉) had done some testing, but unfortunately not enough. Sigh.

 

We have not yet implemented pool passwords and a credd server, so jobs are not running as the owner

but as the condor-slot, etc. users that HTCondor itself creates dynamically. This has implications for most

of our users' "jobs", which are actually batch files. The generic structure is usually something like:

 

map a network drive to the user's fileserver

download zipped software binaries

download input data file/s

unzip software binaries

run software with the input data file/s

upload output data file/s to the user's fileserver

disconnect network drive

 

The problem occurs when uploading the output file/s, where we get a "The specified file could not be encrypted" message.

In hindsight I think this makes sense: the file is encrypted by user condor-slot1 and is then copied to a location that only

the "real" user has permissions to, which causes problems.

One kludge is to use the "cipher" command to decrypt the file before uploading it, e.g.

 

software.exe > outputfile.dat

cipher /d /b /h outputfile.dat > nul 2>&1

copy outputfile.dat \\fileserver\user\output

 

The alternative kludge is to redirect the output directly to the fileserver, so that it is never encrypted on the local execute node.

This may not always be possible, though, depending on how the software creates its output data file/s.

 

software.exe > \\fileserver\user\output\outputfile.dat

 

So that's the FYI bit, and once users can run_as_owner I don't think this should be a problem?
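
For reference, a rough sketch of what switching to run_as_owner might involve. It is untested here and is not a complete setup: credd.example.org is a placeholder host name, and the credd daemon, pool password and authentication settings that run_as_owner also depends on are left out.

```
# Execute-node configuration (credd.example.org is a placeholder, not a real host)
CREDD_HOST = credd.example.org
STARTER_ALLOW_RUNAS_OWNER = True

# Submit description file
run_as_owner = True
```

Each user would also need to store their password with condor_store_cred add before submitting.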

 

Now for the question part.

 

The above kludges mostly work, but there is still a small percentage (3%) of jobs, e.g. 150 out of 5,000 that give errors like:

 

120813.3    na-hit023       9/13 11:03 Error from slot1@xxxxxxxxxxx: STARTER at 152.83.xxx.xxx failed to send file(s) to <152.83.yyy.yyy:62198>: error reading from C:\PROGRA~1\condor\execute\dir_2356\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <152.83.xxx.xxx:50880>

120813.210  na-hit023       9/13 11:20 Error from slot3@xxxxxxxxxxx: STARTER at 138.194.aaa.aaa failed to write to file C:\PROGRA~1\condor\execute\dir_15600\condor_exec.exe: (errno 13) Permission denied

120813.579  na-hit023       9/13 11:16 Error from slot31@xxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_26868\_condor_stdout' as standard output: Permission denied (errno 13)

120813.675  na-hit023       9/13 11:00 Error from slot9@xxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_33056\_condor_stderr' as standard error: Permission denied (errno 13)

120813.755  na-hit023       9/13 11:00 Error from slot16@xxxxxxxxxxx: STARTER at 152.83.bbb.bbb failed to write to file C:\PROGRA~1\condor\execute\dir_22412\condor_exec.exe: (errno 13) Permission denied

 

These must be related to the encrypt_execute_directory stuff because we can re-run the jobs with NO execute directory encryption enabled

and do not get these errors.

 

Again, we can kludge around them using something like:

 

periodic_release = (JobStatus == 5) && ((HoldReasonCode == 12) || (HoldReasonCode == 13))
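
A slightly more defensive variant of that release expression, as a sketch only: it waits a few minutes before releasing and stops after a handful of holds, so a job that keeps failing does not cycle forever. The 300-second delay and the limit of 5 are arbitrary illustrative values, and it assumes the NumHolds job attribute is available in the HTCondor version in use.

```
# Hold reason codes 12 and 13 are the output-transfer and input-transfer errors.
periodic_release = (JobStatus == 5) && \
    ((HoldReasonCode == 12) || (HoldReasonCode == 13)) && \
    ((time() - EnteredCurrentStatus) > 300) && \
    (NumHolds <= 5)
```

Jobs that exceed the limit stay on hold, which at least makes the persistent failures easy to find with condor_q -hold.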

 

So I guess the question is: does anyone have any ideas as to why these errors are occurring, and why only when EncryptExecuteDirectory is set to true?

 

Thanks for any help/ideas/comments.

 

Cheers

 

Greg