
Re: [HTCondor-users] EncryptExecuteDirectory issues on Windows execute nodes without run_as_owner



Hi Todd

The re-run was with all 5,000 jobs, and no errors occur when encrypt_execute_directory is false.

I think some sort of race condition is likely, as the problem seems worse on nodes with more cores/slots.

I re-ran just 50 jobs, and targeted (via the requirements statement) a single Windows execute node that
has 36 cores/slots (2 x Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz, 18 cores per CPU).

At one stage there were 24/50 jobs on hold:

120822.0   na-hit023       9/15 10:50 Error from slot1@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56553>: error reading from C:\PROGRA~1\condor\execute\dir_16464\_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63593>
120822.1   na-hit023       9/15 10:51 Error from slot2@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56582>: error reading from C:\PROGRA~1\condor\execute\dir_14952\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63687>
120822.2   na-hit023       9/15 10:51 Error from slot3@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56581>: error reading from C:\PROGRA~1\condor\execute\dir_18480\_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63699>
120822.3   na-hit023       9/15 10:52 Error from slot4@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56583>: error reading from C:\PROGRA~1\condor\execute\dir_32332\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63757>
120822.4   na-hit023       9/15 10:51 Error from slot5@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56584>: error reading from C:\PROGRA~1\condor\execute\dir_33716\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63751>
120822.11  na-hit023       9/15 10:52 Error from slot13@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56645>: error reading from C:\PROGRA~1\condor\execute\dir_6656\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63756>
120822.13  na-hit023       9/15 10:50 Error from slot15@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_13636\condor_exec.exe: (errno 13) Permission denied
120822.14  na-hit023       9/15 10:50 Error from slot16@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_29624\_condor_stdout' as standard output: Permission denied (errno 13)
120822.15  na-hit023       9/15 10:50 Error from slot17@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_23092\condor_exec.exe: (errno 13) Permission denied
120822.16  na-hit023       9/15 10:50 Error from slot18@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_13068\_condor_stdout' as standard output: Permission denied (errno 13)
120822.17  na-hit023       9/15 10:50 Error from slot19@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_11092\_condor_stdout' as standard output: Permission denied (errno 13)
120822.18  na-hit023       9/15 10:50 Error from slot20@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_22992\_condor_stdout' as standard output: Permission denied (errno 13)
120822.20  na-hit023       9/15 10:50 Error from slot22@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_11184\condor_exec.exe: (errno 13) Permission denied
120822.21  na-hit023       9/15 10:50 Error from slot23@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_19296\condor_exec.exe: (errno 13) Permission denied
120822.22  na-hit023       9/15 10:50 Error from slot24@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_31936\condor_exec.exe: (errno 13) Permission denied
120822.25  na-hit023       9/15 10:50 Error from slot27@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_10332\condor_exec.exe: (errno 13) Permission denied
120822.26  na-hit023       9/15 10:51 Error from slot28@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56725>: error reading from C:\PROGRA~1\condor\execute\dir_29008\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63725>
120822.27  na-hit023       9/15 10:50 Error from slot29@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_21172\condor_exec.exe: (errno 13) Permission denied
120822.29  na-hit023       9/15 10:50 Error from slot31@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_28180\condor_exec.exe: (errno 13) Permission denied
120822.30  na-hit023       9/15 10:50 Error from slot32@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_33184\condor_exec.exe: (errno 13) Permission denied
120822.32  na-hit023       9/15 10:50 Error from slot34@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_22980\condor_exec.exe: (errno 13) Permission denied
120822.33  na-hit023       9/15 10:50 Error from slot35@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_17172\condor_exec.exe: (errno 13) Permission denied
120822.34  na-hit023       9/15 10:50 Error from slot36@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_17596\condor_exec.exe: (errno 13) Permission denied
120822.47  na-hit023       9/15 10:51 Error from slot17@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_30664\condor_exec.exe: (errno 13) Permission denied

but 9 jobs still ran to completion OK, before I killed the rest of the jobs.
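
For anyone who wants to tally the hold reasons rather than eyeball them, here is a minimal Python sketch. It assumes the hold lines have been saved as plain text in the format shown above; the category names (read_on_send, write_exec, open_stdio) and the function names are my own labels for illustration, not HTCondor terminology.

```python
import re
from collections import Counter

# Each pattern is modelled on one of the three hold-reason shapes quoted
# above. The category names are hypothetical labels, not HTCondor terms.
PATTERNS = [
    ("read_on_send",
     re.compile(r"failed to send file\(s\).*error reading from (\S+):")),
    ("write_exec",
     re.compile(r"failed to write to file (\S+):")),
    ("open_stdio",
     re.compile(r"Failed to open '([^']+)' as standard (?:output|error)")),
]


def classify(line):
    """Return (category, path) for a hold line, or None if nothing matches."""
    for name, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return name, match.group(1)
    return None


def tally(lines):
    """Count how many hold lines fall into each category."""
    counts = Counter()
    for line in lines:
        hit = classify(line)
        if hit is not None:
            counts[hit[0]] += 1
    return counts


# Three hold reasons copied from the listing above, one of each shape.
SAMPLE = [
    r"Error from slot1@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56553>: error reading from C:\PROGRA~1\condor\execute\dir_16464\_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63593>",
    r"Error from slot15@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_13636\condor_exec.exe: (errno 13) Permission denied",
    r"Error from slot16@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_29624\_condor_stdout' as standard output: Permission denied (errno 13)",
]

if __name__ == "__main__":
    for category, count in sorted(tally(SAMPLE).items()):
        print(category, count)
```

By my count, the 24 hold lines above break down as 13 failures writing condor_exec.exe, 7 failures reading stdout/stderr during output transfer, and 4 failures opening _condor_stdout. So the errors hit at job start, at stdout creation, and at output transfer alike.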

I ran this with full debug (ALL_DEBUG = D_FULLDEBUG) on both submit and execute nodes but
there didn't seem to be any extra info in the logs that explained what was happening.

I can send you the logs offline if you think that may help.

Meanwhile I'll try the output remap as another way of getting the output file
onto the fileserver, although that is a separate issue to the above errors.

Thanks

Cheers

Greg

P.S. I ran the 50 jobs twice more on the one execute node, each time with periodic_release set to true.
These jobs just chew CPU for 5 mins, plus file download/upload times. I have attached a Ganglia graph of the jobs' progress for each run.

Run 1 - encrypt_execute_directory = true

50 jobs took 21 mins total throughput time.
30 jobs were put on "hold" at some stage, 18 once, 10 twice, 2 three times.
All eventually ran to completion.

Run 2 - encrypt_execute_directory = false

50 jobs took 13 mins total throughput time.
No jobs were put on hold.
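
To put the two runs side by side, here is a small Python sketch of the arithmetic. The figures are copied from the runs above; the variable and function names are just for illustration.

```python
# Figures copied from the two runs reported above.
RUN1_MINUTES = 21  # encrypt_execute_directory = true
RUN2_MINUTES = 13  # encrypt_execute_directory = false

# Run 1 only: number of times held -> number of jobs held that many times.
HELD_TIMES = {1: 18, 2: 10, 3: 2}


def summarize():
    """Derive hold totals and the relative slowdown of Run 1 over Run 2."""
    held_jobs = sum(HELD_TIMES.values())
    hold_events = sum(times * jobs for times, jobs in HELD_TIMES.items())
    extra_minutes = RUN1_MINUTES - RUN2_MINUTES
    slowdown = RUN1_MINUTES / RUN2_MINUTES
    return {
        "held_jobs": held_jobs,
        "hold_events": hold_events,
        "extra_minutes": extra_minutes,
        "slowdown": slowdown,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

That works out to 44 hold-and-release cycles across the 30 affected jobs, and Run 1 taking 8 minutes (roughly 62%) longer than Run 2.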




-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd L Miller
Sent: Wednesday, 15 September 2021 2:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] EncryptExecuteDirectory issues on Windows execute nodes without run_as_owner

> One kludge around is to use the "cipher" command to decrypt the file 
> before uploading it, e.g.

 	You could also potentially use HTCondor's file-transfer mechanism, 
although it will end up being a little less efficient in this case: if the 
submit node can mount \\fileserver, your jobs could terminate after 
creating outputfile.dat but specify

transfer_output_files = outputfile.dat
transfer_output_remaps = outputfile.dat=\\fileserver\user\output

HTCondor will read outputfile.dat as the condor-slot user and transfer it 
to a daemon running on the submit node as the owner of the job, which
(should) allow that daemon to write to \\fileserver\user\output.

> So that's the FYI bit, and once users can run_as_owner I don't think 
> this shouldn't be a problem?

 	Indeed.

> These must be related to the encrypt_execute_directory stuff because we 
> can re-run the jobs with NO execute directory encryption enabled and do 
> not get these errors.

 	Do you re-run all 5,000 jobs and get no failures, or just the 
failed 150?

> So I guess the question is does anyone have any ideas as to why these 
> errors are occurring? And only when encryptexecutedirectory is set to 
> true?

 	I'm a little more worried by failing to read from the standard 
error log after the job has finished than the two errors failing to 
create the log files.  Failing to write to the log after creating it is 
also very strange.  It makes me wonder if there's a clean-up process going 
astray somewhere, possibly because of a race condition made worse by 
encrypting the execute directory.

- ToddM

Attachment: jobs.JPG
Description: jobs.JPG