
Re: [HTCondor-users] Windows dedicated run account profile corrupted



I don’t think a job crash is more likely to result in failure to clean up, but that is just a guess.  

 

My best guess is that a job that starts one or more child processes is leaving a child process running, and that process is holding a lock that prevents the account profile from being deleted.   I know the Windows object model allows an active process to prevent the deletion of a file or directory, and a process that switched its User ID could escape HTCondor’s process tracking.
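One way to test this guess is to check, right after a job exits, whether anything is still running under the slot account. The sketch below is only an illustration, not something HTCondor does: it assumes the standard Windows `tasklist` command with its `USERNAME` filter and CSV output, and the function names are made up. It would need to run with admin rights to see other users' processes, and it would not catch a process that has switched to a different user ID.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(text):
    """Return (image_name, pid) pairs from `tasklist /FO CSV /NH` output."""
    processes = []
    for row in csv.reader(io.StringIO(text)):
        # Process rows carry the PID in the second column; the
        # "INFO: No tasks are running..." message has only one field.
        if len(row) >= 2 and row[1].isdigit():
            processes.append((row[0], int(row[1])))
    return processes


def processes_for_account(account):
    """Windows only: list processes still running as the given user."""
    result = subprocess.run(
        ["tasklist", "/FI", f"USERNAME eq {account}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist_csv(result.stdout)
```

Calling `processes_for_account("condor-slot1")` on an idle slot and getting a non-empty list would point at a leftover child process.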

 

If my speculation is correct, then you should expect this problem to correlate with a specific type of job, and it may correlate with that job exiting abnormally. I further suspect that the user profiles are not so much corrupted as abandoned: a request was made to activate a fresh one while the current one was still locked, and this is Windows’ workaround.

 

Currently the best I can think of is to detect the workaround and report it to the admin somehow. I was thinking email, but I’m open to suggestions.
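As a rough idea of what that detection could look like outside of HTCondor: the original report describes the fallback as a folder such as C:\Users\condor-slot1.hostname. The sketch below assumes that naming pattern (slot account name, a dot, then some suffix) holds in general, which is an assumption on my part, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: "condor-slot<N>" optionally followed by ".<suffix>".
SLOT_PROFILE = re.compile(r"^(condor-slot\d+)(?:\.(.+))?$", re.IGNORECASE)


def find_fallback_profiles(dir_names):
    """Map each slot account to the fallback profile folders found for it."""
    fallbacks = defaultdict(list)
    for name in dir_names:
        match = SLOT_PROFILE.match(name)
        # Only names with a ".<suffix>" part count as fallback folders.
        if match and match.group(2):
            fallbacks[match.group(1).lower()].append(name)
    return {account: sorted(names) for account, names in fallbacks.items()}


if __name__ == "__main__":
    import os

    users_dir = r"C:\Users"
    if os.path.isdir(users_dir):
        found = find_fallback_profiles(os.listdir(users_dir))
        for account, folders in found.items():
            print(f"{account}: fallback profile folders {folders}")
```

A scheduled task could run this on each execute node and pass any non-empty result to whatever alerting is already in place.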

 

-tj

 

From: O'NEAL Mark <mark.oneal@xxxxxxxxxxx>
Sent: Tuesday, August 8, 2023 8:06 AM
To: John M Knoeller <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: Windows dedicated run account profile corrupted

 

We have observed the corrupted profiles on both Windows 8 and Windows 10.  Currently we're not utilizing the email notifications from HTCondor for status monitoring.

 

We wondered about job shutdown, and whether an application crash (i.e. exit with an unhandled exception) might make it hard for the startd to shut down the job in the normal way.

 

Thanks for clarifying the load_profile implementation, that is clear now.  We have 2 small test clusters where we have implemented run_as_owner together with the cred store, but it is not used on the large cluster I reported about.

 

Mark

 


From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Thursday, August 3, 2023 4:28 PM
To: O'NEAL Mark <mark.oneal@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: Windows dedicated run account profile corrupted

 

This email is not from Hexagon’s Office 365 instance. Please be careful while clicking links, opening attachments, or replying to this email.

 

Do you see the corrupted user profiles on both Windows 8 and Windows 10, or just on one or the other of those platforms?

 

We saw something similar to what you are describing many years ago on one of the nodes in our build farm.  It was at least 5 years ago, and that node had a failing disk, so we chalked it up at the time to the failing disk.  I think I remember that the node was Windows 8.1, but it was so long ago I cannot be sure.

 

It is certainly plausible that the cleanup of a user profile would fail if we tried to do it while a process using the profile was still running. It is HTCondor’s responsibility to stop all processes started by a job when the job exits, so it is reasonable to consider this an HTCondor bug, but I don’t have any idea how to fix it. The best we could manage would be to detect the left-behind user directory and report it. (Do you have HTCondor configured to send email to an admin when things go wrong, like a daemon crash?)

 

> Does the above mean that when we want to rely on the dedicated run account, the submit configuration knob "load_profile = True" is redundant?

That means that load_profile=true is *available* with the dedicated run account, but HTCondor will not actually load a registry hive unless the job requests it.

Besides using a dedicated run account, the other option is run_as_owner=true, which only works if you have an account for the submitting user on the execute node.  run_as_owner will always load the registry hive for that user.

 

 

 


From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Wednesday, August 2, 2023 5:15 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: O'NEAL Mark <mark.oneal@xxxxxxxxxxx>
Subject: RE: Windows dedicated run account profile corrupted

 


 

This is not a known HTCondor issue. 

 

I wonder if restarting Windows could clean up the user directories and registries that had been left behind?

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of O'NEAL Mark via HTCondor-users
Sent: Tuesday, August 1, 2023 7:52 PM
To: htcondor-users@xxxxxxxxxxx
Cc: O'NEAL Mark <mark.oneal@xxxxxxxxxxx>
Subject: [HTCondor-users] Windows dedicated run account profile corrupted

 

Hello,

 

We operate an HTCondor cluster under Windows utilizing the "load_profile = True" submit configuration macro and rely on the dedicated run accounts provisioned by the condor_startd running as Windows SYSTEM user.  Compute nodes running startd are a mix of Windows 8 and 10 running HTCondor 8.8.10, and are configured with static slot definitions.

 

Our IT manager recently noted that the dedicated run account profile cleanup which normally happens during job shutdown has been disrupted at some point in time on a number of these nodes, evidenced by:

  • profile folder in C:\Users (e.g. C:\Users\condor-slot1) is not deleted and appears corrupted.  The next time the startd tries to create the dedicated run account, Windows falls back to generating C:\Users\condor-slot1.hostname
  • registry hive for the user condor-slot1 is not deleted

I've checked the StarterLog for a number of the slots; most show the registry hive loading successfully even when the issue described above is observed for that slot.  Some did report a failure to load the registry hive in the StarterLog.

 

I've done some research on the open web and haven't identified any hints on where to look thus far.  I would appreciate it if anyone on the mailing list has suggestions on where to start with log investigation or configuration settings.  We run the cluster for LAN use only behind our firewall, so have not seen a significant motivation to upgrade to the 9.x or 10.x releases.  If this were a known issue with older versions it would be a reasonable motivation to take the upgrade plunge though.

 

Best Regards,
Mark