
[HTCondor-users] HTCondor with kerberized home directories



Dear HTCondor experts,

I am wondering how others have solved the following problem. 

We have HTCondor installed on our desktop machines for submission, and the jobs run on worker nodes in a private network. 
The desktops are naturally subject to security updates and may be rebooted about once per week. The home directories are mounted via NFSv4 with Kerberos 5 authentication. 

Now, when a desktop running a schedd is rebooted while jobs are still running, HTCondor tries to reconnect to them after the reboot.
Sadly, this is bound to fail: the jobs are submitted from the users' home directories, and often the job logs are written there as well,
but the homes are not accessible without the user's Kerberos 5 TGT, and the credential cache is of course gone after the reboot until the user logs in again.
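
Just to illustrate the underlying issue, here is a rough sketch of how it can be reproduced by hand on the submit host (path as in the logs below):
-------------------------------------
# Without a ticket cache, the krb5-secured NFSv4 home is unreadable even
# for the owning user -- which is exactly the situation the shadow is in
# after a reboot, before the user has logged in again.
kdestroy                              # throw away any cached tickets
ls /gpfs/share/home/freyermu/jobs     # should now fail with "Permission denied"
kinit                                 # obtain a fresh TGT interactively
ls /gpfs/share/home/freyermu/jobs     # works again while the ticket is valid
-------------------------------------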

So we see the following funny message in our logs:
-------------------------------------
Jun 11 17:33:38 cip000 condor_shadow[1968]: Path does not exist.
                                            He who travels without bounds
                                            Can't locate data.
Jun 11 17:33:38 cip000 condor_shadow[1968]: Cannot access initial working directory /gpfs/share/home/freyermu/jobs: Permission denied
Jun 11 17:33:38 cip000 condor_shadow[1968]: Job 4.0 going into Hold state (code 14,13): Cannot access initial working directory /gpfs/share/home/freyermu/jobs: Permission denied
-------------------------------------
and then the job's claim is released.

If the log file is in the shared home directory, I get:
-------------------------------------
Jun 11 17:14:09 cip000 condor_shadow[2292]: Initializing a VANILLA shadow for job 3.0
Jun 11 17:14:09 cip000 condor_shadow[2292]: WriteUserLog::initialize: safe_open_wrapper("/gpfs/share/home/freyermu/jobs/logs/log.0") failed - errno 13 (Permission denied)
Jun 11 17:14:09 cip000 condor_shadow[2292]: WriteUserLog::initialize: failed to open file /gpfs/share/home/freyermu/jobs/logs/log.0
Jun 11 17:14:09 cip000 condor_shadow[2292]: Failed to initialize user log to /gpfs/share/home/freyermu/jobs/logs/log.0
Jun 11 17:14:09 cip000 condor_shadow[2292]: Job 3.0 going into Hold state (code 22,0): Failed to initialize user log to /gpfs/share/home/freyermu/jobs/logs/log.0
Jun 11 17:14:09 cip000 condor_shadow[2292]: RemoteResource::killStarter(): DCStartd object NULL!
Jun 11 17:14:10 cip000 condor_shadow[2292]: **** condor_shadow (condor_SHADOW) pid 2292 EXITING WITH STATUS 112
Jun 11 17:14:10 cip000 condor_schedd[2141]: Shadow pid 2292 for job 3.0 exited with status 112
-------------------------------------

How are others solving this?
Is the only way out to keep some kind of scratch space somewhere that is protected only by plain Unix authentication (sketched below)?
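
What I am imagining is roughly the following (just a sketch; /var/scratch/condor is a hypothetical local area with plain Unix permissions, and run_job.sh / input.dat are placeholders), so that the shadow never needs to touch the Kerberized home after a reboot:
-------------------------------------
executable              = run_job.sh
universe                = vanilla

# Keep everything the shadow has to access after a reconnect -- working
# directory, user log, stdout/stderr -- off the Kerberized home.
initialdir              = /var/scratch/condor/$ENV(USER)
log                     = job_$(Cluster).$(Process).log
output                  = job_$(Cluster).$(Process).out
error                   = job_$(Cluster).$(Process).err

# Let HTCondor transfer files instead of relying on the shared home.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat

queue
-------------------------------------
If I understand it correctly, condor_submit -spool (with the output fetched later via condor_transfer_data) would achieve something similar by staging everything through the schedd's spool directory, but that changes the users' workflow quite a bit.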

Since we use Singularity containers, checkpointing sadly does not work, so even if we let the jobs stay on hold and only release them once the user comes back and logs in,
they will have to start again from the very beginning.

Cheers,
	Oliver
