
[HTCondor-users] Iwd check of condor jobs



Hi all,

We are running condor version 8.5.8. Recently, several condor users reported that their jobs had been held. I looked into the issue and found that the LastHoldReason of those jobs is something like: "Cannot access initial working directory <a path that exists on the local file system of the submit node>: No such file or directory"

I assume that this happens when users submit their jobs from a path (i.e., the current working directory) that either exists only on the local file system of the submit node and not on the condor execute nodes, or is on an NFS file system that is not accessible from the condor execute nodes.

If my understanding is correct, the condor worker daemons check that the initial working directory (a.k.a. Iwd or initialdir) exists when they are about to run a job. If the Iwd path does not exist on the condor execute node, the worker daemon treats this as an error and puts the job on hold. Please feel free to correct me if I am wrong.
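
To make that assumption concrete, here is a minimal Python sketch of the kind of check I have in mind. It is only an illustration, not HTCondor's actual code: the function name iwd_hold_reason is hypothetical, and the message format is simply copied from the LastHoldReason quoted above.

```python
import os


def iwd_hold_reason(iwd):
    """Return None if iwd can be accessed from this machine, otherwise a
    string shaped like the LastHoldReason we see on the held jobs.

    Hypothetical helper that illustrates the assumed check; it does not
    reproduce HTCondor's real implementation.
    """
    try:
        # stat() fails with ENOENT if the path does not exist here, or
        # with EACCES if a parent directory cannot be searched.
        os.stat(iwd)
    except OSError as err:
        return "Cannot access initial working directory {}: {}".format(
            iwd, os.strerror(err.errno)
        )
    return None
```

Running this on the machine that performs the check, with the Iwd of a held job as the argument, should show whether that path is visible from there.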

I have been able to reproduce the issue by submitting a job from a local path that is not accessible from the condor execute node. However, those users claim to have submitted jobs from a local path for years without any problem, and say the problem only showed up recently. This puzzles me a lot.

As far as I know, no configuration change has been made recently. Any thoughts on why this could happen would be appreciated. I can provide more detail on the relevant configuration settings if that would help to find the cause.


Thanks