
Re: [Condor-users] Strange troubles with Condor jobs submitted via globus



Carl Lundstedt wrote:

> Hi all,
>
> I don't know who to direct this question to.
> I have a small cluster I'm building to learn all the grid 
> middleware.  I have Condor running on the machine and its WNs, and I 
> can submit jobs to the machine locally just fine and they complete.  
> HOWEVER
> I installed VDT 0.4.0 to get the globus interfaces up and going and 
> everything seems fine.
> globus-job-run unlcompel1.unl.edu/jobmanager-fork /usr/bin/id
> works just as it should.
> globus-job-run unlcompel1.unl.edu/jobmanager-condor /usr/bin/id
> hangs.
>
> Looking through the logs, the job gets placed in the queue as a local 
> user (uscms01).
> The ShadowLog shows that it's failing because:
> ERROR "Error from starter on valley003: Failed to open standard 
> output file '/home/uscms01/.globus/job/unlcompel1.unl.edu/
> 24284.113795243/stdout':Permission denied (errno13)" at line 666 in 
> file pseudo_ops.C
>
> Clearly there's a read/write privilege problem, but I can't for the 
> life of me figure it out.
> The job creates that directory when it comes in.
> When I created the user uscms01 I passed the passwd, shadow, and 
> group files down to the worker nodes, and when I log into the WNs via 
> ssh, uscms01 can do all the things I'd expect.
>
In my experience (at least for the vanilla universe; I can't speak with
certainty about globus), this error typically means that the starter is
trying to run the job as the user nobody, which doesn't have write
permission in your directory.
This usually happens when UID_DOMAIN and FILESYSTEM_DOMAIN don't match
between the submit host and the compute node.  The only way I was able
to debug this was to go through the Starter's log files on the compute
node; there I saw that, due to a weird subdomain issue, Condor ignored
the UID_DOMAIN.  However, rather than falling back to running the job
in /var/spool/condor, it ran it in the user's home dir.
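
For what it's worth, here's a rough sketch of the kind of settings I
mean; the domain value below is just a placeholder for whatever is
right at your site, not something I know about your machines.  In the
condor_config on both the submit host and the WNs you'd want them to
agree, something like:

    UID_DOMAIN          = unl.edu
    FILESYSTEM_DOMAIN   = unl.edu
    # If the machines' hostnames don't actually fall inside that
    # domain, the execute side may still ignore the claimed
    # UID_DOMAIN unless you also set:
    TRUST_UID_DOMAIN    = True

You can check what each machine actually thinks it is using with:

    condor_config_val UID_DOMAIN
    condor_config_val FILESYSTEM_DOMAIN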

You might want to ask the admins of that node to send you the Starter's
log output for your job, as it should make the problem clear.  Condor
unfortunately does not seem to propagate the useful error information
back to the user in this case.
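
If you run the WNs yourself you can also dig this out directly; roughly
(assuming the stock log locations, which a VDT install may override):

    # on the execute node, e.g. valley003
    condor_config_val STARTER_LOG
    grep -i 'uid\|nobody\|permission' /var/log/condor/StarterLog*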

Also, since this is the globus universe there are a number of other ways
it can fail, but my experience with the globus universe is more limited.

Dave