
Re: [Condor-users] Quick Start Vanilla Condor on Ubuntu 10.04



Hopefully I read your problem correctly: jobs submitted from HOST that also run on HOST get held because they fail to open a file under /net/home/user, while every other combination of submit machine and execute machine succeeds.

Setting FILESYSTEM_DOMAIN to the same value across nodes tells Condor that they have a shared filesystem, which the jobs will use. Setting UID_DOMAIN to the same value tells it that the same users exist across machines (same name, uid, gids). If either of those things is not true you can get some non-obvious errors.
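A quick way to compare machines is to run something like the following on each one and diff the output by eye. This is only a sketch: JOB_USER is a made-up environment variable for the account to check (it defaults to whoever runs the script), and the Condor part is skipped if condor_config_val is not on the PATH.

```shell
#!/bin/sh
# Run on every machine in the pool and compare the output line by line.

# user_ids NAME: print "NAME uid=<uid> gid=<gid>" for a local account.
user_ids() {
    printf '%s uid=%s gid=%s\n' "$1" "$(id -u "$1")" "$(id -g "$1")"
}

# JOB_USER is a hypothetical variable, not a Condor setting.
user_ids "${JOB_USER:-$(id -un)}"

# Print the two domain settings if the Condor tools are installed.
if command -v condor_config_val >/dev/null 2>&1; then
    for knob in UID_DOMAIN FILESYSTEM_DOMAIN; do
        printf '%s = %s\n' "$knob" "$(condor_config_val "$knob")"
    done
fi
```

If the uid or gid lines differ between machines, a shared UID_DOMAIN is promising something that is not true.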

You should verify that UID_DOMAIN is set correctly. Check the StartLog and StarterLog.slot1 on HOST to see what user the job is being started as (you may need STARTER_DEBUG = D_FULLDEBUG in the config). See if that user differs on the non-HOSTs.

I'd normally suspect NFS root squashing, but you said only HOST-on-HOST jobs fail.

I guess you could also verify that you have the privileges to create/open that file outside of Condor.
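For that last check, a small helper along these lines would do. It is a sketch, and check_writable is just a name I made up. Note that it creates the file if it does not exist yet, but it never truncates an existing one.

```shell
#!/bin/sh
# check_writable FILE: show the permissions involved and report whether
# the current user can create or append to FILE.
check_writable() {
    f=$1
    ls -ld "$(dirname "$f")"
    if [ -e "$f" ]; then
        ls -l "$f"
    fi
    # Open for append inside a subshell, so a failed redirection cannot
    # abort the calling shell.
    if ( : >> "$f" ) 2>/dev/null; then
        echo "writable: $f"
    else
        echo "NOT writable: $f"
    fi
}
```

Run it on HOST as the job's owner, e.g. check_writable /net/home/user/condor_test/simple.3.out, and then again as whatever user the StarterLog says the job ran as.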

Best,


matt

On 11/10/2011 04:24 AM, Lukas Slebodnik wrote:
If you want to upgrade to a newer version of Condor using apt-get, you could
try installing it from the Condor Debian Repository managed by the Condor
project team. I don't know how compatible it is with Ubuntu, but you can try
it and then share your experience.

Details:
http://www.cs.wisc.edu/condor/debian/

Regards,
Lukas

On Wed, Nov 09, 2011 at 05:56:47PM -0500, Daniel Grollman wrote:
Hi Matt (and all),

	Thanks for the response, it totally pointed me in the right
direction, which was the filesystem.  As it's shared, I had to
change the UID_DOMAIN and FILESYSTEM_DOMAIN configuration
parameters, and it all worked.

Well, almost.  I've three computers in my pool now, one host and two
submit/execute machines.  If I submit jobs from either of the
non-host computers, they get farmed out across all three, and all is
dandy.

However, when I submit jobs from the host, they get farmed out, and
only those on the NON-host machines actually run.  The others get
held with this message:

user@HOST:~/condor_test$ condor_q -analyze
-- Submitter: HOST :<127.0.1.1:35783>  : HOST
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

---
012.003:  Request is held.

Hold reason: Error from starter on slot1@HOST: Failed to open
'/net/home/user/condor_test/simple.3.out' as standard output:
Permission denied (errno 13)

I can resolve this by making those files world-writable, but that
doesn't seem correct.  Thoughts?

Also, I'm using 7.2.4 because it's what came down via apt-get.  I'll
look into upgrading.

Thanks again,

Dan

On 11/08/2011 10:21 PM, Matthew Farrellee wrote:
On 11/08/2011 06:27 PM, Daniel Grollman wrote:
Hello Condor-users,

Is there a quick start guide for getting condor up and running on a
small Ubuntu 10.04 pool? I just want to run processes on other machines'
idle processors (vanilla universe).

Here's where I'm at if anyone can help:

2 identical (virtual) machines with fresh installs of Ubuntu 10.04 with
Condor 7.2.4 installed via 'apt-get install condor'

At this point both machines have their own local condors, and I can
queue and run jobs, no problem.

I edited the /etc/condor/condor_config files thusly:

On machine 1:
CONDOR_HOST = [IP address of machine 2]
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *

On machine 2:
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *

After a reboot (?) condor_status on either machine shows me the slots on
both machines and if they're busy/idle/etc (yay!). However, they still
seem to have different queues. I.e., when I submit from machine 1, I only
see it in condor_q on machine 1, and it only runs on the cpu of machine
1 (but I see the usage in condor_status on machine 2).

I imagine there's a configuration parameter I need to set somewhere, but
I don't know what. Help please?

Thanks,

Dan

You probably want ShouldTransferFiles = IF_NEEDED and WhenToTransferOutput
= ON_EXIT in your submit file.
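As a rough sketch, this writes a submit file with those two settings. The executable name (simple), the file names and the queue count are placeholders I picked for illustration, not something taken from your setup.

```shell
#!/bin/sh
# Write an example submit description file into the current directory.
# The quoted 'EOF' keeps the shell from expanding $(Process).
cat > simple.sub <<'EOF'
# "simple" is a placeholder executable name
universe                = vanilla
executable              = simple
output                  = simple.$(Process).out
error                   = simple.$(Process).err
log                     = simple.log
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue 4
EOF
```

Then condor_submit simple.sub as usual. With IF_NEEDED, Condor only transfers files when the submit and execute machines are not in the same FILESYSTEM_DOMAIN.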

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2281

7.2.4 is very old at this point. Can you upgrade?

Here are some instructions you can follow, they're for Fedora, but if
you pretend apt is yum and, with 7.2.4, you throw everything in
~condor/condor_config.local instead of /etc/condor/config.d, everything
should work.

http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/


http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/


http://spinningmatt.wordpress.com/2011/07/04/getting-started-submitting-jobs-to-condor/


Best,


matt


--
Dan Grollman
Robot Doctor
daniel.grollman@xxxxxxxxx
http://www.vecna.com/robotics

Cambridge Research Laboratory
Vecna Technologies, Inc.
36 Cambridge Park Drive
Cambridge, MA 02140
Phone: (617) 864-0636
Fax: (617) 864-0638

Better Technology, Better World (TM)