
Re: [Condor-users] Quick Start Vanilla Condor on Ubuntu 10.04



Yes, you read my problem correctly. I never did figure out what was going on; after a reboot, the issue went away.

Thanks again for all of your help.

Dan

On 11/10/2011 08:15 AM, Matthew Farrellee wrote:
Hopefully I read your problem correctly - jobs submitted from HOST that
run on HOST get held failing to open a /net/home/user file, all other
jobs (submitted from HOST or not) running on (HOST or not) succeed.

Setting FILESYSTEM_DOMAIN to the same value across nodes means that they
have a shared filesystem, which the jobs will use. Setting UID_DOMAIN to
the same value means that the same users exist across machines (same
name, uid, gids). If either of those things is not true, you can get some
non-obvious errors.
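
As a rough illustration of the point above, here is a minimal Python sketch that compares UID_DOMAIN and FILESYSTEM_DOMAIN across copies of each node's config text. The function names, the node names, and the example.org domain are made up for the example. The parser only reads plain NAME = value lines and does not expand $(MACRO) references, so the output of condor_config_val on each node is a better source for the effective values.

```python
def parse_condor_config(text):
    """Parse plain 'NAME = value' lines. Does NOT expand $(MACRO) references."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Config names are case-insensitive, so normalise to upper case.
        settings[name.strip().upper()] = value.strip()
    return settings


def mismatched_domains(node_configs, names=("UID_DOMAIN", "FILESYSTEM_DOMAIN")):
    """Return {setting: {node: value}} for each setting that differs between nodes."""
    problems = {}
    for name in names:
        values = {node: cfg.get(name) for node, cfg in node_configs.items()}
        if len(set(values.values())) > 1:
            problems[name] = values
    return problems


# Hypothetical usage: paste in the relevant config text from each machine.
configs = {
    "HOST": parse_condor_config("UID_DOMAIN = example.org\nFILESYSTEM_DOMAIN = example.org\n"),
    "node1": parse_condor_config("UID_DOMAIN = example.org\nFILESYSTEM_DOMAIN = node1.example.org\n"),
}
print(mismatched_domains(configs))
```

An empty result only says the literal settings agree; it does not prove the filesystem is really shared or that the accounts really match.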

You should verify the UID_DOMAIN is set correctly. Check the StartLog
and StarterLog.slot1 on HOST to see what user the job is being started
as (you may need STARTER_DEBUG=D_FULLDEBUG in config). See if that user
differs on the non-HOSTs.
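
One possible way to do the "does that user differ" comparison, sketched in Python: run user_identity() on each machine for the account named in the starter log, then compare the results by hand or with nodes_that_differ(). Both function names and the sample values are hypothetical, and the sketch only looks at the local account database as seen through the standard pwd module.

```python
import os
import pwd


def user_identity(username):
    """Return (name, uid, sorted gids) for an account on this machine."""
    entry = pwd.getpwnam(username)  # raises KeyError if the user is unknown here
    gids = sorted(set(os.getgrouplist(username, entry.pw_gid)))
    return (entry.pw_name, entry.pw_uid, gids)


def nodes_that_differ(identities, reference):
    """identities maps node name -> identity tuple; list nodes unlike the reference."""
    expected = identities[reference]
    return sorted(node for node, ident in identities.items() if ident != expected)


# Hypothetical values collected by running user_identity("user") on each node.
collected = {
    "HOST": ("user", 1000, [1000]),
    "node1": ("user", 1000, [1000]),
    "node2": ("user", 1001, [1000]),
}
print(nodes_that_differ(collected, "HOST"))
```

Any node listed in the output has a different name, uid, or group set than the reference node, which is the situation a shared UID_DOMAIN assumes does not happen.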

I'd normally think about rootsquash, but you said only HOST on HOST jobs
fail.

I guess you could also verify you have the privs to make/open that file
outside of condor.
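
A small Python sketch of that last check, assuming it is run on HOST as the same user the starter log says the job runs as. The function name is made up. It opens the path in append mode, so an existing file is not truncated, but a missing file will be created as a side effect.

```python
import errno


def try_open_for_output(path):
    """Try to open path for writing; return (True, None) or (False, errno name)."""
    try:
        # Append mode creates the file if needed without truncating it.
        with open(path, "a"):
            pass
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    return True, None


# Path taken from the hold reason in the thread.
print(try_open_for_output("/net/home/user/condor_test/simple.3.out"))
```

A result of (False, 'EACCES') outside Condor would correspond to the "Permission denied (errno 13)" in the hold reason and point at the filesystem or the account rather than at Condor itself.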

Best,


matt

On 11/10/2011 04:24 AM, Lukas Slebodnik wrote:
If you want to upgrade to a newer version of Condor using apt-get, you
could try installing Condor from the Condor Debian Repository managed by
the Condor project team. I don't know how compatible it is with Ubuntu,
but you can try it and share your experience.

Detailed information:
http://www.cs.wisc.edu/condor/debian/

Regards,
Lukas

On Wed, Nov 09, 2011 at 05:56:47PM -0500, Daniel Grollman wrote:
Hi Matt (and all),

Thanks for the response, it totally pointed me in the right
direction, which was the filesystem. As it's shared, I had to
change the UID_DOMAIN and FILESYSTEM_DOMAIN configuration
parameters, and it all worked.

Well, almost. I've three computers in my pool now, one host and two
submit/execute machines. If I submit jobs from either of the
non-host computers, they get farmed out across all three, and all is
dandy.

However, when I submit jobs from the host, they get farmed out, and
only those on the NON-host machines actually run. The others get
held with this message:

user@HOST:~/condor_test$ condor_q -analyze
-- Submitter: HOST :<127.0.1.1:35783> : HOST
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

---
012.003: Request is held.

Hold reason: Error from starter on slot1@HOST: Failed to open
'/net/home/user/condor_test/simple.3.out' as standard output:
Permission denied (errno 13)

I can resolve this by making those files world-writable, but that
doesn't seem correct. Thoughts?

Also, I'm using 7.2.4 because it's what came down via apt-get. I'll
look into upgrading.

Thanks again,

Dan

On 11/08/2011 10:21 PM, Matthew Farrellee wrote:
On 11/08/2011 06:27 PM, Daniel Grollman wrote:
Hello Condor-users,

Is there a quick start guide for getting condor up and running on a
small ubuntu 10.04 pool? I just want to run processes on other
machines' idle processors (vanilla universe).

Here's where I'm at if anyone can help:

2 identical (virtual) machines with fresh installs of Ubuntu 10.04
with Condor 7.2.4 installed via 'apt-get install condor'

At this point both machines have their own local condors, and I can
queue and run jobs, no problem.

I edited the /etc/condor/condor_config files thusly:

On machine 1:
CONDOR_HOST = [IP address of machine 2]
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *

On machine 2:
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *

After a reboot (?) condor_status on either machine shows me the slots
on both machines and if they're busy/idle/etc (yay!). However, they
still seem to have different queues. I.e., when I submit from machine 1,
I only see it in condor_q on machine 1, and it only runs on the cpu of
machine 1 (but I see the usage in condor_status on machine 2).

I imagine there's a configuration parameter I need to set somewhere,
but I don't know what. Help please?

Thanks,

Dan

You probably want ShouldTransferFiles = IF_NEEDED and WhenToTransferOutput
= ON_EXIT in your submit file.
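
For anyone unsure where those two settings go, here is a hedged Python sketch that writes out a vanilla-universe submit description containing them. The helper name, the executable name "simple", and the output naming are assumptions for illustration (the naming is chosen to resemble the simple.3.out file mentioned elsewhere in the thread). The submit commands themselves use the standard lower-case spellings should_transfer_files and when_to_transfer_output.

```python
def vanilla_submit_description(executable, n_jobs, prefix="simple"):
    """Build the text of a vanilla-universe submit description file."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # $(Process) expands to 0, 1, 2, ... for each queued job.
        f"output = {prefix}.$(Process).out",
        f"error = {prefix}.$(Process).err",
        f"log = {prefix}.log",
        # Transfer files only when the execute machine is not in the
        # submit machine's FILESYSTEM_DOMAIN.
        "should_transfer_files = IF_NEEDED",
        "when_to_transfer_output = ON_EXIT",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage: save the text to a file and pass that file to condor_submit.
print(vanilla_submit_description("simple", 4))
```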

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2281

7.2.4 is very old at this point. Can you upgrade?

Here are some instructions you can follow. They're for Fedora, but if
you pretend apt is yum and, with 7.2.4, you throw everything in
~condor/condor_config.local instead of /etc/condor/config.d, everything
should work.

http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/
http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
http://spinningmatt.wordpress.com/2011/07/04/getting-started-submitting-jobs-to-condor/

Best,


matt


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/


--
Dan Grollman
Robot Doctor
daniel.grollman@xxxxxxxxx
http://www.vecna.com/robotics

Cambridge Research Laboratory
Vecna Technologies, Inc.
36 Cambridge Park Drive
Cambridge, MA 02140
Phone: (617) 864-0636
Fax: (617) 864-0638
