Re: [HTCondor-users] Problems setting up condor on local node, jobs do not start



On Sep 13, 2013, at 6:39 AM, Alex Seeholzer <alex.seeholzer@xxxxxxx> wrote:

> hi condor-users
> 
> I am trying to set up condor 8.1.0 on a local ubuntu 12.04 cluster, and running into quite a few problems even on a single node setup with fairly standard config files. Here is my progression so far:
> 
> - ps -efwwww | grep condor_ gives
> condor   21958     1  0 13:05 ?        00:00:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
> root     21961 21958  0 13:05 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 124
> condor   21962 21958  0 13:05 ?        00:00:00 condor_collector -f
> condor   21963 21958  0 13:05 ?        00:00:00 condor_negotiator -f
> condor   21964 21958  0 13:05 ?        00:00:00 condor_schedd -f
> condor   21965 21958  0 13:05 ?        00:00:00 condor_startd -f
> 
> - condor_status returns nothing with the vanilla config files; I had to set ALLOW_WRITE = * to get any nodes to appear. Even setting the machine's own IP manually did not work. With ALLOW_WRITE = * I can continue, although this is not really satisfactory.

Before opening write access to the world, check /var/log/condor/CollectorLog for PERMISSION DENIED lines.
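
A minimal sketch of that check, assuming the collector log lives at the path above (it may be elsewhere if LOG is set differently in your configuration). The helper name denied_lines is made up for illustration; it is not an HTCondor command.

```shell
# Sketch: print every PERMISSION DENIED line from the collector log.
# The default path is the one mentioned above; pass a different path
# as the first argument if your logs live somewhere else.
denied_lines() {
    grep -F 'PERMISSION DENIED' "${1:-/var/log/condor/CollectorLog}"
}
```

Something like `denied_lines | tail -n 20` shows the most recent denials. Depending on file permissions you may need to run it as root or as the condor user.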

> - Submitting test jobs does not work: jobs are listed in condor_q as idle. I have 8 available nodes to run the job.
> - Running condor_q -analyze shows me that they have not been considered by the matchmaker; checking NegotiatorLog gives me
> 	condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from collector
> -	If I change
> 	ALLOW_NEGOTIATOR = $(CONDOR_HOST), $(IP_ADDRESS) -> ALLOW_NEGOTIATOR = *

Again, I'd look at the CollectorLog to see why your hosts are getting denied.

> 	jobs seem to get started but then I get:
> 	
> 	Error from slot3@mynodename: Failed to open 'myhomedir/testjob/first.job.10.2.out' as standard output: Permission denied (errno 13)
> 

What's the UID_DOMAIN on each host?  If the values differ between the worker node and the submit node, the job runs as user 'nobody', and 'nobody' usually cannot write to your home directory.
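
One way to compare the values, as a rough sketch: condor_config_val is the standard tool for querying a configuration macro, and the wrapper name show_uid_domain is made up for illustration. Run it on the submit node and on each worker node, then compare the output.

```shell
# Sketch: print this host's UID_DOMAIN as HTCondor sees it.
# show_uid_domain is a hypothetical wrapper; condor_config_val does the work.
show_uid_domain() {
    if command -v condor_config_val >/dev/null 2>&1; then
        condor_config_val UID_DOMAIN
    else
        echo "condor_config_val not found in PATH" >&2
        return 1
    fi
}
```

If the outputs differ, setting UID_DOMAIN to the same value on all hosts is the usual first thing to try.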

> Any ideas on how to fix this?
> Thanks, alex
> 
> Remark:
> I chose the dev 8.1.0 channel, since 8.0.2 still has python2.6 bindings which I could not provide in ubuntu 12.04 without further hassle. I went through further hassle, however, and this does not change the behaviour described above.

Although they're not posted on the website, HTCondor is built for Ubuntu 12 (linked against python2.7).  The nightlies are here:

http://submit-2.batlab.org/results/continuous.php

(Note that these are indeed nightlies, not releases.  I don't know where to find the release tarballs for Ubuntu 12.)

TimT -- any reason the Ubuntu release can't be posted on the website alongside Debian?

Brian