
Re: [HTCondor-users] Jobs do not execute, they sit idle in the queue indefinitely



Adding STARTD to the DAEMON_LIST on the gatekeeper node caused all queued
jobs to be executed on the gatekeeper.
It seems the gatekeeper machine cannot see the execute-only nodes.
I'm not sure what I have missed in the configuration to cause this
behaviour. Network-wise they all see each other just fine, with hostnames
resolved via /etc/hosts entries.
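
Given the SchedLog warning quoted below (forward resolution of
localhost.localdomain not matching 10.11.114.220), one thing worth ruling
out is a node's own name being listed on the loopback line of /etc/hosts.
I have not confirmed that this is the cause. Below is a minimal Python
sketch of such a check. The sample file contents and the function names
are hypothetical, and it assumes the first matching line in the hosts file
wins, which may not hold for every resolver configuration.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname/alias in hosts-file text to the first IP listed for it."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        try:
            ipaddress.ip_address(ip)  # skip lines that do not start with an IP
        except ValueError:
            continue
        for name in names:
            table.setdefault(name, ip)  # assumption: first entry wins
    return table


def loopback_names(text):
    """Return, sorted, every name that resolves to a loopback address."""
    return sorted(
        name
        for name, ip in parse_hosts(text).items()
        if ipaddress.ip_address(ip).is_loopback
    )


# Hypothetical file contents for illustration, not copied from the real nodes.
SAMPLE = """\
127.0.0.1   localhost localhost.localdomain node00
10.11.114.220 node00
10.11.114.221 node01
"""

if __name__ == "__main__":
    # Any cluster node name printed here would resolve to loopback first.
    print(loopback_names(SAMPLE))
```

To check a real node, pass open("/etc/hosts").read() instead of SAMPLE and
look for any cluster node name in the result.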

Dan

On 05/17/2013 02:21 PM, Dan Shea wrote:
> Hi,
>
> I'm attempting to configure a test condor cluster.  I have 10 machines
> all running CentOS 6.4.
> They are not configured with DNS records; they all have /etc/hosts files
> that contain the relevant IP addresses for each node in the cluster.
>
> I've configured the stable repo and used that to install the condor
> software.
> I then modified the /etc/condor/condor_config so that the subnet these
> machines reside on was enabled for write access.
>
> A quick test showed everything was working and jobs would execute as
> expected.
> However, this was with the following condor_config.local entry on each
> of the 10 nodes
>
> DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
>
> I am now attempting to configure one node as a gatekeeper:
> DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
>
> And the other 9 nodes as execute-only nodes:
> DAEMON_LIST = MASTER, STARTD
>
> After restarting services I now no longer see jobs executing. They sit
> idle in the queue indefinitely.
>
> [root@node00 condor]# condor_q
>
>
> -- Submitter: node00 : <10.11.114.220:44213> : node00
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>    2.0   mfs             5/17 13:41   0+00:00:00 I  0   0.0  myprog Example.2.0
>
> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
>
> condor_q -analyze is not much help
>
> -- Submitter: node00 : <10.11.114.220:44213> : node00
> ---
> 002.000:  Request has not yet been considered by the matchmaker.
>
> I did notice the following warning in the SchedLog
>
> SchedLog:05/17/13 13:41:21 (pid:9037) WARNING: forward resolution of
> localhost.localdomain doesn't match 10.11.114.220!
>
> I also found this entry, which makes no sense to me since startd is not
> set up to run on node00 in the local config.
>
> SchedLog:05/17/13 13:56:21 (pid:9037) Can't find address for startd node00
>
> The test job itself is from the tutorial here:
> http://research.cs.wisc.edu/htcondor/tutorials/scotland-admin-tutorial-2003-10-23/scotland-admin-tutorial-2003-10-23.DEMO.html
>
> Any assistance pointing me in the right direction is greatly appreciated.
>
> Regards,
> Dan Shea
>


-- 
Dan Shea - daniel_shea2@xxxxxxxxxxxxxxx
Senior Systems Administrator, West Quad Computing Group
Harvard Medical School
"Charlie was a chemist, But Charlie is no more. For what he thought was H2O, Was H2SO4."