[Condor-users] jobs stuck in queue



Hello

I have installed Condor 7.6.0 on one master and two execute nodes, with the
following configuration:

*master :*
UID_DOMAIN = internal.domain
FILESYSTEM_DOMAIN = internal.domain
SEC_DEFAULT_NEGOTIATION = OPTIONAL
ALLOW_READ = $(FULL_HOSTNAME),@172.17.8.*
ALLOW_WRITE = $(FULL_HOSTNAME),@172.17.8.*
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST),$(FULL_HOSTNAME)
ENABLE_RUNTIME_CONFIG = True
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/config.d
SETTABLE_ATTRS_CONFIG = *
USE_NFS         = True
DEFAULT_DOMAIN_NAME = internal.domain
TRUST_UID_DOMAIN = True
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
SOFT_UID_DOMAIN = TRUE
START = TRUE


*nodes:*
CONDOR_HOST = master
UID_DOMAIN = internal.domain
FILESYSTEM_DOMAIN = internal.domain
SEC_DEFAULT_NEGOTIATION = OPTIONAL
ALLOW_READ = $(CONDOR_HOST),172.17.8.*
ALLOW_WRITE = $(CONDOR_HOST),172.17.8.*
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST),$(FULL_HOSTNAME)
ENABLE_RUNTIME_CONFIG = True
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/config.d
SETTABLE_ATTRS_CONFIG = *
USE_NFS         = True
DEFAULT_DOMAIN_NAME = internal.domain
ALLOW_DAEMON = *@$(CONDOR_HOST)
SOFT_UID_DOMAIN = TRUE
START = TRUE
TRUST_UID_DOMAIN = TRUE
STARTD_EXPRS=$(STARTD_EXPRS), DedicatedScheduler, ParallelSchedulingGroup
SCHEDD_NAME = $(CONDOR_HOST)



When I submit a simple job like this:

###############################
Error           = err-$(cluster).log
Output          = out-$(cluster).log
Log             = log-$(cluster).log

cmd             = /bin/cat
arguments       = /proc/cpuinfo

Queue
###############################

It runs fine. But when I submit a slightly more complicated job like this:

===============================
universe        = parallel
Error           = err-$(cluster).log
Output          = out-$(cluster).log
Log             = log-$(cluster).log

executable      = /usr/bin/mpirun
arguments       = -np 8 -host node-01,node-02 /home/user/hw

machine_count   = 2

Queue
===============================

The job stays in the idle state:

-- Submitter: master.internal.domain : <172.17.8.121:58829> : master.internal.domain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 33.0   user        8/19 16:48   0+00:00:00 I  0   0.1  mpirun -np 8 -host


"/home/user/hw" is just a simple MPI hello world program.

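For reference: as far as I understand the manual, parallel universe jobs are
only matched to machines that advertise a DedicatedScheduler attribute naming
the schedd the job was submitted to. The node configuration above lists
DedicatedScheduler in STARTD_EXPRS, but no definition of it is shown. A minimal
sketch of what the execute nodes might need, assuming the schedd name is
master.internal.domain (taken from the Submitter line above, not verified):

===============================
# Assumption: the dedicated schedd is the one on the master, as reported
# in the Submitter line of the queue listing.
DedicatedScheduler = "DedicatedScheduler@master.internal.domain"

# Advertise the attribute in the machine ClassAd (already present above).
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

# Prefer jobs coming from the dedicated scheduler.
RANK = Scheduler =?= $(DedicatedScheduler)
===============================

This is untested on this pool. After such a change, running condor_reconfig on
the nodes should make the startd re-read its configuration, and
"condor_q -analyze 33.0" may give a hint about why the job is not matching.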

Any tips on what may (or may not) be going on are very, very welcome.

TIA