[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] mpi job stuck as idle



Hi,

Anybody help?

I have stuck at this step. All I see on the web is about setting the hostname and policies. I have modified them. Donât know why it doesnât work

 

Regards,

Mahmood

 

From: Mahmood Naderan
Sent: Friday, January 19, 2018 4:19 PM
To: HTCondor-Users Mail List; Jason Patton
Subject: Re: [HTCondor-users] mpi job stuck as idle

 

Jason,

 

>Assuming you are running a recent version of condor, "condor_q" will

>not show jobs from all users, but "condor_status -schedd" will show
>totals from all users. Does the output of "condor_q -all" show more
>jobs?

 

No, Please see below

 

 

[root@rocks7 examples]# condor_status -schedd

Name                     Machine                  RunningJobs   IdleJobs   HeldJobs

rocks7.vbtestcluster.com rocks7.vbtestcluster.com           0          2          0

                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

                    
               Total                 0                  2                  0
[root@rocks7 examples]# condor_q -all


-- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:48687> @ 01/19/18 07:45:13
OWNER   BATCH_NAME                      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
mahmood CMD: /opt/openmpi/bin/mpirun   1/17 03:04      _      _      1      1 5.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended




 

 

 

I followed the steps as described in the manual and uncommented the policy. The job is still in idle state. Should I kill it and resubmit or I  missed some configurations?

 

 

 

[root@rocks7 examples]# cat condor_config.local.dedicated.resource
######################################################################
##
##  condor_config.local.dedicated.resource
##
##  This is the default local configuration file for any resources
##  that are going to be configured as dedicated resources in your
##  Condor pool.  If you are going to use Condor's dedicated MPI
##  scheduling, you must configure some of your machines as dedicated
##  resources, using the settings in this file.
##
##  PLEASE READ the discussion on "Configuring Condor for Dedicated
##  Scheduling" in the "Setting up Condor for Special Environments"
##  section of the Condor Manual for more details.
##
##  You should copy this file to the appropriate location and
##  customize it for your needs.  The file is divided into three main
##  parts: settings you MUST customize, settings regarding the policy
##  of running jobs on your dedicated resources (you must select a
##  policy and uncomment the corresponding expressions), and settings
##  you should leave alone, but that must be present for dedicated
##  scheduling to work.  Settings that are defined here MUST BE
##  DEFINED, since they have no default value.
##
######################################################################


######################################################################
######################################################################
##  Settings you MUST customize!
######################################################################
######################################################################

##  What is the name of the dedicated scheduler for this resource?
##  You MUST fill in the correct full hostname where you're running
##  the dedicated scheduler, and where users will submit their
##  dedicated jobs.  The "DedicateScheduler@" part should not be
##  changed, ONLY the hostname.
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx"


######################################################################
######################################################################
##  Policy Settings (You MUST choose a policy and uncomment it)
######################################################################
######################################################################

##  There are three basic options for the policy on dedicated
##  resources:
##  1) Only run dedicated jobs
##  2) Always run jobs, but prefer dedicated ones
##  3) Always run dedicated jobs, but only allow non-dedicated jobs to
##     run on an opportunistic basis.   
##  You MUST uncomment the set of policy expressions you want to use
##  at your site.

##--------------------------------------------------------------------
## 1) Only run dedicated jobs
##--------------------------------------------------------------------
#START        = Scheduler =?= $(DedicatedScheduler)
#SUSPEND    = False
#CONTINUE    = True
#PREEMPT    = False
#KILL        = False
#WANT_SUSPEND    = False
#WANT_VACATE    = False
#RANK        = Scheduler =?= $(DedicatedScheduler)

##--------------------------------------------------------------------
## 2) Always run jobs, but prefer dedicated ones
##--------------------------------------------------------------------
#START        = True
#SUSPEND    = False
#CONTINUE    = True
#PREEMPT    = False
#KILL        = False
#WANT_SUSPEND    = False
#WANT_VACATE    = False
#RANK        = Scheduler =?= $(DedicatedScheduler)

##--------------------------------------------------------------------
## 3) Always run dedicated jobs, but only allow non-dedicated jobs to
##    run on an opportunistic basis.   
##--------------------------------------------------------------------
##  Allowing both dedicated and opportunistic jobs on your resources
##  requires that you have an opportunistic policy already defined.
##  These are the only settings that need to be modified from your
##  existing policy expressions to allow dedicated jobs to always run
##  without suspending, or ever being preempted (either from activity
##  on the machine, or other jobs in the system).

SUSPEND    = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
PREEMPT    = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
RANK_FACTOR    = 1000000
RANK    = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR)) + $(RANK)
START    = (Scheduler =?= $(DedicatedScheduler)) || ($(START))

##  Note: For everything to work, you MUST set RANK_FACTOR to be a
##  larger value than the maximum value your existing rank _expression_
##  could possibly evaluate to.  RANK is just a floating point value,
##  so there's no harm in having a value that's very large.


######################################################################
######################################################################
##  Settings you should leave alone, but that must be defined
######################################################################
######################################################################

##  Path to the special version of rsh that's required to spawn MPI
##  jobs under Condor.  WARNING: This is not a replacement for rsh,
##  and does NOT work for interactive use.  Do not use it directly!
MPI_CONDOR_RSH_PATH = $(LIBEXEC)

##  Path to OpenSSH server binary
##  Condor uses this to establish a private SSH connection between execute
##  machines. It is usually in /usr/sbin, but may be in /usr/local/sbin
CONDOR_SSHD = /usr/sbin/sshd

##  Path to OpenSSH keypair generator.
##  Condor uses this to establish a private SSH connection between execute
##  machines. It is usually in /usr/bin, but may be in /usr/local/bin
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen

##  This setting puts the DedicatedScheduler attribute, defined above,
##  into your machine's classad.  This way, the dedicated scheduler
##  (and you) can identify which machines are configured as dedicated
##  resources.
##  Note: as of 8.4.1 this setting is automatic
#STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
[root@rocks7 examples]# rocks sync host condor rocks7
[root@rocks7 examples]# condor_status -af:h Machine DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
Error:  Parse error of: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
[root@rocks7 examples]# condor_status -af:h Machine rocks7.vbtestcluster.com
Machine           rocks7.vbtestcluster.com
compute-0-0.local undefined               
compute-0-0.local undefined               
[root@rocks7 examples]# condor_q


-- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:48687> @ 01/19/18 05:22:37
OWNER   BATCH_NAME                      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
mahmood CMD: /opt/openmpi/bin/mpirun   1/17 03:04      _      _      1      1 5.0

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

[root@rocks7 examples]#

 

 

 

 

Any thought?

 

 

 

 

 

Regards,
Mahmood