
Re: [HTCondor-users] mpi job stuck as idle



It looks like you ran:

condor_status -af:h Machine DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx

What if you run:

condor_status -af:h Machine DedicatedScheduler

This will show the value of DedicatedScheduler (and Machine) for each
slot on each execute machine.
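
The arguments to -af are ClassAd attribute names, not values, which is
why passing "DedicatedScheduler@<hostname>" gave a parse error. If the
pool has many slots, a small filter can make the unconfigured ones stand
out. The sketch below is only an illustration: "undefined_slots" is a
made-up helper name, and it assumes the two-column output with a header
line that -af:h prints, where a slot lacking the attribute shows up as
"undefined".

```shell
# Hypothetical helper (not part of HTCondor): print each machine that has
# at least one slot whose DedicatedScheduler attribute is undefined.
# Input on stdin: the output of
#   condor_status -af:h Machine DedicatedScheduler
undefined_slots() {
    # Skip the header line (NR > 1), keep rows whose second column is
    # the literal word "undefined", and print each machine name once.
    awk 'NR > 1 && $2 == "undefined" { print $1 }' | sort -u
}

# Usage on a live pool (commented out so the snippet can be sourced anywhere):
#   condor_status -af:h Machine DedicatedScheduler | undefined_slots
```

Any machine this prints has not picked up the DedicatedScheduler setting
in its slot ads, so that is where I would look at the local config next.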


Jason

On Mon, Jan 22, 2018 at 3:22 AM, mahmood n <nt_mahmood@xxxxxxxxx> wrote:
> Hi,
>
> Can anybody help?
>
> I am stuck at this step. All I see on the web is about setting the
> hostname and policies. I have modified them. I don't know why it doesn't work.
>
>
>
> Regards,
>
> Mahmood
>
>
>
> From: Mahmood Naderan
> Sent: Friday, January 19, 2018 4:19 PM
> To: HTCondor-Users Mail List; Jason Patton
> Subject: Re: [HTCondor-users] mpi job stuck as idle
>
>
>
> Jason,
>
>
>
>>Assuming you are running a recent version of condor, "condor_q" will
>>not show jobs from all users, but "condor_status -schedd" will show
>>totals from all users. Does the output of "condor_q -all" show more
>>jobs?
>
>
>
> No, please see below.
>
>
>
>
>
> [root@rocks7 examples]# condor_status -schedd
>
> Name                     Machine                  RunningJobs   IdleJobs   HeldJobs
>
> rocks7.vbtestcluster.com rocks7.vbtestcluster.com           0          2          0
>
>                       TotalRunningJobs      TotalIdleJobs      TotalHeldJobs
>
>                Total                 0                  2                  0
> [root@rocks7 examples]# condor_q -all
>
>
> -- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:48687> @ 01/19/18 07:45:13
> OWNER   BATCH_NAME                      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
> mahmood CMD: /opt/openmpi/bin/mpirun   1/17 03:04      _      _      1      1 5.0
>
> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
>
> I followed the steps described in the manual and uncommented the policy.
> The job is still in the idle state. Should I kill it and resubmit, or did I
> miss some configuration?
>
> [root@rocks7 examples]# cat condor_config.local.dedicated.resource
> ######################################################################
> ##
> ##  condor_config.local.dedicated.resource
> ##
> ##  This is the default local configuration file for any resources
> ##  that are going to be configured as dedicated resources in your
> ##  Condor pool.  If you are going to use Condor's dedicated MPI
> ##  scheduling, you must configure some of your machines as dedicated
> ##  resources, using the settings in this file.
> ##
> ##  PLEASE READ the discussion on "Configuring Condor for Dedicated
> ##  Scheduling" in the "Setting up Condor for Special Environments"
> ##  section of the Condor Manual for more details.
> ##
> ##  You should copy this file to the appropriate location and
> ##  customize it for your needs.  The file is divided into three main
> ##  parts: settings you MUST customize, settings regarding the policy
> ##  of running jobs on your dedicated resources (you must select a
> ##  policy and uncomment the corresponding expressions), and settings
> ##  you should leave alone, but that must be present for dedicated
> ##  scheduling to work.  Settings that are defined here MUST BE
> ##  DEFINED, since they have no default value.
> ##
> ######################################################################
>
>
> ######################################################################
> ######################################################################
> ##  Settings you MUST customize!
> ######################################################################
> ######################################################################
>
> ##  What is the name of the dedicated scheduler for this resource?
> ##  You MUST fill in the correct full hostname where you're running
> ##  the dedicated scheduler, and where users will submit their
> ##  dedicated jobs.  The "DedicatedScheduler@" part should not be
> ##  changed, ONLY the hostname.
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx"
>
>
> ######################################################################
> ######################################################################
> ##  Policy Settings (You MUST choose a policy and uncomment it)
> ######################################################################
> ######################################################################
>
> ##  There are three basic options for the policy on dedicated
> ##  resources:
> ##  1) Only run dedicated jobs
> ##  2) Always run jobs, but prefer dedicated ones
> ##  3) Always run dedicated jobs, but only allow non-dedicated jobs to
> ##     run on an opportunistic basis.
> ##  You MUST uncomment the set of policy expressions you want to use
> ##  at your site.
>
> ##--------------------------------------------------------------------
> ## 1) Only run dedicated jobs
> ##--------------------------------------------------------------------
> #START        = Scheduler =?= $(DedicatedScheduler)
> #SUSPEND    = False
> #CONTINUE    = True
> #PREEMPT    = False
> #KILL        = False
> #WANT_SUSPEND    = False
> #WANT_VACATE    = False
> #RANK        = Scheduler =?= $(DedicatedScheduler)
>
> ##--------------------------------------------------------------------
> ## 2) Always run jobs, but prefer dedicated ones
> ##--------------------------------------------------------------------
> #START        = True
> #SUSPEND    = False
> #CONTINUE    = True
> #PREEMPT    = False
> #KILL        = False
> #WANT_SUSPEND    = False
> #WANT_VACATE    = False
> #RANK        = Scheduler =?= $(DedicatedScheduler)
>
> ##--------------------------------------------------------------------
> ## 3) Always run dedicated jobs, but only allow non-dedicated jobs to
> ##    run on an opportunistic basis.
> ##--------------------------------------------------------------------
> ##  Allowing both dedicated and opportunistic jobs on your resources
> ##  requires that you have an opportunistic policy already defined.
> ##  These are the only settings that need to be modified from your
> ##  existing policy expressions to allow dedicated jobs to always run
> ##  without suspending, or ever being preempted (either from activity
> ##  on the machine, or other jobs in the system).
>
> SUSPEND    = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
> PREEMPT    = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
> RANK_FACTOR    = 1000000
> RANK    = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR)) + $(RANK)
> START    = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
>
> ##  Note: For everything to work, you MUST set RANK_FACTOR to be a
> ##  larger value than the maximum value your existing rank expression
> ##  could possibly evaluate to.  RANK is just a floating point value,
> ##  so there's no harm in having a value that's very large.
>
>
> ######################################################################
> ######################################################################
> ##  Settings you should leave alone, but that must be defined
> ######################################################################
> ######################################################################
>
> ##  Path to the special version of rsh that's required to spawn MPI
> ##  jobs under Condor.  WARNING: This is not a replacement for rsh,
> ##  and does NOT work for interactive use.  Do not use it directly!
> MPI_CONDOR_RSH_PATH = $(LIBEXEC)
>
> ##  Path to OpenSSH server binary
> ##  Condor uses this to establish a private SSH connection between execute
> ##  machines. It is usually in /usr/sbin, but may be in /usr/local/sbin
> CONDOR_SSHD = /usr/sbin/sshd
>
> ##  Path to OpenSSH keypair generator.
> ##  Condor uses this to establish a private SSH connection between execute
> ##  machines. It is usually in /usr/bin, but may be in /usr/local/bin
> CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
>
> ##  This setting puts the DedicatedScheduler attribute, defined above,
> ##  into your machine's classad.  This way, the dedicated scheduler
> ##  (and you) can identify which machines are configured as dedicated
> ##  resources.
> ##  Note: as of 8.4.1 this setting is automatic
> #STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
> [root@rocks7 examples]# rocks sync host condor rocks7
> [root@rocks7 examples]# condor_status -af:h Machine DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> Error:  Parse error of: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> [root@rocks7 examples]# condor_status -af:h Machine rocks7.vbtestcluster.com
> Machine           rocks7.vbtestcluster.com
> compute-0-0.local undefined
> compute-0-0.local undefined
> [root@rocks7 examples]# condor_q
>
>
> -- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:48687> @ 01/19/18 05:22:37
> OWNER   BATCH_NAME                      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
> mahmood CMD: /opt/openmpi/bin/mpirun   1/17 03:04      _      _      1      1 5.0
>
> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
>
> [root@rocks7 examples]#
>
> Any thoughts?
>
> Regards,
> Mahmood