[Condor-users] job hanging...PLEASE HELP!!



Hello,
I'm trying to run an MPI job on my Windows grid.  The head node is now a Windows machine,
and all nodes are Windows machines. The node I'm submitting from is also a "Dedicated Scheduler"
as defined in:  http://www.cs.wisc.edu/condor/manual/v6.6.5/3_10Setting_Up.html#sec:Config-Dedicated-Jobs
Everything works fine up to a point: the job leaves the job queue and lands on one of the
nodes for execution, but then it just stays there and never finishes. The node stays "Busy":

vm2@xxxxxxxxx WINNT51     INTEL  Claimed    Busy       1.060   255 0+00:03:41

There's nothing in the log, error, or output files.
Here is what I do ... PLEASE HELP!!

Jon

> qsub mpi.sub

======
mpi.sub
======
universe = MPI
executable = runMPIHello.bat
log = logfile
output = outfile
error = errfile
machine_count = 2
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
getenv = true
queue

=====
runMPIHello.bat
=====
"C:\Program Files\MPICH\mpd\bin\mpirun" -np 2 -machinefile "C:\mpiJava\examples\simple\machinefile" "C:\mpiJava\examples\simple\runHello.bat"


=====
runHello.bat
=====
java -Djava.library.path=C:\WINDOWS\SYSTEM32 -cp .;c:/mpiJava/lib/classes Hello

================
addition to condor_config
================
######################################################################
######################################################################
##  Settings you MUST customize!
######################################################################
######################################################################
 
##  What is the name of the dedicated scheduler for this resource?
##  You MUST fill in the correct full hostname where you're running
##  the dedicated scheduler, and where users will submit their
##  dedicated jobs.  The "DedicatedScheduler@" part should not be
##  changed, ONLY the hostname.
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxx"
 
 
 
######################################################################
######################################################################
##  Policy Settings (You MUST choose a policy and uncomment it)
######################################################################
######################################################################
 
##  There are three basic options for the policy on dedicated
##  resources:
##  1) Only run dedicated jobs
##  2) Always run jobs, but prefer dedicated ones
##  3) Always run dedicated jobs, but only allow non-dedicated jobs to
##     run on an opportunistic basis.  
##  You MUST uncomment the set of policy expressions you want to use
##  at your site.
 
##--------------------------------------------------------------------
## 1) Only run dedicated jobs
##--------------------------------------------------------------------
#START  = Scheduler =?= $(DedicatedScheduler)
#SUSPEND = False
#CONTINUE = True
#PREEMPT = False
#KILL  = False
#WANT_SUSPEND = False
#WANT_VACATE = False
#RANK  = Scheduler =?= $(DedicatedScheduler)
 
##--------------------------------------------------------------------
## 2) Always run jobs, but prefer dedicated ones
##--------------------------------------------------------------------
START  = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL  = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK  = 200000
 
##--------------------------------------------------------------------
## 3) Always run dedicated jobs, but only allow non-dedicated jobs to
##    run on an opportunistic basis.  
##--------------------------------------------------------------------
##  Allowing both dedicated and opportunistic jobs on your resources
##  requires that you have an opportunistic policy already defined.
##  These are the only settings that need to be modified from your
##  existing policy expressions to allow dedicated jobs to always run
##  without suspending, or ever being preempted (either from activity
##  on the machine, or other jobs in the system).
 
#SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
#PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
#RANK_FACTOR = 1000000
#RANK = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR)) + $(RANK)
#START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
 
##  Note: For everything to work, you MUST set RANK_FACTOR to be a
##  larger value than the maximum value your existing rank expression
##  could possibly evaluate to.  RANK is just a floating point value,
##  so there's no harm in having a value that's very large.
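##
##  Illustration only (not part of the stock config file; the numbers
##  are assumed for the sake of example).  Suppose your existing RANK
##  expression can never evaluate higher than 1000.  With
##      RANK_FACTOR = 1000000
##  the option 3 RANK expression above would evaluate roughly as:
##      dedicated job:      (1 * 1000000) + <old rank>  >= 1000000
##      non-dedicated job:  (0 * 1000000) + <old rank>  <= 1000
##  so a job from the dedicated scheduler should always outrank any
##  other job, whatever the old rank expression returns.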
 

######################################################################
######################################################################
##  Settings you should leave alone, but that must be defined
######################################################################
######################################################################
 
##  Path to the special version of rsh that's required to spawn MPI
##  jobs under Condor.  WARNING: This is not a replacement for rsh,
##  and does NOT work for interactive use.  Do not use it directly!
MPI_CONDOR_RSH_PATH = $(SBIN)
 
##  This setting puts the DedicatedScheduler attribute, defined above,
##  into your machine's classad.  This way, the dedicated scheduler
##  (and you) can identify which machines are configured as dedicated
##  resources. 
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler