[Condor-users] MPI - What the heck does this mean?



Hey everyone, I've been mucking around with the parallel universe, and I tried
the basic sleep example from the manual:

#############################################
##   submit description file for parallel program
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 2
queue
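
For reference, a variant of the same submit file that also writes a user log
and per-node output files might look like the sketch below. The file names
(sleep.log, sleep.out, sleep.err) are placeholders made up for illustration,
and $(NODE) is assumed to expand to the node number as the manual describes
for parallel universe jobs. The user log should record when each node starts
and exits, which may help explain the per-machine results.

#############################################
##   sketch: same job, plus a user log and per-node output
##   (file names are hypothetical placeholders)
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 2
log = sleep.log
output = sleep.out.$(NODE)
error = sleep.err.$(NODE)
queue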

Anyhow, after the job completed I got an email that stated the following:

From: condor
Message-Id: <200001011014.e01AESvB003893@xxxxxxxxxxxxxxxx>
To: condor@xxxxxxxx
Subject: [Condor] Condor Job 43.0

This is an automated email from the Condor system
on machine "panndaa.nmsu.edu".  Do not reply.

Your Condor-MPI job 43.0 has completed.

Here are the machines that ran your MPI job.
They are listed in the order they were started
in, which is the same as MPI_Comm_rank.

    Machine Name               Result
 ------------------------    -----------
         panndaa.nmsu.edu    exited normally with status 0
           gutti.nmsu.edu    was removed by the user

Have a nice day.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor@xxxxxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor

So I was like, interesting, but shouldn't *both* nodes exit with status 0?
Does anyone have any idea what's going on? Below is the local config file for
gutti. It's pretty much a run-of-the-mill modification of
condor_config.local.dedicated.resource.

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxx"
START           = True
SUSPEND         = False
CONTINUE        = True
PREEMPT         = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

Well, if anyone knows what's up or has run into this problem, let me know.
Also, it's weird: even when nothing is being executed, my machines stay in the
Claimed state. Here is the status output:

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

gutti.nmsu.ed LINUX       INTEL  Claimed    Idle       0.000   495[?????]
panndaa.nmsu. LINUX       INTEL  Claimed    Idle       1.110   503  0+00:03:36

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX        2     0       2         0       0          0

               Total        2     0       2         0       0          0

Thanks in advance,

Danny Nayar
New Mexico State University