
Re: [HTCondor-users] Job Submission in Parallel Universe



Hello all, thanks for your great help.
I think I forgot to mention that I'm running jobs on a personal condor.
I haven't set up my own cluster yet, so I just want to know whether this job can run on a personal condor.
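I would guess the dedicated-scheduler settings discussed further down would all have to go into that single machine's condor_config.local, something like the sketch below, where my.host.name stands in for the machine's real full hostname. Is that right?

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
DedicatedScheduler = "DedicatedScheduler@my.host.name"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK = Scheduler =?= $(DedicatedScheduler)
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE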


Greetings.

On 04/09/2013 12:13 AM, Andrew Kuelbs wrote:

 

I am running into an issue with my parallel universe jobs as well. I have just installed Condor as the instructions mentioned here describe, with a few other alterations to customize my environment.

I am running RHEL 6.3 Server 64-bit on my master and compute nodes (about 50). I am running Condor 7.8.7. I have tried a few parallel MPI scripts, including the one mentioned earlier in this thread.

 

I get the following error for that job in /var/log/condor/ShadowLog:

 

04/08/13 14:06:26 Initializing a PARALLEL shadow for job 50.0

04/08/13 14:06:26 (50.0) (24348): condor_write(): Socket closed when trying to write 37 bytes to daemon at <Server.IP.Cleaned:46094>, fd is 5

04/08/13 14:06:26 (50.0) (24348): Buf::write(): condor_write() failed

04/08/13 14:06:26 (50.0) (24348): ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <Server.IP.Cleaned:46094> (try 1 of 3): CEDAR:6002:failed to send EOM

04/08/13 14:06:27 (50.0) (24348): ERROR "Failed to get number of procs" at line 241 in file /slots/01/dir_65060/userdir/src/condor_shadow.V6.1/parallelshadow.cpp
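
Judging by the condor_write / DC_CHILDALIVE lines right before that error, the shadow seems to lose its connection back to the schedd before it can ask how many procs the job has. If so, the other half of the failure should show up in the schedd's own log on the submit host; assuming the same /var/log/condor layout, something like:

grep 50.0 /var/log/condor/SchedLog
condor_q -better-analyze 50.0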

 

 

 

Dedicated Server’s /etc/condor/condor_config.local

 

##  What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned

## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"

#FOR MPI and other Parallel Universe runs
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"

##  When is this machine willing to start a job?
START = TRUE

##  When to suspend a job?
SUSPEND = FALSE

##  When to nicely stop a job?
##  (as opposed to killing it instantaneously)
PREEMPT = FALSE

##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

 

- - - - - - - - - - - - - - - - - - - - -

All Compute Nodes' /etc/condor/condor_config.local

 

##  What machine is your central manager?
CONDOR_HOST = Server.Name.Cleaned

## Pool's short description
COLLECTOR_NAME = "Server.Name.Cleaned"

#FOR Parallel MPI files to run
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

CONTINUE = True
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)

##  When is this machine willing to start a job?
START = TRUE

##  When to suspend a job?
SUSPEND = FALSE

##  When to nicely stop a job?
##  (as opposed to killing it instantaneously)
PREEMPT = FALSE

##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names
DAEMON_LIST = MASTER, STARTD
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

 

----------

 

 

Note both STARTD_EXPRS and STARTD_ATTRS: the online manual referenced above uses STARTD_ATTRS, but the example file shipped with Condor 7.8.7 uses STARTD_EXPRS, so I put both in. When I run a script it just hangs, with the following in the job's log:

 

 

007 (049.000.000) 04/08 13:39:56 Shadow exception!

                Failed to get number of procs

                0  -  Run Bytes Sent By Job

                0  -  Run Bytes Received By Job

...

007 (049.000.000) 04/08 13:40:00 Shadow exception!

                Failed to get number of procs

                0  -  Run Bytes Sent By Job

                0  -  Run Bytes Received By Job

...

009 (049.000.000) 04/08 13:40:00 Job was aborted by the user.

                via condor_rm (by user condor)

 

 

I tried submitting the example script as-is, with only the server name changed, and get the same results. condor_status lists all the compute nodes in the Unclaimed/Idle state. When I specify machine_count = X, X nodes become Claimed, but the job just sits idle. Does anyone have any thoughts on this?
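
Two checks I can think of to confirm whether the dedicated-scheduler wiring is actually in place: ask one of the startds for its advertised ClassAd, and ask the schedd why the job stays idle. The hostname and job id below are only placeholders:

condor_status -long compute-node-01 | grep -i DedicatedScheduler
condor_config_val -name compute-node-01 -startd STARTD_ATTRS
condor_q -better-analyze 49.0

If DedicatedScheduler does not show up in the startd ad, the attribute is presumably not being advertised at all.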

 

 

 

 

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of David Hentchel
Sent: Monday, April 08, 2013 11:39 AM
To: jerome.leconte@xxxxxxxxxxxxxxx; HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Job Submission in Parallel Universe

 

I just went through this as a beginner. The key is to set up every host running a startd to reference a "dedicated scheduler" in the pool, according to the instructions in the manual section labelled "3.12.8 HTCondor's Dedicated Scheduling". You can also merge in the example shipped with the Condor install (see $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource). All I had to change from that example was the hostname of the machine I wanted to use for the dedicated scheduler.
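
For a startd host, the heart of that example boils down to a few lines like the following (fill in the full hostname of the machine running the dedicated schedd):

DedicatedScheduler = "DedicatedScheduler@full.hostname.of.schedd.machine"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK = Scheduler =?= $(DedicatedScheduler)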

 

Hope this helps.

 

On Mon, Apr 8, 2013 at 8:45 AM, leconte <jerome.leconte@xxxxxxxxxxxxxxx> wrote:

On 08/04/2013 09:53, Muak rules wrote:

I'm submitting my first parallel universe job, which is basically a simple hello world program.
I'm using CentOS 6.3 and installed HTCondor using "yum install condor".
I'm not sure about the version of HTCondor. When I submit the job, it stays in the idle state.
Please help me out with this. Following is my submit description file.


universe=parallel
executable=mpi_hello_world
machine_count=1
log=hello.log
out=hello.out
queue
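
I submit it and watch the queue like this (hello.sub is just what I named the submit file):

condor_submit hello.sub
condor_q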

The attached file contains my job.


hello Muak,

I have tested your program and submit file on my own test cluster; they work fine.

I suspect it is not your program or your submit file but your configuration that causes the problem.

The only line I corrected was
out=hello.out
which I changed to
output=hello.out
because my condor complained about it.
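
For reference, the whole submit description with just that one change reads something like:

universe = parallel
executable = mpi_hello_world
machine_count = 1
log = hello.log
output = hello.out
queue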

I don't know if I can solve your problem, but could you post your cluster config?

Greetings





 

--

David Hentchel

Performance Engineer

www.nuodb.com

(617) 803 - 1193



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/