
Re: [HTCondor-users] Job Submission in Parallel Universe



M.,

There are many config options that are not included in the "default" configuration files created by the installation process.
Also, Condor lets you set up many config files, some shared across all hosts and others local to a particular host.

I was referencing Andrew's configuration from earlier in this thread.  The goal is to identify a unique, dedicated scheduler so that there is no risk of some other scheduler interfering with your Parallel Universe runs.  You declare this dedicated scheduler on the machine that runs it by adding to any of its config files:

Scheduler = "DedicatedScheduler@Server.Name.Cleaned"

Then, in the condor_config.local file for every host that will be used to execute parallel jobs, you associate the start daemon with that named scheduler:

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"

STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

(but note my earlier remark: if you are running Personal Condor you need to include the user name in the host string).  Andrew also had some other configuration settings in his example that ensure parallel jobs get properly scheduled.  You should look at his configuration, and also at the example provided in your Condor installation (mentioned in my earlier post on this thread): $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource.
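
As a quick sanity check (my suggestion, not part of Andrew's setup; some.execute.node below is a placeholder), you can confirm that an execute node is actually advertising the attribute by querying its ClassAd:

condor_status -l some.execute.node | grep DedicatedScheduler

If the DedicatedScheduler line does not come back, the startd has not picked up the new config; running condor_reconfig on that host usually sorts that out.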


This would be a good discussion to include in the documentation, along with a description of how Parallel Universe jobs compare to the other universes.  We Condor users could even provide draft versions of this kind of information if we had a developers' wiki we could update.


d.




On Tue, Apr 9, 2013 at 3:45 AM, Muak rules <muakrules@xxxxxxxx> wrote:
Hello

Will you please tell me in which file I should add this:

 
Scheduler = "DedicatedScheduler@MY_USERNAME@Server.Name.Cleaned"

as I'm not finding this entry in either "condor_config" or "condor_config.local".

Date: Mon, 8 Apr 2013 16:29:05 -0400
From: dhentchel@xxxxxxxxx
To: htcondor-users@xxxxxxxxxxx

Subject: Re: [HTCondor-users] Job Submission in Parallel Universe

I don't know whether there may be other limitations that could get in the way, but I do know that if you are using Personal Condor you need to qualify the hostname string with your user name, e.g.:
Scheduler = "DedicatedScheduler@MY_USERNAME@Server.Name.Cleaned"



On Mon, Apr 8, 2013 at 3:49 PM, Usman Khan <muakrules@xxxxxxxx> wrote:
Hello all, thanks for your great help.
I think I forgot to mention that I'm running jobs on Personal Condor.
I haven't built my own cluster yet, so I just want to know whether this job can run on Personal Condor.


Greetings.


On 04/09/2013 12:13 AM, Andrew Kuelbs wrote:

 

I am running into an issue with my parallel universe jobs as well.  I have just installed Condor as these instructions describe, with a few other alterations to customize my environment.

I am running RHEL 6.3 Server 64-bit on my master and compute nodes (about 50).  I am running Condor 7.8.7.  I have tried a few parallel MPI scripts, including the one mentioned earlier in this thread.

 

I get the following error for that script in /var/log/condor/ShadowLog:

 

04/08/13 14:06:26 Initializing a PARALLEL shadow for job 50.0

04/08/13 14:06:26 (50.0) (24348): condor_write(): Socket closed when trying to write 37 bytes to daemon at <Server.IP.Cleaned:46094>, fd is 5

04/08/13 14:06:26 (50.0) (24348): Buf::write(): condor_write() failed

04/08/13 14:06:26 (50.0) (24348): ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <Server.IP.Cleaned:46094> (try 1 of 3): CEDAR:6002:failed to send EOM

04/08/13 14:06:27 (50.0) (24348): ERROR "Failed to get number of procs" at line 241 in file /slots/01/dir_65060/userdir/src/condor_shadow.V6.1/parallelshadow.cpp

Dedicated Server’s /etc/condor/condor_config.local

 

##  What machine is your central manager?

CONDOR_HOST = Server.Name.Cleaned

## Pool's short description

COLLECTOR_NAME = "Server.Name.Cleaned"

 

#FOR MPI and other Parallel Universe runs

Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"

 

##  When is this machine willing to start a job?

START = TRUE

##  When to suspend a job?

SUSPEND = FALSE

##  When to nicely stop a job?

##  (as opposed to killing it instantaneously)

PREEMPT = FALSE

##  When to instantaneously kill a preempting job

##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.

##  The list is a comma or space separated list of subsystem names

 

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD

 

- - - - - - - - - - - - - - - - - - - - -

All Compute Nodes' /etc/condor/condor_config.local

 

##  What machine is your central manager?

CONDOR_HOST = Server.Name.Cleaned

 

## Pool's short description

COLLECTOR_NAME = "Server.Name.Cleaned"

 

#FOR Parallel MPI files to run

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx"

STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

 

CONTINUE = True

WANT_SUSPEND = False

WANT_VACATE = False

RANK = Scheduler =?= $(DedicatedScheduler)

 

##  When is this machine willing to start a job?

START = TRUE

 

##  When to suspend a job?

SUSPEND = FALSE

 

##  When to nicely stop a job?

##  (as opposed to killing it instantaneously)

PREEMPT = FALSE

 

##  When to instantaneously kill a preempting job

##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

 

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.

##  The list is a comma or space separated list of subsystem names

DAEMON_LIST = MASTER, STARTD

STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

 

----------

 

 

Note both STARTD_EXPRS and STARTD_ATTRS: the online manual referenced above has STARTD_ATTRS, but the example file for Condor 7.8.7 uses STARTD_EXPRS, so I put both in.  When I run a script it just hangs, with the following in the job's log:

 

 

007 (049.000.000) 04/08 13:39:56 Shadow exception!

                Failed to get number of procs

                0  -  Run Bytes Sent By Job

                0  -  Run Bytes Received By Job

...

007 (049.000.000) 04/08 13:40:00 Shadow exception!

                Failed to get number of procs

                0  -  Run Bytes Sent By Job

                0  -  Run Bytes Received By Job

...

009 (049.000.000) 04/08 13:40:00 Job was aborted by the user.

                via condor_rm (by user condor)

 

 

I tried just dropping in the example script as it was, with only the server name changed, and I simply get the same results.  A condor_status lists all the compute nodes in an Unclaimed/Idle state.  When I specify machine_count=X, X nodes become Claimed, but the job just sits idle.  Does anyone have any thoughts on this?
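
(In case it helps anyone reproduce this: condor_q's analyzer is a generic way to ask why a job stays idle, e.g. for job 49 from the log above:

condor_q -analyze 49.0

and condor_status -l <nodename> shows whether the DedicatedScheduler attribute really made it into a node's ClassAd; <nodename> is a placeholder.)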

 

 

 

 

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of David Hentchel
Sent: Monday, April 08, 2013 11:39 AM
To: jerome.leconte@xxxxxxxxxxxxxxx; HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Job Submission in Parallel Universe

 

I just went through this as a beginner.  The key is to set up every host running a Start daemon to reference a "dedicated scheduler" in the pool, according to the instructions in the manual section labelled "3.12.8 HTCondor's Dedicated Scheduling".  You can also merge in the examples from the Condor install (see $INSTALL_DIR/etc/examples/condor_config.local.dedicated.resource).  All I had to change from that example was the hostname for the machine I wanted to use as the dedicated scheduler.
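
Concretely, the only line I had to edit in the copied example was the scheduler host, i.e. something like (the hostname below is a placeholder):

DedicatedScheduler = "DedicatedScheduler@my.scheduler.host"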

 

Hope this helps.

 

On Mon, Apr 8, 2013 at 8:45 AM, leconte <jerome.leconte@xxxxxxxxxxxxxxx> wrote:

On 08/04/2013 09:53, Muak rules wrote:

I'm submitting my first parallel universe job, which is basically a simple hello-world program.
I'm using CentOS 6.3 and installed HTCondor using "yum install condor".
I'm not sure about the version of HTCondor. When I submit it, the job goes into the idle state.
Please help me out with this. Following is my submit description file:


universe=parallel
executable=mpi_hello_world
machine_count=1
log=hello.log
out=hello.out
queue

The following attachment contains my job.

                                       
   


Hello Muak,

I have tested your program and submit file on my own test cluster. It works fine.

I suppose it is not your program nor your submit file, but your configuration, that causes the problem.

I've corrected only the line
out=hello.out
to
output=hello.out
since my condor complained about it.
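
For clarity, here is the whole submit description file with that single fix applied (otherwise exactly as you posted it):

universe=parallel
executable=mpi_hello_world
machine_count=1
log=hello.log
output=hello.out
queue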

I don't know if I can correct your problem, but can you post your cluster config?

Greetings





 












_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--

David Hentchel

Performance Engineer

www.nuodb.com

(617) 803 - 1193