Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Flocking 'twixt Condor pools

Date: Mon, 02 Apr 2007 09:50:54 +0100
From: Ian Cottam <ian.cottam@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Flocking 'twixt Condor pools

Hi Mark,

I just ran a test where I submitted 60 jobs on our small test pool of two PCs but with flocking enabled to our main pool of 70 machines. I monitored both pools with condor_status. The small one accepted two jobs as expected; the main one set forty-odd machines to status "matched" but they stayed that way for a few minutes and then went back to "unclaimed". Any ideas?

Thanks again for help!
-Ian


Mark Calleja wrote:

Hi Ian,

Ian Cottam wrote:
Can anyone help with debugging why flocking 'twixt two Condor pools isn't working, please? (Condor 6.6.11 on all machines.)
We have a successful pool - mibpool1 - and we want to create similar ones on student clusters around the University. I have started with a new test pool of a couple of PCs in another building; all is well with it as an independent pool. FLOCK_TO and FLOCK_FROM variables are set correctly on both pool masters.
FLOCK_FROM is a property of a central manager (or "pool master", as you call it). However, FLOCK_TO is a property of a schedd, i.e. a submit machine. Hence, different submit nodes within the same pool can be configured to flock to different external pools, or to the same ones in a different order (flocking is attempted in the order listed in the FLOCK_TO field). Do your submit hosts have this set correctly?
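To make the split concrete, here is a minimal sketch of where each setting would live. The host names are hypothetical placeholders, not the real machines in either pool, and a working setup may need further settings (such as host-based authorization) that are not shown here.

```
# On each submit host (schedd) in the pool that jobs flock FROM.
# List the central managers of the remote pools, in the order
# in which flocking should be attempted.
# (cm.mainpool.example.org is a made-up name.)
FLOCK_TO = cm.mainpool.example.org

# On the central manager of the pool that jobs flock TO.
# List the remote submit hosts allowed to flock in.
# (submit.testpool.example.org is a made-up name.)
FLOCK_FROM = submit.testpool.example.org
```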
On my main pool we nearly always have 100 to 200 jobs (mainly Java) queued up ready to run (Idle status in their queues); they never flock over. I can do condor_status -pool <the other pool master> -java and it says they are free and unclaimed.
I've checked with our network experts and there are no firewall or router settings causing problems.
I have taken one of our PCs out of the main pool and put it in its own pool - mibpooltest - to see if I can flock to that; so far no luck.
What do you see in the SchedLog of the submit host? After the job fails to be serviced by the local pool you should see something like:
<date> <time> (pid:<number>) Increasing flock level for <user>@<submithost> to 1.
Do you have anything like it? If not, what does the following return when run on the submit host:
condor_config_val FLOCK_TO
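
One way to run the SchedLog check is a small shell helper, sketched below. The function name flock_events is made up for illustration, and the only assumption about the log is that it contains the "Increasing flock level" text quoted above.

```shell
# Hypothetical helper: print every flocking escalation recorded in a
# SchedLog file. The log path is passed as the first argument.
flock_events() {
    # -F treats the pattern as a fixed string rather than a regex.
    grep -F "Increasing flock level" "$1"
}
```

On a submit host it could be called as flock_events "$(condor_config_val SCHEDD_LOG)", assuming SCHEDD_LOG is defined in the local configuration. No output means the schedd has not logged an attempt to flock.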

Cheers,
Mark

--
Ian Cottam
Information Systems Manager
Manchester Interdisciplinary Biocentre
The John Garside Building (Room G.002)
The University of Manchester
http://www.manchester.ac.uk/mib
e: ian.cottam@xxxxxxxxxxxxxxxx
t: 0161 306 5198
m: 07856 849831
http://personalpages.manchester.ac.uk/staff/Ian.Cottam
