
Re: [Condor-users] Flocking 'twixt Condor pools



Hi Mark,
I just ran a test where I submitted 60 jobs on our small test pool of two PCs, with flocking enabled to our main pool of 70 machines. I monitored both pools with condor_status. The small one accepted two jobs as expected; the main one set forty-odd machines to status "matched", but they stayed that way for a few minutes and then went back to "unclaimed". Any ideas?
Thanks again for your help!
-Ian


Mark Calleja wrote:
Hi Ian,

Ian Cottam wrote:
Can anyone help with debugging why flocking 'twixt two Condor pools isn't working, please? (Condor 6.6.11 on all machines.)

We have a successful pool - mibpool1 - and we want to create similar pools on student clusters around the University. I have started with a new test pool of a couple of PCs in another building; all is well with it as an independent pool. FLOCK_TO and FLOCK_FROM variables are set correctly on both pool masters.

FLOCK_FROM is a property of a central manager (or "pool master", as you call it). However, FLOCK_TO is a property of a schedd, i.e. a submit machine. Hence, different submit nodes within the same pool can be configured to flock to different external pools, or to the same ones in a different order (flocking is attempted in the order listed in the FLOCK_TO field). Do your submit hosts have this set correctly?
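
As a rough illustration of that split, a minimal sketch of the two settings might look like the fragment below. The hostnames are hypothetical placeholders, not the real machines in either pool, and the fragment shows only these two variables rather than a complete flocking setup.

```
# In the config of each submit machine (schedd) whose jobs should flock out.
# cm.mibpooltest.example.org is a hypothetical name for the central manager
# of the pool being flocked to; several pools may be listed, tried in order.
FLOCK_TO = cm.mibpooltest.example.org

# In the config of the central manager of the pool being flocked to.
# submit.mibpool1.example.org is a hypothetical name for a submit host
# that is allowed to flock in.
FLOCK_FROM = submit.mibpool1.example.org
```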

On my main pool we nearly always have 100 to 200 jobs (mainly Java) queued up ready to run (Idle status in their queues); they never flock over. I can do condor_status -pool <the other pool master> -java and it says they are free and unclaimed.

I've checked with our network experts and there are no firewall or router settings causing problems.

I have taken one of our PCs out of the main pool and put it in its own - mibpooltest - to see if I can flock to that; so far no luck.

What do you see in the SchedLog of the submit host? After the job fails to be serviced by the local pool you should see something like:

<date> <time> (pid:<number>) Increasing flock level for <user>@<submit host> to 1.

Do you have anything like it? If not, what does the following return when run on the submit host:

condor_config_val FLOCK_TO
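
The two checks above can be sketched as a small shell helper. This is only an illustration: count_flock_events is a hypothetical function, not a Condor tool, and the usage comment assumes that condor_config_val is on the PATH and that SCHEDD_LOG points at the SchedLog on the submit host.

```shell
#!/bin/sh
# Sketch only: count_flock_events is a hypothetical helper, not part of Condor.
# It counts the lines in a SchedLog file that report a raised flock level.
count_flock_events() {
    # grep -c prints the number of matching lines (0 if none match)
    grep -c "Increasing flock level" "$1"
}

# Possible use on a submit host (assumes SCHEDD_LOG is set in the config):
#   condor_config_val FLOCK_TO
#   count_flock_events "$(condor_config_val SCHEDD_LOG)"
```

A count of zero alongside an empty or undefined FLOCK_TO would suggest the schedd was never told to flock at all.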

Cheers,
Mark


--
Ian Cottam
Information Systems Manager
Manchester Interdisciplinary Biocentre
The John Garside Building (Room G.002)
The University of Manchester
http://www.manchester.ac.uk/mib
e: ian.cottam@xxxxxxxxxxxxxxxx
t: 0161 306 5198
m: 07856 849831
http://personalpages.manchester.ac.uk/staff/Ian.Cottam