
Re: [Condor-users] Flocking 'twixt Condor pools



Hi Ian,

Ian Cottam wrote:
Can anyone help with debugging why flocking 'twixt two Condor pools isn't working, please? (Condor 6.6.11 on all machines.)

We have a successful pool - mibpool1 - and we want to create similar on student clusters around the University. I have started with a new test pool of a couple of PCs in another building; all is well with it as an independent pool. FLOCK_TO and FLOCK_FROM variables are set correctly on both pool masters.

FLOCK_FROM is a property of a central manager (or "pool master", as you call it). However, FLOCK_TO is a property of a schedd, i.e. a submit machine. Hence, different submit nodes within the same pool can be configured to flock to different external pools, or to the same ones in a different order (flocking is attempted in the order listed in the FLOCK_TO field). Do your submit hosts have this set correctly?
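
To illustrate which machine each setting belongs on, here is a rough sketch of the two config entries. The hostnames are made up for the example (they are not your real machines), and your site may well need further settings beyond these two:

```
# On each submit machine in the pool that jobs should flock FROM
# (read by the schedd). Pools are tried in the order listed.
# Placeholder: central manager of the pool being flocked to.
FLOCK_TO = cm.mibpooltest.example.org

# On the central manager of the pool being flocked TO.
# Placeholder: a submit machine in the pool the jobs come from.
FLOCK_FROM = submit1.mibpool1.example.org
```

After editing the config files, the daemons need to re-read them (for example via condor_reconfig) before the change takes effect.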

On my main pool we nearly always have 100 to 200 jobs (mainly Java) queued up ready to run (Idle status in their queues); they never flock over. I can do condor_status -pool <the other pool master> -java and it says they are free and unclaimed.

I've checked with our network experts and there are no firewall or router settings causing problems.

I have taken one of our PCs out of the main pool and put it in its own pool - mibpooltest - to see if I can flock to that; so far no luck.

What do you see in the SchedLog on the submit host? Once the local pool fails to service the job, you should see something like:

<date> <time> (pid:<number>) Increasing flock level for <user>@<submit host> to 1.

Do you see anything like that? If not, what does the following return when run on the submit host:

condor_config_val FLOCK_TO

Cheers,
Mark