Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Flocking 'twixt Condor pools

Date: Fri, 30 Mar 2007 16:05:25 +0100
From: Mark Calleja <M.Calleja@xxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Flocking 'twixt Condor pools

Hi Ian,

Ian Cottam wrote:

Can anyone help with debugging why flocking 'twixt two Condor pools isn't working, please? (Condor 6.6.11 on all machines.)
We have a successful pool - mibpool1 - and we want to create similar pools on student clusters around the University. I have started with a new test pool of a couple of PCs in another building; all is well with it as an independent pool. FLOCK_TO and FLOCK_FROM variables are set correctly on both pool masters.

FLOCK_FROM is a property of a central manager (or "pool master", as you call it). However, FLOCK_TO is a property of a schedd, i.e. a submit machine. Hence, different submit nodes within the same pool can be configured to flock to different external pools, or to the same ones in a different order (flocking is attempted in the order listed in the FLOCK_TO field). Do your submit hosts have this set correctly?
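
As a rough illustration of that split, the two settings might look something like the fragment below. The hostnames are made up for the example, and this only sketches the two variables discussed above, not a complete flocking setup:

```
# Local pool: in the config of each submit machine (schedd).
# cm.pool-b.example.org is a hypothetical central manager of the remote pool.
# Several pools can be listed, comma-separated; they are tried in order.
FLOCK_TO = cm.pool-b.example.org

# Remote pool: in the config of its central manager.
# submit.pool-a.example.org is a hypothetical submit machine in the local pool.
FLOCK_FROM = submit.pool-a.example.org
```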

On my main pool we nearly always have 100 to 200 jobs (mainly Java) queued up ready to run (Idle status in their queues); they never flock over. I can do condor_status -pool <the other pool master> -java and it says they are free and unclaimed.
I've checked with our network experts and there are no firewall or router settings causing problems.
I have taken one of our PCs out of the main pool and put it in its own - mibpooltest - to see if I can flock to that; so far no luck.

What do you see in the SchedLog of the submit host? After the job fails to be serviced by the local pool you should see something like:

<date> <time> (pid:<number>) Increasing flock level for <user>@<submithost> to 1.

Do you have anything like it? If not, what does the following return when run on the submit host:


condor_config_val FLOCK_TO
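
If it helps, both checks can be rolled into a small script along these lines. This is only a sketch, assuming a POSIX shell on the submit host: flock_lines is a helper name I've made up, and I'm assuming SCHEDD_LOG is the config macro that holds the path of the SchedLog on your installation.

```shell
#!/bin/sh
# flock_lines is a hypothetical helper: print the lines in a SchedLog
# that record the schedd raising its flock level.
# $1: path to the SchedLog file.
flock_lines() {
    grep 'Increasing flock level' "$1"
}

# Only query a live installation if the Condor tools are on the PATH.
if command -v condor_config_val >/dev/null 2>&1; then
    # The pools this submit host will try to flock to, in order.
    condor_config_val FLOCK_TO
    # Assumption: SCHEDD_LOG gives the location of the SchedLog.
    flock_lines "$(condor_config_val SCHEDD_LOG)" \
        || echo "no flocking attempts logged"
fi
```

If the script reports no flocking attempts while jobs are sitting idle, the FLOCK_TO setting on that submit host is the first thing to look at.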

Cheers,
Mark

References:
- [Condor-users] Flocking 'twixt Condor pools
  - From: Ian Cottam

