
Re: [Condor-users] How To TroubleShoot Flocking



Thank you for all of the information, especially the link to the pdf.  I was looking for something like this on the condor site, but I guess I missed it.  I didn’t realize condor_config_val existed.  I’ll give it a try tomorrow, when I resume troubleshooting.

 

Thanks again.

 

 

John Alberts
Technical Assistant for EMS
alberts@xxxxxxxxxxxxxxxxxx
219-989-2083
CLO 332
http://public.xdi.org/=john.alberts


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Kewley, J (John)
Sent: Thursday, July 06, 2006 4:55 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How To TroubleShoot Flocking

 

[Don't treat the below as gospel: I haven't flocked in a while, so some things may have changed or I may have misspelled things.]

 

There are a few subtle things that can stop flocking from working:

* Set FLOCK_TO and FLOCK_FROM at both ends for a two-way flock.

* HOSTALLOW values may need to be changed to include the machines in the other pool.

* If you have security enabled, it might need to be made more flexible to include other authentication mechanisms.

* Machines in the other pool may be of a different ARCH or OpSys.

* Your jobs may be set up to use a shared filestore (NFS, for instance) which isn't available from the other pool.

 

You can use

condor_config_val -pool CENTRAL_MANAGER -name NODE_NAME val

where CENTRAL_MANAGER is the central manager of the pool that NODE_NAME belongs to, and val is one of

hostallow_write, hostallow_read, flock_to, flock_from

to see what values are set for the different machines.
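
To compare these settings across several machines, the same query can be scripted. What follows is only a minimal Python sketch, not a tested recipe. It assumes condor_config_val is on the PATH, that -pool takes the pool's central manager and -name the machine being queried, and the function names (build_commands, query_node) are made up for illustration.

```python
import subprocess

# The four settings named in the message above.
VARIABLES = ["HOSTALLOW_WRITE", "HOSTALLOW_READ", "FLOCK_TO", "FLOCK_FROM"]


def build_commands(central_manager, node, variables=VARIABLES):
    """Return one condor_config_val argument list per variable."""
    return [
        ["condor_config_val", "-pool", central_manager, "-name", node, var]
        for var in variables
    ]


def query_node(central_manager, node):
    """Run each query and collect the reported value, or an error string."""
    results = {}
    for cmd in build_commands(central_manager, node):
        variable = cmd[-1]
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool missing from PATH, or the remote daemon did not answer in time.
            results[variable] = "ERROR: " + str(exc)
            continue
        if proc.returncode == 0:
            results[variable] = proc.stdout.strip()
        else:
            results[variable] = "ERROR: " + proc.stderr.strip()
    return results
```

For example, query_node("cm.example.org", "node1.example.org") (both host names are hypothetical) returns a dictionary mapping each variable name to its value or to an error string, which makes it easy to compare the two pools side by side.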

 

But the more usual culprits are firewalls.

 

Are there any firewalls between the pools? Or is one pool behind a NAT?

 

Remember that for jobs to flock, every submit node needs to be able to talk to every execute node, and vice versa, over the fixed ports and the upper port range, over both TCP and UDP.
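
A quick way to see whether TCP traffic gets through is to try opening connections from a submit node to an execute node, and then the reverse. The Python sketch below is only an illustration under assumptions: the function names are hypothetical, it covers TCP only (UDP needs a separate check), and the ports to probe have to come from your own configuration, for example the collector port (9618 by default, as far as I know) and whatever LOWPORT/HIGHPORT range is set.

```python
import socket


def tcp_state(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all; packets may be getting dropped on the way.
        return "timeout"
    except OSError:
        # Anything else, such as a name lookup failure or no route to host.
        return "error"


def probe(host, ports, timeout=3.0):
    """Map each port to the result of one TCP connection attempt."""
    return {port: tcp_state(host, port, timeout) for port in ports}
```

For example, probe("execute.pool-b.example.org", [9618]) (a hypothetical host name) run from a submit node reports one state per port. A "refused" result generally means the host was reached but no daemon is listening there, while "timeout" often points to packets being dropped, for instance by a firewall.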

 

If that is not the case, you'll have to relax the firewalls or use GCB.

 

See also

for more info on firewalls in a Condor Pool

 

Cheers

 

JK

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of John Alberts
Sent: Wednesday, July 05, 2006 8:41 PM
To: Condor-Users Mail List
Subject: [Condor-users] How To TroubleShoot Flocking

Hi.  I am trying to set up flocking between two Condor pools.  I have complete control of and access to one pool; on the other, I can log in using ssh and submit jobs.  The administrator of the other pool is currently on vacation and said he has configured flocking to/from our pool.  I’m trying to test it, and it seems like flocking isn’t working.

 

I was wondering how I can troubleshoot flocking to see what the culprit is.  I already tried submitting a job whose requirements can only be fulfilled on the other pool.  condor_q -analyze <jobid> shows that none of the machines can meet the requirements.  I have also run condor_status -pool <otherpoolname>, and it properly displays all available machines on the other pool.  I’m not sure what to check next.

 

Note: There is a firewall between the pools and our network admin has already configured the firewall to allow traffic between pools.

 

Thanks for any help.

 

 

John Alberts
Technical Assistant for EMS
alberts@xxxxxxxxxxxxxxxxxx
219-989-2083
CLO 332
http://public.xdi.org/=john.alberts