
Re: [Condor-users] How To TroubleShoot Flocking



Well, I guess I wasn't waiting long enough for the job to flock.  I
didn't realize I had to wait longer than a few minutes.  So, now the job
is getting 'flocked' to the other pool, but I notice a strange problem.

The job fails to run on the remote pool, saying it failed to execute
...condor_exec.exe.  This is a Linux machine submitting to another Linux
machine.  I'm not sure why it is trying to use condor_exec.exe instead
of just condor_exec.  Another strange thing I noticed is that if I
submit the same job from a machine local to that remote pool, it works
fine.



John Alberts
Technical Assistant for EMS
alberts@xxxxxxxxxxxxxxxxxx
219-989-2083
CLO 332
http://public.xdi.org/=john.alberts


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Wednesday, July 05, 2006 2:55 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How To TroubleShoot Flocking

On Wed, Jul 05, 2006 at 02:40:38PM -0500, John Alberts wrote:
> Hi.  I am trying to set up flocking between 2 Condor pools.  1 pool I
> have complete control/access to, the other pool I can log in using ssh
> and submit jobs.  The administrator of the other pool is currently on
> vacation and said he has configured flocking to/from our pool.  I'm
> trying to test it, and it seems like flocking isn't working.
> 
>  
> 
> I was wondering how I can troubleshoot flocking to see what the culprit
> is.  I already tried to submit a job whose requirements can only be
> fulfilled on the other pool.  Condor_status -analyze <jobid> shows that
> all machines can't meet the requirements.

1. I think you mean 'condor_q -analyze'

2. I'm not sure that condor_q -analyze works with remote pools.

> I have also run condor_status
> -pool <otherpoolname> and it properly displays all available machines on
> the other pool.  I'm not sure what to check next.
> 

The next thing to check is to make sure that you're actually flocking
to the remote pool. When a schedd "flocks" to a remote pool, all it does
is send a ClassAd announcing that it has idle jobs to the remote pool.
You can check to see if the remote pool knows that you have idle jobs with

condor_status -pool remote.pool.central.manager -submitters

The schedd will not flock to the remote pool right away - it will wait
until it has had a few negotiation cycles with the local pool before it
decides to "increase the flock level". This usually happens within
15 or 20 minutes of submitting a job that can't be satisfied in the local
pool.
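
Since the submitter ad can take a while to show up, the check above can
be wrapped in a small polling loop. This is only a rough sketch for a
POSIX shell: the function name wait_for_flock, its arguments, and the
default interval and retry count are made up for illustration, and it
assumes the submitter name you pass in (typically something like
user@domain) appears verbatim in the condor_status -submitters output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): poll the remote central
# manager until a submitter ad matching SUBMITTER shows up.
# Usage: wait_for_flock REMOTE_CM SUBMITTER [INTERVAL_SECONDS] [MAX_TRIES]
wait_for_flock() {
    remote_cm=$1
    submitter=$2
    interval=${3:-60}     # seconds between checks (assumed default)
    max_tries=${4:-30}    # 30 x 60s covers the 15-20 minute wait
    try=0
    while [ "$try" -lt "$max_tries" ]; do
        # Same query as above: list the submitter ads the remote pool knows.
        if condor_status -pool "$remote_cm" -submitters 2>/dev/null \
                | grep -F -q -- "$submitter"; then
            echo "submitter ad for $submitter is visible in $remote_cm"
            return 0
        fi
        try=$((try + 1))
        # Sleep only if another check is still to come.
        if [ "$try" -lt "$max_tries" ]; then
            sleep "$interval"
        fi
    done
    echo "no submitter ad for $submitter after $max_tries checks" >&2
    return 1
}
```

For example, "wait_for_flock remote.pool.central.manager user@your.domain"
would check once a minute for up to half an hour. If it returns 1, the
schedd has probably not started flocking to that pool at all.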

-Erik

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR