Re: [Condor-users] How To TroubleShoot Flocking

Thanks to everyone who has responded trying to help me with this problem.  I've tried some of the suggestions and am still having the problem.  Here is what I have done so far.
I am submitting a simple job named testlinux3.sub with the following contents:
   Executable     = /bin/hostname
   Requirements    = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
   Universe       = vanilla
   transfer_files = ALWAYS
   Output         = hostname3.out
   Log            = hostname3.log

I use condor_submit testlinux3.sub to submit the job and it goes into the queue.  It sits in the queue for 30 minutes and then it flocks to condor.calumet.purdue.edu as expected; however, I immediately start getting shadow exceptions.  At this point the log shows: (IP addresses have been omitted to protect the guilty :) )
   000 (251318.000.000) 07/06 16:19:55 Job submitted from host: <x.x.x.x:57608>
   001 (251318.000.000) 07/06 16:50:05 Job executing on host: <x.x.x.x:23601>
   007 (251318.000.000) 07/06 16:50:13 Shadow exception!
           Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute '/usr/local/condor/home/execute/dir_14129/condor_exec.exe condor_exec.exe': No such file or directory
           0  -  Run Bytes Sent By Job
           10740  -  Run Bytes Received By Job

Permissions on /usr/local/condor/home/execute are:
   drwxrwxrwt  2 root root 4.0K Jul  6 15:15 execute

There are no other files or directories inside the execute directory.  Condor runs as root on this server.  Also, I have configured this server with LOWPORT = 23410 and HIGHPORT = 23914.  As the log above shows, the port of the executing host (23601) falls within that range.
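
The port check described above can be automated. The following is a minimal sketch, assuming the user log lines have the shape shown in the excerpt above; the helper names `ports_in_log` and `in_range` are hypothetical and not part of Condor. It only confirms that a logged port lies inside the configured range; it does not test whether a firewall actually passes traffic on that port.

```python
import re

# Event header lines in the job's user log, e.g.
#   001 (251318.000.000) 07/06 16:50:05 Job executing on host: <x.x.x.x:23601>
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>[\d.]+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)
# A <host:port> address embedded in the event text.
ADDR_RE = re.compile(r"<[^:>]+:(?P<port>\d+)>")


def ports_in_log(lines):
    """Yield (event_code, port) for each event line that carries an address."""
    for line in lines:
        event = EVENT_RE.match(line.strip())
        if event is None:
            continue  # continuation lines such as the starter error text
        addr = ADDR_RE.search(event.group("text"))
        if addr is not None:
            yield event.group("code"), int(addr.group("port"))


def in_range(port, low, high):
    """True if port lies in the inclusive range [low, high]."""
    return low <= port <= high


# Log lines copied from the message above (addresses already redacted).
log = [
    "000 (251318.000.000) 07/06 16:19:55 Job submitted from host: <x.x.x.x:57608>",
    "001 (251318.000.000) 07/06 16:50:05 Job executing on host: <x.x.x.x:23601>",
    "007 (251318.000.000) 07/06 16:50:13 Shadow exception!",
]

# Range values taken from the message; only the executing host (event 001)
# is compared, since the range was configured on that server.
LOW, HIGH = 23410, 23914
for code, port in ports_in_log(log):
    if code == "001":
        print(port, "in range" if in_range(port, LOW, HIGH) else "out of range")
```

Only event 001 is compared against the range because the submit host's port (57608) comes from a different machine whose port settings are not given here.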
What else can I do to check this?
Thanks again.
John Alberts
Technical Assistant for EMS
CLO 332


From: condor-users-bounces@xxxxxxxxxxx on behalf of Dan Bradley
Sent: Thu 7/6/2006 9:25 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How To TroubleShoot Flocking

By the way: the reference to "condor_exec.exe" is expected. This is the
name Condor runs the user's executable as (i.e. argv[0]). Failure to
execute the job is most often the result of files not being accessible
from the execute node. I assume this is a vanilla universe job. What
file-transfer settings are you using?
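
To answer the file-transfer question quickly, the relevant lines can be pulled out of a submit description. This is a minimal sketch, not Condor's own parser: `parse_submit` is a hypothetical helper that handles only simple `key = value` lines, and the sample text is the submit description posted at the top of this thread.

```python
def parse_submit(text):
    """Return {lowercased key: value} for simple 'key = value' lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so '==' inside Requirements survives.
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip()
    return settings


submit_text = """
   Executable     = /bin/hostname
   Requirements    = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
   Universe       = vanilla
   transfer_files = ALWAYS
   Output         = hostname3.out
   Log            = hostname3.log
"""

settings = parse_submit(submit_text)
# Keep only the settings whose name mentions file transfer.
transfer = {k: v for k, v in settings.items() if "transfer" in k}
print("universe:", settings.get("universe"))
print("executable:", settings.get("executable"))
print("transfer settings:", transfer)
```

Matching on the substring "transfer" is a deliberately loose filter, so it reports whatever transfer-related commands are present without assuming a particular Condor version's command names.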


Kewley, J (John) wrote:

> [don't treat below as gospel - I haven't flocked in a while so some
> things may have changed or I may have misspelled things]
> There are a few subtle things that can stop flocking from working:
> * set FLOCK_TO and FLOCK_FROM at both ends for a 2 way flock
> * HOSTALLOW values may need to be changed to include these other machines
> * If you have security enabled - then this might need to be made more
>   flexible to include other authentication mechanisms
> * Machines in the other pool may be of a different ARCH or OpSys
> * Your jobs may be set up to use a shared filestore (NFS for instance)
>   which isn't available from the other pool.
> You can use
> condor_config_val -pool NODE_NAME -name NODE_NAME val
> where val is one of
> hostallow_write, hostallow_read, flock_to, flock_from
> to see what values are set for the different machines
> But the more usual culprits are firewalls.
> Are there any firewalls between the pools? (Or is one pool behind a NAT?)
> Remember that for jobs to flock, every submit node needs to be able to
> talk to every execute node and vice versa over the fixed ports and
> upper port range, all over both TCP and UDP.
> If that is not the case, you'll have to relax the firewalls or use GCB.
> See also
> http://www.allhands.org.uk/2005/proceedings/papers/431.pdf
> for more info on firewalls in a Condor Pool
> Cheers
> JK
>     -----Original Message-----
>     *From:* condor-users-bounces@xxxxxxxxxxx
>     [mailto:condor-users-bounces@xxxxxxxxxxx]*On Behalf Of *John Alberts
>     *Sent:* Wednesday, July 05, 2006 8:41 PM
>     *To:* Condor-Users Mail List
>     *Subject:* [Condor-users] How To TroubleShoot Flocking
>     Hi. I am trying to set up flocking between 2 condor pools. One pool I
>     have complete control/access to, the other pool I can log in using
>     ssh and submit jobs. The administrator of the other pool is
>     currently on vacation and said he has configured flocking to/from
>     our pool. I'm trying to test it, and it seems like flocking isn't
>     working.
>     I was wondering how I can troubleshoot flocking to see what the
>     culprit is. I already tried to submit a job whose requirements can
>     only be fulfilled on the other pool. condor_q -analyze
>     <jobid> shows that none of the machines can meet the requirements. I
>     have also run condor_status -pool <otherpoolname> and it properly
>     displays all available machines on the other pool. I'm not sure
>     what to check next.
>     Note: There is a firewall between the pools and our network admin
>     has already configured the firewall to allow traffic between pools.
>     Thanks for any help.
>     John Alberts
>     Technical Assistant for EMS
>     alberts@xxxxxxxxxxxxxxxxxx <mailto:alberts@xxxxxxxxxxxxxxxxxx>
>     219-989-2083
>     CLO 332
>     http://public.xdi.org/=john.alberts
>Condor-users mailing list
>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>subject: Unsubscribe
>You can also unsubscribe by visiting
>The archives can be found at either