
Re: [Condor-users] Flocking



On Tue, Jun 12, 2007 at 01:21:49PM +0100, Kewley, J (John) wrote:
> Re: Flocking.
> * Can all your submit nodes in your first pool "see" (i.e. no firewalls in the way,
>   and not behind a NAT) all execute nodes in your other pool?
Yes, I get the full list of execute nodes when I run
---------------------------------------------
condor_status -pool <manager of second pool>
---------------------------------------------
on a submitter of pool A.

> * -remote is for direct submission to another pool, not for flocking.
Hmm, I see, but does it make sense to run
----------------------------------------
condor_submit -pool <manager of pool B>
----------------------------------------
or should a blank 'condor_submit <submit-file>' lead to flocking if
pool A is completely booked out?
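For reference, the flocking-related part of the config on the pool-A
submit node is currently just the FLOCK_TO line from the manual,
roughly like this (the hostname is only a stand-in for the real
pool-B manager):
--------------------------------------------------
# central manager of pool B that jobs may flock to
FLOCK_TO = condor-manager-b.example.org
--------------------------------------------------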

> * Check your HOSTALLOW values in pool B
>
Ahh! Do you mean flocking could work if I include the submitters of pool A in
------------------------
HOSTALLOW_WRITE = ...
------------------------
At least I already have
--------------------------------------------------------------
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD    = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR  = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD     = $(HOSTALLOW_READ), $(FLOCK_FROM)
--------------------------------------------------------------
which is the default and is also what the manual mentions.
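If HOSTALLOW_WRITE itself is the missing piece, I guess the change on
the pool-B machines would look roughly like this (the hostname is only
a placeholder for our pool-A submit node):
------------------------------------------------------------------
# let the flocking submit node of pool A write to the pool-B daemons
FLOCK_FROM      = submit-a.example.org
HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
------------------------------------------------------------------
Is that the idea?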
 
> One test you could do is to name, say, the head node of the 2nd pool (assuming it
> can run jobs) in the REQUIREMENTS statement of a job on pool A. It then CANNOT
> run on pool A and, assuming all else is set up correctly, will run on pool B via flocking.
> If that works, name one of the workers in Pool B and try again. Don't use -remote for this.
> 
> Cheers
> 
> JK

How do I define such a requirement? Something like
------------------------------------------------- 
Requirements = TARGET.HOST == <manager of pool B>
------------------------------------------------- ?
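Or, if the right attribute is the machine name, perhaps something like
this in the submit file (the hostname again is only a placeholder for
the pool-B head node):
------------------------------------------------------
Requirements = (Machine == "head-node-b.example.org")
------------------------------------------------------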

Thanks for the fast help!

Urs
> 
> > -----Original Message-----
> > From: condor-users-bounces@xxxxxxxxxxx
> > [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Urs Fitze
> > Sent: Tuesday, June 12, 2007 12:48 PM
> > To: condor-users@xxxxxxxxxxx
> > Subject: [Condor-users] Flocking
> > 
> > 
> > Hi,
> > 
> > I'm trying to set up flocking between 2 pools having 
> > different UID_DOMAIN and FILESYSTEM_DOMAIN.
> > I followed the (partially unclear) instructions from the 
> > manual '5.2 Connecting Condor Pools with Flocking'
> > i.e. by setting
> > --------------------------------
> > FLOCK_TO =  <manager of pool B> 
> > --------------------------------
> > on the submitter of pool A and setting
> > --------------------------------------------------------------
> > FLOCK_FROM =  <list of hosts containing submitter of pool A>.
> > --------------------------------------------------------------
> > After solving all firewall-issues I submitted a job(-cluster) 
> > on the submitter in pool A by:
> > --------------------------------------------------------------
> > condor_submit -remote <manager of B> -pool <manager of B> <name of submit-file>
> > --------------------------------------------------------------
> > (when omitting the '-remote ..' option the job would NEVER
> > flock to B, even if there were no resources in A; why?)
> > This way I finally got some traces in the logs of the manager
> > of B, namely in '/scratch/condor/log/SchedLog':
> > --------------------------------------------------------------
> > 6/12 12:17:07 (pid:31692) authenticate_self_gss: acquiring 
> > self credentials failed. Please check your Condor 
> > configuration file if this is a server process. Or the user 
> > environment variable if this is a user process.
> > 
> > GSS Major Status: General failure
> > GSS Minor Status Error Chain:
> > globus_gsi_gssapi: Error with GSI credential
> > globus_gsi_gssapi: Error with gss credential handle
> > globus_credential: Valid credentials could not be found in 
> > any of the possible locations specified by the credential 
> > search order.
> > Valid credentials could not be found in any of the possible 
> > locations specified by the credential search order.
> > 
> > Attempt 1
> > 
> > globus_credential: Error reading host credential
> > globus_sysconfig: Could not find a valid certificate file: 
> > The host cert could not be found in:
> > 1) env. var. X509_USER_CERT
> > 2) /etc/grid-security/hostcert.pem
> > 3) $GLOBUS_LOCATION/etc/hostcert.pem
> > 4) $HOME/.globus/hostcert.pem
> > 
> > The host key could not be found in:
> > 1) env. var. X509_USER_KEY
> > 2) /etc/grid-security/hostkey.pem
> > 3) $GLOBUS_LOCATION/etc/hostkey.pem
> > 4) $HOME/.globus/hostkey.pem
> > 
> > 
> > 
> > Attempt 2
> > 
> > globus_credential: Error reading proxy credential
> > globus_sysconfig: Could not find a valid proxy certificate 
> > file location
> > globus_sysconfig: Error with key filename
> > globus_sysconfig: File does not exist: /tmp/x509up_u0 is not 
> > a valid file
> > 
> > Attempt 3
> > 
> > globus_credential: Error reading user credential
> > globus_sysconfig: Error with certificate filename: The user 
> > cert could not be found in:
> > 1) env. var. X509_USER_CERT
> > 2) $HOME/.globus/usercert.pem
> > 3) $HOME/.globus/usercred.p12
> > 
> > 
> > 
> > 
> > 6/12 12:17:07 (pid:31692) AUTHENTICATE: no available 
> > authentication methods succeeded, failing!
> > 6/12 12:17:07 (pid:31692) SCHEDD: authentication failed: 
> > AUTHENTICATE:1003:Failed to authenticate with any 
> > method|AUTHENTICATE:1004:Failed to authenticate using 
> > GSI|GSI:5003:Failed to authenticate.  Globus is reporting 
> > error (851968:133).  There is probably a problem with your 
> > credentials.  (Did you run 
> > grid-proxy-init?)|AUTHENTICATE:1004:Failed to authenticate 
> > using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using 
> > FS|FS:1004:Unable to lstat(/tmp/FS_XXX5hDIkK)
> > --------------------------------------------------------------
> > What happened here? I wonder because the Flocking chapter in
> > the manual makes no mention of 'credentials', 'authentication',
> > etc.; only the reference to the file-transfer mechanism
> > contains some information pointing in this direction.
> > Btw. I got the above log both for vanilla and standard jobs and had
> > -----------------------------------
> > should_transfer_files = YES
> > when_to_transfer_output = ON_EXIT
> > -----------------------------------
> > in the submit-file for the vanilla job.
> > 
> > One possibly notable thing is that the (global) config file
> > for pool A contains the line
> > -----------------------------------
> > AUTHENTICATION_METHODS = FS_REMOTE
> > -----------------------------------
> > while there is no such thing for pool B.
> > 
> > What else do I need to make flocking from A to B work?
> > 
> > Thanks for any help
> > 
> > Regards
> > 
> > Urs Fitze
> > 