
Re: [Condor-users] Flocking



Re: Flocking.
* Can all your submit nodes in your first pool "see" (i.e. no firewalls in the way,
  and not behind a NAT) all execute nodes in your other pool?
* -remote is for direct submission to another pool, not for flocking.
* Check your HOSTALLOW values in pool B (see the sketch after this list).
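
As a reference point, here is a minimal sketch along the lines of the manual's
section 5.2 (hostnames are placeholders, not your actual machines). On the
submit machine in pool A:
--------------------------------------------------------------
FLOCK_TO = cm-b.example.org
--------------------------------------------------------------
and on the central manager and execute machines of pool B:
--------------------------------------------------------------
FLOCK_FROM = submit-a.example.org
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD    = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR  = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD     = $(HOSTALLOW_READ), $(FLOCK_FROM)
--------------------------------------------------------------
If I remember the manual right, FLOCK_COLLECTOR_HOSTS and FLOCK_NEGOTIATOR_HOSTS
default to $(FLOCK_TO), so you normally don't need to set them. Run
condor_reconfig on the affected machines after editing.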

One test you could do is to name, say, the head node of the 2nd pool (assuming it
can run jobs) in the REQUIREMENTS statement of a job on pool A. It then CANNOT
run on pool A and, assuming all else is set up correctly, will run on pool B via flocking.
If that works, name one of the workers in pool B and try again. Don't use -remote for this.
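
For that test, a submit file along these lines should do (the hostname and file
names are made up, adjust them to your setup):
--------------------------------------------------------------
universe     = vanilla
executable   = /bin/hostname
requirements = (Machine == "headnode.poolb.example.org")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output       = flocktest.out
error        = flocktest.err
log          = flocktest.log
queue
--------------------------------------------------------------
Submit it with a plain "condor_submit flocktest.sub" (no -remote, no -pool) from
the submit node of pool A and watch where it ends up with condor_q and the job log.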

Cheers

JK

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Urs Fitze
> Sent: Tuesday, June 12, 2007 12:48 PM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] Flocking
> 
> 
> Hi,
> 
> I'm trying to set up flocking between 2 pools having 
> different UID_DOMAIN and FILESYSTEM_DOMAIN.
> I followed the (partially unclear) instructions from the 
> manual '5.2 Connecting Condor Pools with Flocking'
> i.e. by setting
> --------------------------------
> FLOCK_TO =  <manager of pool B> 
> --------------------------------
> on the submitter of pool A and setting
> --------------------------------------------------------------
> FLOCK_FROM =  <list of hosts containing submitter of pool A>.
> --------------------------------------------------------------
> After solving all firewall issues I submitted a job(-cluster) 
> on the submitter in pool A by:
> --------------------------------------------------------------
> condor_submit -remote <manager of B> -pool <manager of B> <name of submit-file>
> --------------------------------------------------------------
> (when omitting the '-remote ..' option the job would NEVER 
> flock to B, even if there were no resources in A, why?)
> This way I finally got some traces in the logs of the manager 
> of B, namely in 
> '/scratch/condor/log/SchedLog':
> --------------------------------------------------------------
> 6/12 12:17:07 (pid:31692) authenticate_self_gss: acquiring 
> self credentials failed. Please check your Condor 
> configuration file if this is a server process. Or the user 
> environment variable if this is a user process.
> 
> GSS Major Status: General failure
> GSS Minor Status Error Chain:
> globus_gsi_gssapi: Error with GSI credential
> globus_gsi_gssapi: Error with gss credential handle
> globus_credential: Valid credentials could not be found in 
> any of the possible locations specified by the credential 
> search order.
> Valid credentials could not be found in any of the possible 
> locations specified by the credential search order.
> 
> Attempt 1
> 
> globus_credential: Error reading host credential
> globus_sysconfig: Could not find a valid certificate file: 
> The host cert could not be found in:
> 1) env. var. X509_USER_CERT
> 2) /etc/grid-security/hostcert.pem
> 3) $GLOBUS_LOCATION/etc/hostcert.pem
> 4) $HOME/.globus/hostcert.pem
> 
> The host key could not be found in:
> 1) env. var. X509_USER_KEY
> 2) /etc/grid-security/hostkey.pem
> 3) $GLOBUS_LOCATION/etc/hostkey.pem
> 4) $HOME/.globus/hostkey.pem
> 
> 
> 
> Attempt 2
> 
> globus_credential: Error reading proxy credential
> globus_sysconfig: Could not find a valid proxy certificate 
> file location
> globus_sysconfig: Error with key filename
> globus_sysconfig: File does not exist: /tmp/x509up_u0 is not 
> a valid file
> 
> Attempt 3
> 
> globus_credential: Error reading user credential
> globus_sysconfig: Error with certificate filename: The user 
> cert could not be found in:
> 1) env. var. X509_USER_CERT
> 2) $HOME/.globus/usercert.pem
> 3) $HOME/.globus/usercred.p12
> 
> 
> 
> 
> 6/12 12:17:07 (pid:31692) AUTHENTICATE: no available 
> authentication methods succeeded, failing!
> 6/12 12:17:07 (pid:31692) SCHEDD: authentication failed: 
> AUTHENTICATE:1003:Failed to authenticate with any 
> method|AUTHENTICATE:1004:Failed to authenticate using 
> GSI|GSI:5003:Failed to authenticate.  Globus is reporting 
> error (851968:133).  There is probably a problem with your 
> credentials.  (Did you run 
> grid-proxy-init?)|AUTHENTICATE:1004:Failed to authenticate 
> using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using 
> FS|FS:1004:Unable to lstat(/tmp/FS_XXX5hDIkK)
> --------------------------------------------------------------
> What happened here? I wonder, because the flocking chapter 
> of the manual makes no mention of 'credentials', 
> 'authentication' etc.; only the reference to the 
> file-transfer mechanism contains some information in this 
> direction.
> Btw. I got the above log both for vanilla and standard jobs and had
> -----------------------------------
> should_transfer_files = YES
> when_to_transfer_output = ON_EXIT
> -----------------------------------
> in the submit-file for the vanilla job.
> 
> One possibly remarkable thing is that in the (global) 
> config file for pool A there is the line
> -----------------------------------
> AUTHENTICATION_METHODS = FS_REMOTE
> -----------------------------------
> while there is no such thing for pool B.
> 
> What else do I need to make flocking from A to B work?
> 
> Thanks for any help
> 
> Regards
> 
> Urs Fitze
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to 
> condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/
>