
Re: [Condor-users] can't get Condor-G job to run



On Fri, 3 Jun 2005, Jaime Frey wrote:

>On Jun 3, 2005, at 5:33 AM, Dr Ian C. Smith wrote:
>
>> --On 02 June 2005 14:50 -0500 Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
>>
>>> On Jun 1, 2005, at 8:29 AM, Dr Ian C. Smith wrote:
>>>
>>> I'm trying to get Condor-G working and I've tried
>>> submitting a job similar to the example in the guide:
>>>
>>> executable = hello.ksh
>>> globusscheduler = ulgsmp1.liv.ac.uk/jobmanager-fork
>>> universe = globus
>>> output = test.out
>>> log = test.log
>>> queue
>>>
>>>
>>> but it just remains in the idle state. The logfile
>>> (/tmp/GridmanagerLog.smithic) shows:
>>>
>>>
>>> 6/1 14:23:43 ******************************************************
>>> 6/1 14:23:43 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
>>> 6/1 14:23:43 ** /opt1/condor/sbin/condor_gridmanager
>>> 6/1 14:23:43 ** $CondorVersion: 6.6.7 Oct 11 2004 $
>>> 6/1 14:23:43 ** $CondorPlatform: SUN4X-SOLARIS29 $
>>> 6/1 14:23:43 ** PID = 9333
>>> 6/1 14:23:43 ******************************************************
>>> 6/1 14:23:43 Using config file: /etc/condor/condor_config
>>> 6/1 14:23:43 Using local config files: /opt1/condor/home/condor_config.local
>>> 6/1 14:23:43 DaemonCore: Command Socket at <138.253.100.177:60984>
>>> 6/1 14:23:43 [9333] GAHP server pid = 9334
>>> 6/1 14:23:46 [9333] DaemonCore: Command received via UDP from host <138.253.100.177:52097>
>>> 6/1 14:23:46 [9333] DaemonCore: received command 60000 (DC_RAISESIGNAL), calling handler (HandleSigCommand())
>>> 6/1 14:23:46 [9333] Found job 142108.0 --- inserting
>>> 6/1 14:23:46 [9333] Found job 142109.0 --- inserting
>>> 6/1 14:23:46 [9333] Found job 142110.0 --- inserting
>>> 6/1 14:23:46 [9333] (142110.0) doEvaluateState called: gmState GM_INIT, globusState 32
>>> 6/1 14:23:46 [9333] (142110.0) proxy not cached yet, waiting...
>>> 6/1 14:23:46 [9333] proxy near expiration or invalid, delaying ping
>>> 6/1 14:23:46 [9333] (142109.0) doEvaluateState called: gmState GM_INIT, globusState 32
>>> 6/1 14:23:46 [9333] (142109.0) proxy not cached yet, waiting...
>>> 6/1 14:23:46 [9333] (142108.0) doEvaluateState called: gmState GM_INIT, globusState 32
>>> 6/1 14:23:46 [9333] (142108.0) proxy not cached yet, waiting...
>>> 6/1 14:23:46 [9333] GAHP command 'CACHE_PROXY_FROM_FILE' failed: Failed to import credential maj=851968 min=5
>>> 6/1 14:23:46 [9333] ERROR "GAHP cache command failed!" at line 357 in file proxymanager.C
>>>
>>>
>>>
>>>
>>> When I use globus-job-run it's fine so the globus bit seems OK.
>>>
>>>
>>> Any ideas on what is going wrong?
>>>
>>>
>>>
>>>
>>> What version of Globus did you create the proxy with, and what
>>> command did you execute (including command-line options)? Globus 4.0
>>> introduces a new proxy format that Condor may have trouble
>>> understanding.
>>>
>>>
>>
>> I'm still using GT2. I've used this on another host with condor-G and
>> it works OK. The command I used was:
>>
>>
>>> $ globus-job-run ulgsmp1 -s /home/qcl/smithic/.lfs/condor-g/hello.ksh
>>>
>>
>> I've run a series of globus integration tests and these work OK
>> apart from gsissh and gsiftp. It looks as though condor-g isn't even
>> contacting the remote gatekeeper.
>
>Condor-G is failing to acquire its local credentials (i.e. read the
>proxy). Can you turn on D_FULLDEBUG for the gridmanager, try again,
>and post the resulting log?

Jaime,

I've attached the log file. The proxy cert is in /tmp/x509up_u<MY_UID>
and has read permission just for me. Is this OK? AFAIK globus
complains if you grant read permission to any other users.
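
For anyone wanting to double-check the permission question, here is a minimal Python sketch that tests whether a proxy file is accessible to its owner only. The helper names (proxy_is_private, default_proxy_path) are made up for illustration, and the /tmp/x509up_u<uid> location is simply the default path seen in this thread; this is not how Globus or Condor performs the check.

```python
import os
import stat


def proxy_is_private(path):
    """Return True if no group or other permission bits are set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def default_proxy_path(uid=None):
    """Build the default proxy path used in this thread (an assumption)."""
    if uid is None:
        uid = os.getuid()
    return "/tmp/x509up_u%d" % uid
```

Calling proxy_is_private(default_proxy_path()) on the submit host reports whether group or other users have any access to the proxy file.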

Thanks for looking at this.

-ian

PS I think you mentioned that there was a problem with GT4 proxies. Is
  this a show stopper or is there a workaround? It would be good to
  move to GT4 soon.

>
>+----------------------------------+---------------------------------+
>|            Jaime Frey            |  Public Split on Whether        |
>|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
>|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
>+----------------------------------+---------------------------------+
>
>
>
6/6 11:51:58 [5869] GAHP <- 'RESULTS'
6/6 11:51:58 [5869] GAHP -> 'S' '0'
6/6 11:52:34 [5869] checkResources(): 1 resources, 0 are down
6/6 11:52:54 [5869] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3117 is alive.
6/6 11:52:58 [5869] GAHP <- 'RESULTS'
6/6 11:52:58 [5869] GAHP -> 'S' '0'
6/6 11:53:34 [5869] checkResources(): 1 resources, 0 are down
6/6 11:53:58 [5869] GAHP <- 'RESULTS'
6/6 11:53:58 [5869] GAHP -> 'S' '0'
6/6 11:54:34 [5869] checkResources(): 1 resources, 0 are down
6/6 11:54:54 [5869] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3117 is alive.
6/6 11:54:58 [5869] GAHP <- 'RESULTS'
6/6 11:54:58 [5869] GAHP -> 'S' '0'
6/6 11:55:25 [5869] Received ADD_JOBS signal
6/6 11:55:25 [5869] in doContactSchedd()
6/6 11:55:25 [5869] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/6 11:55:25 [5869] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
6/6 11:55:25 [5869] AUTHENTICATE_FS: used file /tmp/qmgr_LHL.EKOfg, status: 1
6/6 11:55:25 [5869] querying for new jobs
6/6 11:55:25 [5869] Using constraint ((Owner=?="smithic"&&x509userproxysubject=?="/C=UK/O=eScience/OU=Liverpool/L=CSD/CN=smith_ian/CN=proxy")) && JobUniverse == 9 && Matched =!= FALSE && JobStatus != 5 && JobStatus != 4 && (JobStatus != 3 || GlobusContactString != "X") && Managed =!= TRUE
6/6 11:55:25 [5869] Found job 142594.0 --- inserting
6/6 11:55:25 [5869] Fetched 1 new job ads from schedd
6/6 11:55:25 [5869] leaving doContactSchedd()
6/6 11:55:25 [5869] (142594.0) doEvaluateState called: gmState GM_INIT, globusState 32
6/6 11:55:25 [5869] (142594.0) proxy not cached yet, waiting...
6/6 11:55:34 [5869] checkResources(): 1 resources, 0 are down
6/6 11:55:58 [5869] GAHP <- 'RESULTS'
6/6 11:55:58 [5869] GAHP -> 'S' '0'
6/6 11:56:34 [5869] checkResources(): 1 resources, 0 are down
6/6 11:56:54 [5869] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3117 is alive.
6/6 11:56:58 [5869] GAHP <- 'RESULTS'
6/6 11:56:58 [5869] GAHP -> 'S' '0'
6/6 11:57:35 [5869] checkResources(): 1 resources, 0 are down
6/6 11:57:58 [5869] GAHP <- 'RESULTS'
6/6 11:57:58 [5869] GAHP -> 'S' '0'
6/6 11:58:35 [5869] checkResources(): 1 resources, 0 are down
6/6 11:58:54 [5869] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3117 is alive.
6/6 11:58:58 [5869] GAHP <- 'RESULTS'
6/6 11:58:58 [5869] GAHP -> 'S' '0'
6/6 11:59:35 [5869] checkResources(): 1 resources, 0 are down
6/6 11:59:58 [5869] GAHP <- 'RESULTS'
6/6 11:59:58 [5869] GAHP -> 'S' '0'
6/6 12:00:35 [5869] checkResources(): 1 resources, 0 are down
6/6 12:00:54 [5869] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3117 is alive.
6/6 12:00:58 [5869] GAHP <- 'RESULTS'
6/6 12:00:58 [5869] GAHP -> 'S' '0'
6/6 12:01:33 [5869] CheckProxies called
6/6 12:01:33 [5869]   (re)caching proxy 1
6/6 12:01:33 [5869] GAHP <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u41269'
6/6 12:01:33 [5869] GAHP -> 'F' 'Failed to import credential maj=851968 min=5'
6/6 12:01:33 [5869] GAHP command 'CACHE_PROXY_FROM_FILE' failed: Failed to import credential maj=851968 min=5
6/6 12:01:33 [5869] ERROR "GAHP cache command failed!" at line 357 in file proxymanager.C
6/6 12:05:29 PASSWD_CACHE_REFRESH is undefined, using default value of 300

6/6 12:05:29 ******************************************************
6/6 12:05:29 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
6/6 12:05:29 ** /opt1/condor/sbin/condor_gridmanager
6/6 12:05:29 ** $CondorVersion: 6.6.7 Oct 11 2004 $
6/6 12:05:29 ** $CondorPlatform: SUN4X-SOLARIS29 $
6/6 12:05:29 ** PID = 9984
6/6 12:05:29 ******************************************************
6/6 12:05:29 Using config file: /etc/condor/condor_config
6/6 12:05:29 Using local config files: /opt1/condor/home/condor_config.local
6/6 12:05:29 DaemonCore: Command Socket at <138.253.100.177:64542>
6/6 12:05:29 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
6/6 12:05:29 GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/6 12:05:29 Welcome to the all-singing, all dancing, "amazing" GridManager!
6/6 12:05:29 [9984] GAHP server pid = 9985
6/6 12:05:29 [9984] GAHP server version: $GahpVersion: 1.0.12 Oct 11 2004 UW Gahp $
6/6 12:05:29 [9984] GAHP <- 'COMMANDS'
6/6 12:05:29 [9984] GAHP -> 'S' 'COMMANDS' 'GASS_SERVER_INIT' 'GRAM_CALLBACK_ALLOW' 'GRAM_ERROR_STRING' 'GRAM_JOB_CALLBACK_REGISTER' 'GRAM_JOB_CANCEL' 'GRAM_JOB_REQUEST' 'GRAM_JOB_SIGNAL' 'GRAM_JOB_STATUS' 'GRAM_PING' 'INITIALIZE_FROM_FILE' 'QUIT' 'RESULTS' 'VERSION' 'ASYNC_MODE_ON' 'ASYNC_MODE_OFF' 'RESPONSE_PREFIX' 'REFRESH_PROXY_FROM_FILE' 'CACHE_PROXY_FROM_FILE' 'USE_CACHED_PROXY' 'UNCACHE_PROXY' 'GRAM_JOB_REFRESH_PROXY' ''
6/6 12:05:29 [9984] GAHP <- 'RESPONSE_PREFIX GAHP:'
6/6 12:05:29 [9984] GAHP -> 'S'
6/6 12:05:29 [9984] GAHP <- 'ASYNC_MODE_ON'
6/6 12:05:29 [9984] GAHP -> 'S'
6/6 12:05:29 [9984] GRIDMANAGER_CONTACT_SCHEDD_DELAY is undefined, using default value of 5
6/6 12:05:29 [9984] GRIDMANAGER_JOB_PROBE_INTERVAL is undefined, using default value of 300
6/6 12:05:29 [9984] GRIDMANAGER_RESOURCE_PROBE_INTERVAL is undefined, using default value of 300
6/6 12:05:29 [9984] GRIDMANAGER_GAHP_CALL_TIMEOUT is undefined, using default value of 300
6/6 12:05:29 [9984] GRIDMANAGER_CONNECT_FAILURE_RETRY_COUNT is undefined, using default value of 3
6/6 12:05:29 [9984] ENABLE_GRID_MONITOR is undefined, using default value of False
6/6 12:05:29 [9984] GRIDMANAGER_CHECKPROXY_INTERVAL is undefined, using default value of 600
6/6 12:05:29 [9984] GRIDMANAGER_MINIMUM_PROXY_TIME is undefined, using default value of 180
6/6 12:05:29 [9984] GRIDMANAGER_MAX_PENDING_REQUESTS is undefined, using default value of 50
6/6 12:05:29 [9984] CheckProxies called
6/6 12:05:29 [9984]   will call CheckProxies again in 600 seconds
6/6 12:05:30 [9984] checkResources(): 0 resources, 0 are down
6/6 12:05:30 [9984] DaemonCore: in SendAliveToParent()
6/6 12:05:30 [9984] DaemonCore: attempting to connect to '<138.253.100.177:33163>'
6/6 12:05:30 [9984] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/6 12:05:34 [9984] Received ADD_JOBS signal
6/6 12:05:34 [9984] in doContactSchedd()
6/6 12:05:34 [9984] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/6 12:05:34 [9984] AUTHENTICATE_FS: used file /tmp/qmgr_VHLiFKOfg, status: 1
6/6 12:05:34 [9984] querying for new jobs
6/6 12:05:34 [9984] Using constraint ((Owner=?="smithic"&&x509userproxysubject=?="/C=UK/O=eScience/OU=Liverpool/L=CSD/CN=smith_ian/CN=proxy")) && JobUniverse == 9 && (Matched =!= FALSE || Managed =?= TRUE) && ((JobStatus == 5 || JobStatus == 4 || (JobStatus == 3 && GlobusContactString == "X")) && Managed =!= TRUE) == FALSE
6/6 12:05:34 [9984] Found job 142594.0 --- inserting
6/6 12:05:34 [9984] Fetched 1 new job ads from schedd
6/6 12:05:34 [9984] leaving doContactSchedd()
6/6 12:05:34 [9984] (142594.0) doEvaluateState called: gmState GM_INIT, globusState 32
6/6 12:05:34 [9984] (142594.0) proxy not cached yet, waiting...
6/6 12:05:34 [9984] CheckProxies called
6/6 12:05:34 [9984]   (re)caching proxy 1
6/6 12:05:34 [9984] GAHP <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u41269'
6/6 12:05:34 [9984] GAHP -> 'F' 'Failed to import credential maj=851968 min=5'
6/6 12:05:34 [9984] GAHP command 'CACHE_PROXY_FROM_FILE' failed: Failed to import credential maj=851968 min=5
6/6 12:05:34 [9984] ERROR "GAHP cache command failed!" at line 357 in file proxymanager.C
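
As a side note on the "maj=851968 min=5" values in the log above: assuming the GAHP reports raw GSS-API status codes (an assumption, not confirmed anywhere in this thread), the major status can be split into its fields using the layout from the GSS-API C bindings (RFC 2744). The function names below are hypothetical, and the minor status is mechanism-specific, so it is not decoded here.

```python
# Routine error numbers as defined in the GSS-API C bindings (RFC 2744).
ROUTINE_ERRORS = {
    1: "GSS_S_BAD_MECH",
    2: "GSS_S_BAD_NAME",
    3: "GSS_S_BAD_NAMETYPE",
    4: "GSS_S_BAD_BINDINGS",
    5: "GSS_S_BAD_STATUS",
    6: "GSS_S_BAD_SIG",
    7: "GSS_S_NO_CRED",
    8: "GSS_S_NO_CONTEXT",
    9: "GSS_S_DEFECTIVE_TOKEN",
    10: "GSS_S_DEFECTIVE_CREDENTIAL",
    11: "GSS_S_CREDENTIALS_EXPIRED",
    12: "GSS_S_CONTEXT_EXPIRED",
    13: "GSS_S_FAILURE",
    14: "GSS_S_BAD_QOP",
    15: "GSS_S_UNAUTHORIZED",
    16: "GSS_S_UNAVAILABLE",
    17: "GSS_S_DUPLICATE_ELEMENT",
    18: "GSS_S_NAME_NOT_MN",
}


def decode_major_status(maj):
    """Split a major status into (calling error, routine error, supplementary bits)."""
    calling = (maj >> 24) & 0xFF      # bits 24-31
    routine = (maj >> 16) & 0xFF      # bits 16-23
    supplementary = maj & 0xFFFF      # bits 0-15
    return calling, routine, supplementary


def routine_error_name(maj):
    """Return the symbolic routine error name, or None if there is no match."""
    return ROUTINE_ERRORS.get(decode_major_status(maj)[1])
```

Under that assumption, 851968 is 13 << 16, so the routine error field is 13 (GSS_S_FAILURE) with no calling error or supplementary bits. That is the generic failure code, which means the real cause would be carried by the mechanism-specific minor status.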