
[Condor-users] Problems about condor glidein



 
Hi everyone,
 
I have installed condor-7.2.4, globus-4.2.1 and torque-2.3.7 on one machine to see how Condor-G and Condor glidein work. I tried Condor-G first and everything worked fine: I could submit jobs to Condor, and through GRAM4 the jobs ran on PBS. But when I tried the condor_glidein command, it just hung:
 
[agrid@server condor-test]$ condor_glidein -count 1 -arch 7.3.2-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork server.nova.cn/jobmanager-pbs
Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
 
 
The following files were generated in the working directory:
 
[agrid@server condor-test]$ ll
Total 56
-rwxr-xr-x 1 agrid agrid 5086 10-27 17:58 glidein_remote_setup.8140
-rwxr-xr-x 1 agrid agrid 5086 10-27 17:59 glidein_remote_setup.8170
-rw-rw-r-- 1 agrid agrid    0 10-27 17:59 glidein_setup.error.8170
-rw-rw-r-- 1 agrid agrid  314 10-27 17:59 glidein_setup.log.8170
-rw-rw-r-- 1 agrid agrid    0 10-27 17:59 glidein_setup.output.8170
-rw-rw-r-- 1 agrid agrid  516 10-27 17:58 glidein_setup.submit.8140
-rw-rw-r-- 1 agrid agrid  516 10-27 17:59 glidein_setup.submit.8170
Here is the content of the glidein_setup.log.8170 file:
 
000 (037.000.000) 10/27 17:59:02 Job submitted from host: <10.10.3.159:57089>
...
020 (037.000.000) 10/27 17:59:15 Detected Down Globus Resource
    RM-Contact: server.nova.cn/jobmanager-fork
...
026 (037.000.000) 10/27 17:59:15 Detected Down Grid Resource
    GridResource: gt2 server.nova.cn/jobmanager-fork
 
I noticed that Condor automatically set the grid resource type to gt2. Is this right? I ask because I am using gt4.
 
I also turned on debug mode for the gridmanager and got the following in its log file:
 
10/27 17:59:07 ******************************************************
10/27 17:59:07 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
10/27 17:59:07 ** /opt/condor-7.2.4/sbin/condor_gridmanager
10/27 17:59:07 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(10) class=DAEMON(1)
10/27 17:59:07 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
10/27 17:59:07 ** $CondorVersion: 7.2.4 Jun 16 2009 BuildID: 159529 $
10/27 17:59:07 ** $CondorPlatform: I386-LINUX_RHEL5 $
10/27 17:59:07 ** PID = 8202
10/27 17:59:07 ** Log last touched 10/27 11:54:30
10/27 17:59:07 ******************************************************
10/27 17:59:07 Using config source: /opt/condor-7.2.4/etc/condor_config
10/27 17:59:07 Using local config sources: 
10/27 17:59:07    /opt/condor-7.2.4/local.server/condor_config.local
10/27 17:59:07 Running as root.  Enabling specialized core dump routines
10/27 17:59:07 DaemonCore: Command Socket at <10.10.3.159:34693>
10/27 17:59:07 Will use UDP to update collector server.nova.cn <10.10.3.159:9618>
10/27 17:59:07 [8202] Welcome to the all-singing, all dancing, "amazing" GridManager!
10/27 17:59:07 [8202] DaemonCore: in SendAliveToParent()
10/27 17:59:07 [8202] Initialized the following authorization table:
10/27 17:59:07 [8202] Authorizations yet to be resolved:
10/27 17:59:07 [8202] allow NEGOTIATOR:  */server.nova.cn */10.10.3.159
10/27 17:59:07 [8202] allow ADMINISTRATOR:  */server.nova.cn */10.10.3.159
10/27 17:59:07 [8202] allow OWNER:  */server.nova.cn */server.nova.cn */10.10.3.159 */10.10.3.159
10/27 17:59:07 [8202] DaemonCore: Leaving SendAliveToParent() - success
10/27 17:59:07 [8202] Checking proxies
10/27 17:59:10 [8202] Received ADD_JOBS signal
10/27 17:59:10 [8202] in doContactSchedd()
10/27 17:59:10 [8202] querying for new jobs
10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (((Matched =!= FALSE) && (JobStatus != 5)) || (Managed =?= "External"))
10/27 17:59:10 [8202] Using job type Globus for job 37.0
10/27 17:59:10 [8202] (37.0) SetJobLeaseTimers()
10/27 17:59:10 [8202] Found job 37.0 --- inserting
10/27 17:59:10 [8202] Fetched 1 new job ads from schedd
10/27 17:59:10 [8202] querying for removed/held jobs
10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
10/27 17:59:10 [8202] Fetched 0 job ads from schedd
10/27 17:59:10 [8202] leaving doContactSchedd()
10/27 17:59:10 [8202] gahp server not up yet, delaying ping
10/27 17:59:10 [8202] *** UpdateLeases called
10/27 17:59:10 [8202]     Leases not supported, cancelling timer
10/27 17:59:10 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:10 [8202] GAHP server not initialized yet, not submitting grid_monitor now
10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_INIT, globusState 32
10/27 17:59:10 [8202] Create_Process: using fast clone() to create child process.
10/27 17:59:10 [8202] GAHP server pid = 8206
10/27 17:59:10 [8202] GAHP server version: $GahpVersion: 1.0.16 Jun 16 2009 UW Gahp $
10/27 17:59:10 [8202] GAHP[8206] <- 'COMMANDS'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'COMMANDS' 'GASS_SERVER_INIT' 'GRAM_CALLBACK_ALLOW' 'GRAM_ERROR_STRING' 'GRAM_JOB_CALLBACK_REGISTER' 'GRAM_JOB_CANCEL' 'GRAM_JOB_REQUEST' 'GRAM_JOB_SIGNAL' 'GRAM_JOB_STATUS' 'GRAM_PING' 'INITIALIZE_FROM_FILE' 'QUIT' 'RESULTS' 'VERSION' 'ASYNC_MODE_ON' 'ASYNC_MODE_OFF' 'RESPONSE_PREFIX' 'REFRESH_PROXY_FROM_FILE' 'CACHE_PROXY_FROM_FILE' 'USE_CACHED_PROXY' 'UNCACHE_PROXY' 'GRAM_JOB_REFRESH_PROXY'
10/27 17:59:10 [8202] GAHP[8206] <- 'RESPONSE_PREFIX GAHP:'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'ASYNC_MODE_ON'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'INITIALIZE_FROM_FILE /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 2 /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'USE_CACHED_PROXY 2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u502'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'GRAM_CALLBACK_ALLOW 2 0'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'https://server.nova.cn:56844/'
10/27 17:59:10 [8202] GAHP[8206] <- 'GASS_SERVER_INIT 3 0'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:10 [8202] GAHP[8206] -> 'R'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:10 [8202] GAHP[8206] -> '3' '0' 'https://server.nova.cn:51333'
10/27 17:59:10 [8202] (37.0) gm state change: GM_INIT -> GM_START
10/27 17:59:10 [8202] (37.0) gm state change: GM_START -> GM_CLEAR_REQUEST
10/27 17:59:10 [8202] (37.0) gm state change: GM_CLEAR_REQUEST -> GM_UNSUBMITTED
10/27 17:59:10 [8202] (37.0) gm state change: GM_UNSUBMITTED -> GM_SUBMIT
10/27 17:59:10 [8202] Final RSL: &(rsl_substitution=(GRIDMANAGER_GASS_URL https://server.nova.cn:51333))(executable=$(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_remote_setup.8170')(directory='/tmp')(arguments=$(HOME)#'/Condor_glidein' $(HOME)#'/Condor_glidein/7.3.2-i686-pc-Linux-2.4' '7.3.2-i686-pc-Linux-2.4' $(HOME)#'/Condor_glidein/local' 'http://www.cs.wisc.edu/condor/glidein/binaries' '0')(stdout=$(GLOBUS_CACHED_STDOUT))(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDOUT) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.output.8170')($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.error.8170'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))
10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' '0'
10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
10/27 17:59:15 [8202] GAHP[8206] <- 'GRAM_PING 4 server.nova.cn:2119'
10/27 17:59:15 [8202] GAHP[8206] -> 'S'
10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119: first ping not done yet, will retry later
10/27 17:59:15 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:15 [8202] GAHP[8206] -> 'R'
10/27 17:59:15 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:15 [8202] GAHP[8206] -> '4' '79'
10/27 17:59:15 [8202] resource server.nova.cn:2119 is now down
10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
10/27 17:59:15 [8202] (37.0) Writing globus down record to user logfile
10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.411181 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.431371 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
10/27 17:59:15 [8202] (37.0) Writing grid source down record to user logfile
10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.431812 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.437901 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
10/27 17:59:15 [8202] in doContactSchedd()
10/27 17:59:15 [8202] querying for removed/held jobs
10/27 17:59:15 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
10/27 17:59:15 [8202] Fetched 0 job ads from schedd
10/27 17:59:15 [8202] Updating classad values for 37.0:
10/27 17:59:15 [8202]    GridResourceUnavailableTime = 1256637555
10/27 17:59:15 [8202]    GlobusResourceUnavailableTime = 1256637555
10/27 17:59:15 [8202] leaving doContactSchedd()
10/27 17:59:15 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_JOB_REQUEST 5 server.nova.cn:2119/jobmanager-fork NULL 1 &(executable=https://server.nova.cn:51333/opt/condor-7.2.4/sbin/grid_monitor.sh)(stdout=https://server.nova.cn:51333/tmp/condor_g_scratch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-log)(arguments='--dest-url="">
ch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-job-status')'
10/27 17:59:20 [8202] GAHP[8206] -> 'S'
10/27 17:59:20 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:20 [8202] GAHP[8206] -> 'R'
10/27 17:59:20 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:20 [8202] GAHP[8206] -> '5' '12' 'NULL'
10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_ERROR_STRING 12'
10/27 17:59:20 [8202] GAHP[8206] -> 'S' 'the connection to the server failed (check host and port)'
10/27 17:59:20 [8202] grid_monitor job submit failed for resource server.nova.cn:2119, gram error 12 (the connection to the server failed (check host and port))
10/27 17:59:20 [8202] Giving up on grid_monitor for site server.nova.cn:2119.  Will retry in 3600 seconds (60 minutes)
10/27 17:59:20 [8202] Stopping grid_monitor for resource server.nova.cn:2119
 
I think this log indicates that server.nova.cn:2119 is down. However, I checked the port and something is listening on it:
[agrid@server condor-test]$ netstat -nat | grep 2119
tcp        0      0 0.0.0.0:2119                0.0.0.0:*                   LISTEN
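
The netstat output only shows that some process is bound to port 2119; it does not show whether a client can actually complete a connection to server.nova.cn:2119 by name. A quick cross-check is to attempt a plain TCP connection to that host and port. The following is a minimal sketch, assuming Python 3 is available on the machine. The function name can_connect is made up for illustration and is not part of Condor or Globus, and a successful TCP connect says nothing about whether the GRAM gatekeeper itself answers correctly.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only tests name resolution and TCP reachability; it does not
    speak the GRAM protocol or check certificates.
    """
    try:
        # create_connection resolves the host name and tries each address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call, using the host and port from the gridmanager log:
#   can_connect("server.nova.cn", 2119)
```

If this returns False for the host name but True for 127.0.0.1 or the machine's IP address, the problem is more likely name resolution or a firewall rule than the service on port 2119.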
 
Any ideas would be appreciated.
 
-Hailong
 
2009-10-27

***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx
* Address: G413, New Main Building in Beihang University,
*              No.37 XueYuan Road,HaiDian District,
*              Beijing,P.R.China,100191
***********************************************