
Re: [Condor-users] Problems about condor glidein



Hello Hailong,

The problem you are having is that Condor is assuming the grid type is "gt2" instead of "gt4". Currently, the way to adjust that is to pass the -gensubmit option to condor_glidein. Then, instead of submitting jobs to Condor directly, condor_glidein will write the submit file that it would have used and exit. You can then modify the submit file (changing the grid type from gt2 to gt4) and submit the job to Condor yourself.
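
As a rough sketch of that edit, the Python snippet below rewrites the grid_resource line of a generated submit file. It assumes the file contains a line of the form "grid_resource = gt2 server.nova.cn/jobmanager-fork" (inferred from the GridResource value in the user log, not from an actual generated file). The function name, the sample submit text, and the gt4 contact string with port 8443 are illustrative assumptions. A gt4 resource is normally written as a service URL plus a scheduler type rather than a gt2-style host/jobmanager contact, so the whole value is replaced instead of only the "gt2" token; check the grid_resource documentation for your Condor version before relying on the exact form.

```python
import re

# Matches a "grid_resource = ..." line and captures everything up to the value.
GRID_RESOURCE_LINE = re.compile(
    r"^([ \t]*grid_resource[ \t]*=[ \t]*).*$",
    re.IGNORECASE | re.MULTILINE,
)


def set_grid_resource(submit_text, new_value):
    """Return submit_text with the value of each grid_resource line replaced."""
    new_text, count = GRID_RESOURCE_LINE.subn(
        lambda m: m.group(1) + new_value, submit_text
    )
    if count == 0:
        # The generated file may use a different keyword; do not guess.
        raise ValueError("no grid_resource line found in submit description")
    return new_text


if __name__ == "__main__":
    # Hypothetical submit description, for illustration only.
    sample = (
        "universe = grid\n"
        "grid_resource = gt2 server.nova.cn/jobmanager-fork\n"
        "queue\n"
    )
    # Assumed gt4 contact string; host, port and scheduler type must match your site.
    gt4 = "gt4 https://server.nova.cn:8443/wsrf/services/ManagedJobFactoryService Fork"
    print(set_grid_resource(sample, gt4))
```

The edited file can then be handed to condor_submit as usual.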

--Dan

hailong.yang1115 wrote:
Hi everyone,

I have installed condor-7.2.4, globus-4.2.1 and torque-2.3.7 on one machine to see how Condor-G and Condor glidein work. First I tried Condor-G, and everything worked fine: I could submit jobs to Condor, and through GRAM4 the jobs ran on PBS. But when I tried the condor_glidein command, it just hung:

[agrid@server condor-test]$ condor_glidein -count 1 -arch 7.3.2-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork server.nova.cn/jobmanager-pbs
Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
The following files were generated in the working directory:

[agrid@server condor-test]$ ll
Total 56
-rwxr-xr-x 1 agrid agrid 5086 10-27 17:58 glidein_remote_setup.8140
-rwxr-xr-x 1 agrid agrid 5086 10-27 17:59 glidein_remote_setup.8170
-rw-rw-r-- 1 agrid agrid    0 10-27 17:59 glidein_setup.error.8170
-rw-rw-r-- 1 agrid agrid  314 10-27 17:59 glidein_setup.log.8170
-rw-rw-r-- 1 agrid agrid    0 10-27 17:59 glidein_setup.output.8170
-rw-rw-r-- 1 agrid agrid  516 10-27 17:58 glidein_setup.submit.8140
-rw-rw-r-- 1 agrid agrid  516 10-27 17:59 glidein_setup.submit.8170
Here is the content of the glidein_setup.log.8170 file:
000 (037.000.000) 10/27 17:59:02 Job submitted from host: <10.10.3.159:57089>
...
020 (037.000.000) 10/27 17:59:15 Detected Down Globus Resource
    RM-Contact: server.nova.cn/jobmanager-fork
...
026 (037.000.000) 10/27 17:59:15 Detected Down Grid Resource
    GridResource: gt2 server.nova.cn/jobmanager-fork
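
The grid type Condor chose can be read straight out of a user log like the one above. The following is a minimal sketch of a parser for that purpose, assuming the layout seen in this excerpt: a header line with a three-digit event code, a job id in parentheses and a timestamp, then indented detail lines, with "..." separating events. The function names are made up for illustration and are not part of Condor.

```python
import re

# Header line of one event, e.g.
# "026 (037.000.000) 10/27 17:59:15 Detected Down Grid Resource"
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d+/\d+ \d+:\d+:\d+) (.*)$"
)


def parse_user_log(text):
    """Split a user log into events with their indented detail lines."""
    events = []
    current = None
    for line in text.splitlines():
        if line.strip() == "...":
            current = None  # "..." closes the current event
            continue
        match = EVENT_HEADER.match(line)
        if match:
            current = {
                "code": match.group(1),
                "job": "%d.%d" % (int(match.group(2)), int(match.group(3))),
                "time": match.group(5),
                "message": match.group(6),
                "details": [],
            }
            events.append(current)
        elif current is not None and line.strip():
            current["details"].append(line.strip())
    return events


def grid_types_marked_down(events):
    """Return the grid type (first word) of every GridResource detail line."""
    types = []
    for event in events:
        for detail in event["details"]:
            if detail.startswith("GridResource:"):
                fields = detail.split(":", 1)[1].split()
                if fields:
                    types.append(fields[0])
    return types
```

Applied to the excerpt above, this picks out "gt2" from the GridResource line of event 026, which is the value that needs to change for a GT4 setup.
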
I noticed that Condor automatically set the grid resource to gt2. Is this right? What I am using is gt4. I also turned on debug mode for the gridmanager and got the following information in the log file:

10/27 17:59:07 ******************************************************
10/27 17:59:07 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
10/27 17:59:07 ** /opt/condor-7.2.4/sbin/condor_gridmanager
10/27 17:59:07 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(10) class=DAEMON(1)
10/27 17:59:07 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
10/27 17:59:07 ** $CondorVersion: 7.2.4 Jun 16 2009 BuildID: 159529 $
10/27 17:59:07 ** $CondorPlatform: I386-LINUX_RHEL5 $
10/27 17:59:07 ** PID = 8202
10/27 17:59:07 ** Log last touched 10/27 11:54:30
10/27 17:59:07 ******************************************************
10/27 17:59:07 Using config source: /opt/condor-7.2.4/etc/condor_config
10/27 17:59:07 Using local config sources:
10/27 17:59:07    /opt/condor-7.2.4/local.server/condor_config.local
10/27 17:59:07 Running as root.  Enabling specialized core dump routines
10/27 17:59:07 DaemonCore: Command Socket at <10.10.3.159:34693>
10/27 17:59:07 Will use UDP to update collector server.nova.cn <10.10.3.159:9618>
10/27 17:59:07 [8202] Welcome to the all-singing, all dancing, "amazing" GridManager!
10/27 17:59:07 [8202] DaemonCore: in SendAliveToParent()
10/27 17:59:07 [8202] Initialized the following authorization table:
10/27 17:59:07 [8202] Authorizations yet to be resolved:
10/27 17:59:07 [8202] allow NEGOTIATOR:  */server.nova.cn */10.10.3.159
10/27 17:59:07 [8202] allow ADMINISTRATOR:  */server.nova.cn */10.10.3.159
10/27 17:59:07 [8202] allow OWNER:  */server.nova.cn */server.nova.cn */10.10.3.159 */10.10.3.159
10/27 17:59:07 [8202] DaemonCore: Leaving SendAliveToParent() - success
10/27 17:59:07 [8202] Checking proxies
10/27 17:59:10 [8202] Received ADD_JOBS signal
10/27 17:59:10 [8202] in doContactSchedd()
10/27 17:59:10 [8202] querying for new jobs
10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (((Matched =!= FALSE) && (JobStatus != 5)) || (Managed =?= "External"))
10/27 17:59:10 [8202] Using job type Globus for job 37.0
10/27 17:59:10 [8202] (37.0) SetJobLeaseTimers()
10/27 17:59:10 [8202] Found job 37.0 --- inserting
10/27 17:59:10 [8202] Fetched 1 new job ads from schedd
10/27 17:59:10 [8202] querying for removed/held jobs
10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
10/27 17:59:10 [8202] Fetched 0 job ads from schedd
10/27 17:59:10 [8202] leaving doContactSchedd()
10/27 17:59:10 [8202] gahp server not up yet, delaying ping
10/27 17:59:10 [8202] *** UpdateLeases called
10/27 17:59:10 [8202]     Leases not supported, cancelling timer
10/27 17:59:10 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:10 [8202] GAHP server not initialized yet, not submitting grid_monitor now
10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_INIT, globusState 32
10/27 17:59:10 [8202] Create_Process: using fast clone() to create child process.
10/27 17:59:10 [8202] GAHP server pid = 8206
10/27 17:59:10 [8202] GAHP server version: $GahpVersion: 1.0.16 Jun 16 2009 UW Gahp $
10/27 17:59:10 [8202] GAHP[8206] <- 'COMMANDS'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'COMMANDS' 'GASS_SERVER_INIT' 'GRAM_CALLBACK_ALLOW' 'GRAM_ERROR_STRING' 'GRAM_JOB_CALLBACK_REGISTER' 'GRAM_JOB_CANCEL' 'GRAM_JOB_REQUEST' 'GRAM_JOB_SIGNAL' 'GRAM_JOB_STATUS' 'GRAM_PING' 'INITIALIZE_FROM_FILE' 'QUIT' 'RESULTS' 'VERSION' 'ASYNC_MODE_ON' 'ASYNC_MODE_OFF' 'RESPONSE_PREFIX' 'REFRESH_PROXY_FROM_FILE' 'CACHE_PROXY_FROM_FILE' 'USE_CACHED_PROXY' 'UNCACHE_PROXY' 'GRAM_JOB_REFRESH_PROXY'
10/27 17:59:10 [8202] GAHP[8206] <- 'RESPONSE_PREFIX GAHP:'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'ASYNC_MODE_ON'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'INITIALIZE_FROM_FILE /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 2 /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'USE_CACHED_PROXY 2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u502'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'GRAM_CALLBACK_ALLOW 2 0'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'https://server.nova.cn:56844/'
10/27 17:59:10 [8202] GAHP[8206] <- 'GASS_SERVER_INIT 3 0'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:10 [8202] GAHP[8206] -> 'R'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:10 [8202] GAHP[8206] -> '3' '0' 'https://server.nova.cn:51333'
10/27 17:59:10 [8202] (37.0) gm state change: GM_INIT -> GM_START
10/27 17:59:10 [8202] (37.0) gm state change: GM_START -> GM_CLEAR_REQUEST
10/27 17:59:10 [8202] (37.0) gm state change: GM_CLEAR_REQUEST -> GM_UNSUBMITTED
10/27 17:59:10 [8202] (37.0) gm state change: GM_UNSUBMITTED -> GM_SUBMIT
10/27 17:59:10 [8202] Final RSL: &(rsl_substitution=(GRIDMANAGER_GASS_URL https://server.nova.cn:51333))(executable=$(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_remote_setup.8170')(directory='/tmp')(arguments=$(HOME)#'/Condor_glidein' $(HOME)#'/Condor_glidein/7.3.2-i686-pc-Linux-2.4' '7.3.2-i686-pc-Linux-2.4' $(HOME)#'/Condor_glidein/local' 'http://www.cs.wisc.edu/condor/glidein/binaries' '0')(stdout=$(GLOBUS_CACHED_STDOUT))(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDOUT) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.output.8170')($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.error.8170'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))
10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' '0'
10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
10/27 17:59:15 [8202] GAHP[8206] <- 'GRAM_PING 4 server.nova.cn:2119'
10/27 17:59:15 [8202] GAHP[8206] -> 'S'
10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119: first ping not done yet, will retry later
10/27 17:59:15 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:15 [8202] GAHP[8206] -> 'R'
10/27 17:59:15 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:15 [8202] GAHP[8206] -> '4' '79'
10/27 17:59:15 [8202] resource server.nova.cn:2119 is now down
10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
10/27 17:59:15 [8202] (37.0) Writing globus down record to user logfile
10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.411181 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.431371 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
10/27 17:59:15 [8202] (37.0) Writing grid source down record to user logfile
10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.431812 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.437901 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
10/27 17:59:15 [8202] in doContactSchedd()
10/27 17:59:15 [8202] querying for removed/held jobs
10/27 17:59:15 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
10/27 17:59:15 [8202] Fetched 0 job ads from schedd
10/27 17:59:15 [8202] Updating classad values for 37.0:
10/27 17:59:15 [8202]    GridResourceUnavailableTime = 1256637555
10/27 17:59:15 [8202]    GlobusResourceUnavailableTime = 1256637555
10/27 17:59:15 [8202] leaving doContactSchedd()
10/27 17:59:15 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_JOB_REQUEST 5 server.nova.cn:2119/jobmanager-fork NULL 1 &(executable=https://server.nova.cn:51333/opt/condor-7.2.4/sbin/grid_monitor.sh)(stdout=https://server.nova.cn:51333/tmp/condor_g_scratch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-log)(arguments='--dest-url=https://server.nova.cn:51333/tmp/condor_g_scratch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-job-status')'
10/27 17:59:20 [8202] GAHP[8206] -> 'S'
10/27 17:59:20 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:20 [8202] GAHP[8206] -> 'R'
10/27 17:59:20 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:20 [8202] GAHP[8206] -> '5' '12' 'NULL'
10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_ERROR_STRING 12'
10/27 17:59:20 [8202] GAHP[8206] -> 'S' 'the connection to the server failed (check host and port)'
10/27 17:59:20 [8202] grid_monitor job submit failed for resource server.nova.cn:2119, gram error 12 (the connection to the server failed (check host and port))
10/27 17:59:20 [8202] Giving up on grid_monitor for site server.nova.cn:2119.  Will retry in 3600 seconds (60 minutes)
10/27 17:59:20 [8202] Stopping grid_monitor for resource server.nova.cn:2119
I think this log indicates that server.nova.cn:2119 is down. However, I checked this port and got the following answer:
[agrid@server condor-test]$ netstat -nat | grep 2119
tcp        0      0 0.0.0.0:2119                0.0.0.0:*                   LISTEN
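
netstat only shows that some process is listening on port 2119, while GRAM error 12 is reported by the client when its connection attempt fails. One way to compare the two is to try a plain TCP connection from the submit side. The snippet below is a minimal sketch of such a check; the function name and the timeout are arbitrary choices, and a successful connect does not prove that a GT2 gatekeeper is what answers on that port.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Host and port taken from the gridmanager log above.
    print(can_connect("server.nova.cn", 2119))
```
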
Any ideas would be appreciated.

-Hailong
2009-10-27
------------------------------------------------------------------------
***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx <mailto:hailong.yang1115@xxxxxxxxx>
* Address: G413, New Main Building in Beihang University,
*              No.37 XueYuan Road,HaiDian District,
*              Beijing,P.R.China,100191
***********************************************
------------------------------------------------------------------------
