
Re: [Condor-users] Trying to run condor_glidein on the National Grid Service



Hi Dan,

Yes that's the right address (I substituted the real address with 129.130.131.132). Yes, there is a firewall on the HPC.

I assume that may be the reason why the connection cannot be established.

Basically, I know the HPC firewall allows outbound connections on port 80. If I were to run the collector on port 80, would that be OK? (Some HPC systems on the NGS only allow connections through port 443, so I may need to redirect connections if I did something like that.)

Currently, I am trying to set up the GCB, but I have issues with it.

I added GCB_BROKER to the daemon list in /home/condor/condor_config.local (DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD, GCB_BROKER).
I also appended the following lines to /home/condor/condor_config.local:


GCB_BROKER = $(RELEASE_DIR)/libexec/gcb_broker
GCB_RELAY = $(RELEASE_DIR)/libexec/gcb_relay_server
GCB_BROKER_ENV =
GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_RELAY_SERVER=$(GCB_RELAY)
GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_LOG_DIR=$(LOG)
GCB_BROKER_ENVIRONMENT = $(GCB_BROKER_ENV)
GCB_BROKER_IP = $(ip_address)
GCB_BROKER_ARGS = -i $(GCB_BROKER_IP)
NET_REMAP_ENABLE = true
NET_REMAP_SERVICE = GCB
NET_REMAP_INAGENT = 129.130.131.132
NET_REMAP_ROUTE = /home/condor/condor_routetable.txt
BIND_ALL_INTERFACES = true

I also wrote a "route table":

[me@mycomputer ~]$ cat /home/condor/condor_routetable.txt
129.11.27.0/24 GCB
*/0 direct
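
A minimal Python sketch of how a table like this might be read. It assumes that rules are checked from top to bottom with the first match winning, and that `*/0` matches any address; neither assumption is confirmed here against GCB itself. The names `parse_routes` and `lookup` are hypothetical and are not part of Condor or GCB.

```python
import ipaddress

# The two rules from condor_routetable.txt above.
ROUTE_TABLE = """\
129.11.27.0/24 GCB
*/0 direct
"""

def parse_routes(text):
    """Turn 'target/mask method' lines into (network, method) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        target, method = line.split()
        if target.startswith("*"):
            # Assumption: '*/0' is a catch-all, i.e. every IPv4 address.
            network = ipaddress.ip_network("0.0.0.0/0")
        else:
            network = ipaddress.ip_network(target)
        rules.append((network, method))
    return rules

def lookup(rules, address):
    """Return the method of the first rule matching address (assumed semantics)."""
    ip = ipaddress.ip_address(address)
    for network, method in rules:
        if ip in network:
            return method
    return None

rules = parse_routes(ROUTE_TABLE)
print(lookup(rules, "129.11.27.40"))  # GCB
print(lookup(rules, "10.141.0.9"))    # direct
```

Under these assumptions, only targets inside 129.11.27.0/24 would be reached through GCB, and everything else would be contacted directly.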

But immediately after I started Condor, I saw warnings in the log files (especially in the CollectorLog) and errors in SchedLog and StartLog:

MasterLog:
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_master (CONDOR_MASTER) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_master
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5221
10/16 21:28:54 ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54    /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9620>
10/16 21:28:54 Log file not found in config file: GCB_BROKER_LOG
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_collector", pid and pgroup = 5222
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_negotiator", pid and pgroup = 5223
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_startd", pid and pgroup = 5224
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_schedd", pid and pgroup = 5225
10/16 21:28:54 Started process "/opt/condor-release-6.8.8/libexec/gcb_broker", pid and pgroup = 5226


[jgrunche@epistasis ~]$ cat /home/condor/log/StartLog
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_startd (CONDOR_STARTD) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_startd
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5224
10/16 21:28:54 ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54    /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9644>
10/16 21:28:55 vm1: New machine resource allocated
10/16 21:28:55 vm2: New machine resource allocated
10/16 21:28:55 About to run initial benchmarks.
10/16 21:29:02 Completed initial benchmarks.
10/16 21:29:02 vm1: State change: IS_OWNER is false
10/16 21:29:02 vm1: Changing state: Owner -> Unclaimed
10/16 21:29:02 vm2: State change: IS_OWNER is false
10/16 21:29:02 vm2: Changing state: Owner -> Unclaimed
10/16 21:29:02 GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
10/16 21:29:07 GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use



[jgrunche@epistasis ~]$ cat /home/condor/log/SchedLog
10/16 21:28:54 (pid:5225) ******************************************************
10/16 21:28:54 (pid:5225) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 21:28:54 (pid:5225) ** /opt/condor-release-6.8.8/sbin/condor_schedd
10/16 21:28:54 (pid:5225) ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 (pid:5225) ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 (pid:5225) ** PID = 5225
10/16 21:28:54 (pid:5225) ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 (pid:5225) ******************************************************
10/16 21:28:54 (pid:5225) Using config source: /home/condor/condor_config
10/16 21:28:54 (pid:5225) Using local config sources:
10/16 21:28:54 (pid:5225)    /home/condor/condor_config.local
10/16 21:28:54 (pid:5225) DaemonCore: Command Socket at <129.130.131.132:9623>
10/16 21:28:54 (pid:5225) History file rotation is enabled.
10/16 21:28:54 (pid:5225)   Maximum history file size is: 20971520 bytes
10/16 21:28:54 (pid:5225)   Number of rotated history files is: 2
10/16 21:28:54 (pid:5225) Sent ad to central manager for me@xxxxxxxxxxxxxxxxxxx
10/16 21:28:54 (pid:5225) Sent ad to 1 collectors for me@xxxxxxxxxxxxxxxxxxx
10/16 21:28:54 (pid:5225) After chmod(), still can't remove "/tmp/condor_g_scratch.0x9931278.4435" as directory owner, giving up!
10/16 21:28:54 (pid:5225) Started condor_gmanager for owner me pid=5239
10/16 21:30:49 (pid:5225) condor_gridmanager (PID 5239, owner me) exited with return code 0.
10/16 21:33:54 (pid:5225) GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
10/16 21:33:54 (pid:5225) Sent owner (0 jobs) ad to 1 collectors








[me@mycomputer ~]$ cat /home/condor/log/CollectorLog
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_collector
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5222
10/16 21:28:54 ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54    /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9618>
10/16 21:28:54 In ViewServer::Init()
10/16 21:28:54 In CollectorDaemon::Init()
10/16 21:28:54 In ViewServer::Config()
10/16 21:28:54 In CollectorDaemon::Config()
10/16 21:28:54 enable: Creating stats hash table
10/16 21:28:54 (Sending 0 ads in response to query)
10/16 21:28:54 Got QUERY_STARTD_PVT_ADS
10/16 21:28:54 (Sending 0 ads in response to query)
10/16 21:28:54 NegotiatorAd  : Inserting ** "< mycomputer.ed.ac.uk >"
10/16 21:28:54 stats: Inserting new hashent for 'Negotiator':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:28:54 WARNING:  No master ad for < mycomputer.ed.ac.uk >
10/16 21:28:54 ScheddAd : Inserting ** "< mycomputer.ed.ac.uk , 129.130.131.132 >"
10/16 21:28:54 stats: Inserting new hashent for 'Schedd':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:28:54 SubmittorAd : Inserting ** "< me@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:28:54 stats: Inserting new hashent for 'Submittor':'me@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:28:59 ** Master < mycomputer.ed.ac.uk > rejuvenated from recently down
10/16 21:28:59 stats: Inserting new hashent for 'Master':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:29:06 WARNING:  No master ad for < vm1@xxxxxxxxxxxxxxxxxxx >
10/16 21:29:06 StartdAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:06 stats: Inserting new hashent for 'Start':'vm1@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:06 StartdPvtAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:06 stats: Inserting new hashent for 'StartdPvt':'vm1@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:07 WARNING:  No master ad for < vm2@xxxxxxxxxxxxxxxxxxx >
10/16 21:29:07 StartdAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:07 stats: Inserting new hashent for 'Start':'vm2@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:07 StartdPvtAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:07 stats: Inserting new hashent for 'StartdPvt':'vm2@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'



[me@mycomputer ~]$ cat /home/condor/log/GCB_BrokerLog
10/16 21:28:54 ****************************************
10/16 21:28:54 New log file started
10/16 21:28:54 Max size = 640000
10/16 21:28:54 Log level: D_BASIC
10/16 21:28:54 ****************************************
10/16 21:28:54 [broker.C:199] ++++++++++++++++++++++++++++++
10/16 21:28:54 [broker.C:200] + STARTING Broker (pid: 5226)
10/16 21:28:54 [broker.C:201] + $GCBVersion: 1.3.2 $
10/16 21:28:54 [broker.C:202] + $GCBBuildDate: Dec 19 2007 $
10/16 21:28:54 [broker.C:255] + Listening at 129.130.131.132:65432
10/16 21:28:54 [broker.C:275] + Using relay_server: /opt/condor-release-6.8.8/libexec/gcb_relay_server
10/16 21:28:54 [broker.C:276] ++++++++++++++++++++++++++++++
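
For picking the warnings and errors out of logs like the ones above, here is a small Python sketch. It assumes every log line starts with an `MM/DD HH:MM:SS` timestamp, optionally followed by a `(pid:N)` tag, as in the excerpts; the function name `find_problems` is hypothetical and is not a Condor tool.

```python
import re

# Timestamp, optional "(pid:N) " tag, then the message.
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:\(pid:\d+\) )?(.*)$")

def find_problems(log_text):
    """Return (timestamp, message) pairs for lines mentioning ERROR or WARNING."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not a timestamped log line
        stamp, message = match.groups()
        if "ERROR" in message or "WARNING" in message:
            hits.append((stamp, message))
    return hits

# A few lines copied from the StartLog, CollectorLog and SchedLog above.
SAMPLE = """\
10/16 21:29:02 vm2: Changing state: Owner -> Unclaimed
10/16 21:29:02 GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
10/16 21:28:54 WARNING:  No master ad for < mycomputer.ed.ac.uk >
10/16 21:33:54 (pid:5225) GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
"""

hits = find_problems(SAMPLE)
for stamp, message in hits:
    print(stamp, message)
```

On the sample, this keeps the two GCB_bind errors and the collector warning and drops the state-change line.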

Thank you for your help,

Jean-Alain













Jean-Alain,

It sounds like there are two problems when your glideins try to run:

10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113

Is that the address of the collector to which the glideins should be advertising themselves? Are there any firewalls or anything that would prevent them from connecting?
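
One way to test this from the remote side is a plain TCP connection attempt to the collector address, sketched below in Python. This assumes a Python interpreter is available on the node where the glidein runs; `check_tcp` is a hypothetical helper, not a Condor command. On Linux, errno 113 is EHOSTUNREACH ("No route to host").

```python
import errno
import socket

def check_tcp(host, port, timeout=10.0):
    """Attempt one TCP connection; return (ok, reason).

    reason is None on success, "TIMEOUT" if nothing answered in time,
    or the symbolic errno name (for example "EHOSTUNREACH") otherwise.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        return False, "TIMEOUT"
    except OSError as exc:
        # Map the numeric errno to its name where the platform knows it.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))

# Example call, using the address and port from the log line above:
#   ok, reason = check_tcp("129.130.131.132", 9618)
#   print("reachable" if ok else "failed: " + reason)
```

If this fails from the execute node with the same error as the glidein, the problem lies in the network path or a firewall rather than in Condor itself.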

10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

This confusing line in the logs can be ignored. It is no longer produced in the 7.0 or 7.1 series of Condor.

10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C

I don't know what is going wrong. Do you have any special SMP-related configuration in your glidein configuration? Normally, one configures glideins with NUM_CPUS=1 to force each instance of glidein to only advertise a single slot, rather than one per cpu on the machine.

--Dan

Jean-Alain Grunchec wrote:

Hi,

I am trying to run GlideIn jobs on the UK National Grid Service. I set up a local machine as a Condor Central Manager and installed Globus Toolkit 4 on it.

Now I am trying to submit GlideIn jobs to an HPC system in Leeds (the ultimate goal being to submit many GlideIn jobs to several NGS resources).

So I ran the following command, which does at least start something in Leeds:









[me@mycomputer test]$ condor_glidein -count 10 -arch 6.6.7-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork ngs.leeds.ac.uk/jobmanager-pbs

Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
Installing /home/ngs0123/Condor_glidein/glidein_condor_config.
Installing /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup.
Installing Condor daemons in /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.

Downloaded http://www.cs.wisc.edu/condor/glidein/binaries/6.6.7-i686-pc-Linux-2.4.tar.gz to /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.

Installation successfully completed.

Launching Glidein...
Submitting Glidein job...
Submitting job(s).
1 job(s) submitted to cluster 5.
You have new mail in /var/spool/mail/me


Ideally, at this point I would have 10 new machines added to my Condor pool, so I checked, but no new machines had appeared. I then read the email that was sent:




Date: Wed, 15 Oct 2008 22:14:19 +0100
From: Me <me@xxxxxxxxxxxxxxxxxxx>
Message-Id: <200810152114.m9FLEJAu003357@xxxxxxxxxxxxxxxxxxx>
To: me@xxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 4.0

This is an automated email from the Condor system
on machine "mycomputer.ed.ac.uk".  Do not reply.

Your Condor job 4.0
/home/me/test/glidein_remote_setup.3117 $(HOME)/Condor_glidein $(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4 6.6.7-i686-pc-Linux-2.4 $(HOME)/Condor_glidein/local 'http://www.cs.wisc.edu/condor/glidein/binaries,gsiftp://gridftp.cs.wisc.edu/p/condor/public/binaries/glidein' 0
has exited.


Submitted at:        Wed Oct 15 22:10:55 2008
Completed at:        Wed Oct 15 22:14:19 2008
Real Time:             0 00:03:24

Something ran, but I am not sure the GlideIn jobs really ran OK.

So I looked on the head node in Leeds for leftover temporary files, and there are a few.


10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_master (CONDOR_MASTER) STARTING UP
10/15 22:19:05 ** /nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_master
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20178
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file: /home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:53102>
10/15 22:19:05 Started DaemonCore process "/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd", pid and pgroup = 20179
10/15 22:19:07 The STARTD (pid 20179) exited with status 4
10/15 22:19:07 Sending obituary for "/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd"
10/15 22:19:07 restarting /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd in 10 seconds
10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113
10/15 22:19:10 Will keep trying for 10 seconds...
10/15 22:19:19 Connect failed for 10 seconds; returning FALSE
10/15 22:19:19 ERROR:
SECMAN:2003:TCP connection to <129.130.131.132:9618> failed


Apparently the glidein cannot connect to the server here.


I then read the condor_startd log in Leeds, and it is even more bizarre.

[ngs0123@ngs log.10.141.0.9-20178]$ cat StartdLog
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_startd (CONDOR_STARTD) STARTING UP
10/15 22:19:05 ** /nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20179
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file: /home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:48585>
10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
       Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
       Available:  Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success


Has anybody managed to use GlideIn on the NGS? Alternatively, if somebody has used GlideIn on another Grid, your experience may help me.

Thank you very much,

J-A

