
Re: [Condor-users] Trying to run condor_glidein on the National Grid Service




Jean-Alain,

It sounds like there are two problems when your glideins try to run:

10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113

Is that the address of the collector to which the glideins should be advertising themselves? Are there any firewalls or anything that would prevent them from connecting?
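On Linux, errno 113 is EHOSTUNREACH ("No route to host"), which usually points at routing or a firewall rather than at Condor itself. A quick way to test this from the node where the glidein ran is a small script like the one below. This is only a sketch: it assumes Python 3 is available there, and the function name check_collector is made up for illustration.

```python
import os
import socket


def check_collector(host, port, timeout=10.0):
    """Attempt one TCP connection and report (ok, errno, message)."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True, None, "connected"
    except OSError as exc:
        # Timeouts raised by the socket module carry no errno, hence the fallback
        code = exc.errno
        text = os.strerror(code) if code is not None else str(exc)
        return False, code, text
```

Calling check_collector("129.130.131.132", 9618) with the address and port from the log line above should return True as its first element once the collector is reachable. If it is not reachable, the second element is the errno to compare against the 113 in the log.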

10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

This confusing log line can be ignored. It is no longer produced in the 7.0 or 7.1 series of Condor.

10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C

I don't know what is going wrong here. Do you have any special SMP-related settings in your glidein configuration? Normally, one configures glideins with NUM_CPUS=1 to force each glidein instance to advertise only a single slot, rather than one per CPU on the machine.
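For reference, a minimal sketch of that setting, assuming it is added to the glidein_condor_config file named in the logs below. Whether condor_glidein overwrites that file on a later setup run is something to check.

```
# Advertise a single slot per glidein instead of one per detected CPU
NUM_CPUS = 1
```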

--Dan

Jean-Alain Grunchec wrote:

Hi,

I am trying to run GlideIn jobs on the UK National Grid Service. I set up a local machine as a Condor Central Manager. I put Globus-Toolkit 4 on it.

Now I am trying to submit GlideIn jobs to an HPC system in Leeds (the ultimate idea being to submit many GlideIn jobs to several NGS resources).

So I run the following command, which at least starts something in Leeds:




[me@mycomputer test]$ condor_glidein -count 10 -arch 6.6.7-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork ngs.leeds.ac.uk/jobmanager-pbs

Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
Installing /home/ngs0123/Condor_glidein/glidein_condor_config.
Installing /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup.
Installing Condor daemons in /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.

Downloaded http://www.cs.wisc.edu/condor/glidein/binaries/6.6.7-i686-pc-Linux-2.4.tar.gz to /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.

Installation successfully completed.

Launching Glidein...
Submitting Glidein job...
Submitting job(s).
1 job(s) submitted to cluster 5.
You have new mail in /var/spool/mail/me


Ideally, at this point I would have 10 new machines added to my Condor pool, so I check, but there are no new machines there. I read the email that was sent:




Date: Wed, 15 Oct 2008 22:14:19 +0100
From: Me <me@xxxxxxxxxxxxxxxxxxx>
Message-Id: <200810152114.m9FLEJAu003357@xxxxxxxxxxxxxxxxxxx>
To: me@xxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 4.0

This is an automated email from the Condor system
on machine "mycomputer.ed.ac.uk".  Do not reply.

Your Condor job 4.0
/home/me/test/glidein_remote_setup.3117 $(HOME)/Condor_glidein $(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4 6.6.7-i686-pc-Linux-2.4 $(HOME)/Condor_glidein/local 'http://www.cs.wisc.edu/condor/glidein/binaries,gsiftp://gridftp.cs.wisc.edu/p/condor/public/binaries/glidein' 0
has exited.


Submitted at:        Wed Oct 15 22:10:55 2008
Completed at:        Wed Oct 15 22:14:19 2008
Real Time:             0 00:03:24

Something has run somehow, but I am not sure GlideIn jobs really ran OK.

So I look on the head node in Leeds for any leftover files, and yes, there are a few. Here is the log from condor_master:


10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_master (CONDOR_MASTER) STARTING UP
10/15 22:19:05 ** /nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_master
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20178
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file: /home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:53102>
10/15 22:19:05 Started DaemonCore process "/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd", pid and pgroup = 20179
10/15 22:19:07 The STARTD (pid 20179) exited with status 4
10/15 22:19:07 Sending obituary for "/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd"
10/15 22:19:07 restarting /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd in 10 seconds
10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113
10/15 22:19:10 Will keep trying for 10 seconds...
10/15 22:19:19 Connect failed for 10 seconds; returning FALSE
10/15 22:19:19 ERROR:
SECMAN:2003:TCP connection to <129.130.131.132:9618> failed


Apparently there is a problem connecting to the server here.


I looked at the condor_startd log in Leeds, and it is even more bizarre.

[ngs0123@ngs log.10.141.0.9-20178]$ cat StartdLog
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_startd (CONDOR_STARTD) STARTING UP
10/15 22:19:05 ** /nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20179
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file: /home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:48585>
10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
       Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
       Available:  Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success


Has anybody managed to use GlideIn on the NGS? Alternatively, if somebody has used GlideIn on another grid, your experience may help me.

Thank you very much,

J-A