
Re: [Condor-users] trying to get HDFS working



Any suggestions from the Condor team on what might be wrong or where
else I might look?

- dave


On Fri, 2010-04-30 at 14:10 -0500, David A. Kotz wrote:
> Thanks, Dan, but I think I have all of those covered.  I corrected the 
> HDFS setting:
> 
>    HDFS = $(SBIN)/condor_hdfs
> 
> pointed JAVA at our Java 6 install, and pointed HDFS_HOME at our 
> existing Hadoop install.  HDFS_NAMENODE is set to the machine that has 
> HDFS_SERVICES = HDFS_NAMENODE, and the other machines have 
> HDFS_SERVICES = HDFS_DATANODE.  I also changed DedicatedScheduler to 
> point to the namenode, because I ran across something that seemed to 
> indicate I should.
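> 
> As a rough sketch, the settings described above might look like this 
> when gathered in one place.  The paths, host name and port are 
> placeholders, and the DAEMON_LIST line is an assumption about how the 
> master is told to start the daemon:
> 
>    # common to all machines (placeholder paths and host)
>    DAEMON_LIST = $(DAEMON_LIST), HDFS
>    HDFS = $(SBIN)/condor_hdfs
>    JAVA = /path/to/java6/bin/java
>    HDFS_HOME = /path/to/hadoop
>    HDFS_NAMENODE = namenode.example.edu:9000
> 
>    # on the namenode machine only
>    HDFS_SERVICES = HDFS_NAMENODE
> 
>    # on each of the other machines
>    HDFS_SERVICES = HDFS_DATANODE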
> 
> - dave
> 
> 
> Dan Bradley wrote:
> > 
> > I have also tried to get Condor's HDFS support to work.  I haven't quite 
> > finished, but what I found may be helpful to you.
> > 
> > In my case, I found that the condor package did not contain the 
> > necessary HDFS jar files.  I had to download these and install them in 
> > condor's libexec/hdfs/lib directory.  I used the 0.20.2 hadoop release.
> > 
> > I also found that the version of java on my system (gij (GNU libgcj) 
> > version 4.1.2) did not appear to work with HDFS.  Instead, I used 
> > jdk1.6.0_20 from Sun.
> > 
> > I also found that the documentation for HDFS_SERVICES is confusing.  It 
> > appears that it is supposed to be set equal to either HDFS_NAMENODE or 
> > HDFS_DATANODE.
> > 
> > Hope that helps.
> > 
> > --Dan
> > 
> > David A. Kotz wrote:
> >> I'm running Condor 7.4.2 on Linux, and I'm trying unsuccessfully to 
> >> get HDFS running.  The HDFS daemon seems to load, but then immediately 
> >> exits.
> >>
> >> I've set up one machine as the namenode and a number of our cluster 
> >> nodes as data nodes.  I created the HDFS_NAMENODE_DIR and 
> >> HDFS_DATANODE_DIR directories on these machines and chowned them to 
> >> the Condor user.  I left HDFS_DATANODE_ADDRESS = 0.0.0.0:0 because 
> >> the docs seem to indicate that it's okay to do so.
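> >>
> >> As a sketch, the storage-related settings described above would 
> >> look something like this (the directory paths are placeholders):
> >>
> >>    # namenode machine
> >>    HDFS_NAMENODE_DIR = /path/to/hdfs/name
> >>
> >>    # data node machines
> >>    HDFS_DATANODE_DIR = /path/to/hdfs/data
> >>    HDFS_DATANODE_ADDRESS = 0.0.0.0:0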
> >>
> >> Hadoop is version 0.20.2.
> >>
> >> When I start the HDFS daemon, it loads and then exits normally, 
> >> according to the MasterLog:
> >>
> >> 04/30 12:08:42 Started process "/lusr/condor/sbin/condor_hdfs", pid and pgroup = 28706
> >> 04/30 12:08:42 The HDFS (pid 28706) exited with status 0
> >> 04/30 12:08:42 restarting /lusr/condor/sbin/condor_hdfs in 3600 seconds
> >>
> >>
> >> I set HDFS_LOG4J=DEBUG and HDFS_DEBUG=D_ALL.
> >>
> >>
> >> Namenode log shows:
> >>
> >> 04/30 12:08:42 (fd:3) (pid:28706) NET_REMAP_ENABLE is undefined, using default value of False
> >> 04/30 12:08:42 (fd:3) (pid:28706) LOGS_USE_TIMESTAMP is undefined, using default value of False
> >> 04/30 12:08:42 (fd:3) (pid:28706) config: using subsystem 'HDFS', local ''
> >> 04/30 12:08:42 (fd:3) (pid:28706) Reading from /proc/cpuinfo
> >> 04/30 12:08:42 (fd:3) (pid:28706) Found: Physical-IDs:True; Core-IDs:True
> >> 04/30 12:08:42 (fd:3) (pid:28706) Analyzing 2 processors using IDs...
> >> 04/30 12:08:42 (fd:3) (pid:28706) Looking at processor #0 (PID:0, CID:0):
> >> 04/30 12:08:42 (fd:3) (pid:28706) Comparing P#0   and P#1  : pid:0!=0 or  cid:0!=1 (match=No)
> >> 04/30 12:08:42 (fd:3) (pid:28706) ncpus = 1
> >> 04/30 12:08:42 (fd:3) (pid:28706) P0: match->1
> >> 04/30 12:08:42 (fd:3) (pid:28706) Looking at processor #1 (PID:0, CID:1):
> >> 04/30 12:08:42 (fd:3) (pid:28706) ncpus = 2
> >> 04/30 12:08:42 (fd:3) (pid:28706) P1: match->1
> >> 04/30 12:08:42 (fd:3) (pid:28706) Using IDs: 2 processors, 2 CPUs, 0 HTs
> >> 04/30 12:08:42 (fd:3) (pid:28706) Reading condor configuration from '/lusr/condor/etc/condor_config'
> >> 04/30 12:08:42 (fd:3) (pid:28706) Finding local host information, calling gethostname()
> >> 04/30 12:08:42 (fd:3) (pid:28706) gethostname() returned fully qualified name "carrion.cs.utexas.edu"
> >> 04/30 12:08:42 (fd:3) (pid:28706) NET_REMAP_ENABLE is undefined, using default value of False
> >> 04/30 12:08:42 (fd:3) (pid:28706) PASSWD_CACHE_REFRESH is undefined, using default value of 319
> >> 04/30 12:08:42 (fd:3) (pid:28706) Trying to initialize local IP address (config file not read)
> >> 04/30 12:08:42 (fd:3) (pid:28706) Have not found an IP yet, calling gethostbyname()
> >> 04/30 12:08:42 (fd:3) (pid:28706) Trying to find IP addr for "carrion.cs.utexas.edu"
> >> 04/30 12:08:42 (fd:3) (pid:28706) Calling gethostbyname(carrion.cs.utexas.edu)
> >> 04/30 12:08:42 (fd:3) (pid:28706) Found IP addr in hostent: 128.83.120.7
> >> 04/30 12:08:42 (fd:3) (pid:28706) ENABLE_RUNTIME_CONFIG is undefined, using default value of False
> >> 04/30 12:08:42 (fd:3) (pid:28706) ENABLE_PERSISTENT_CONFIG is undefined, using default value of False
> >> 04/30 12:08:42 (fd:3) (pid:28706) Trying to initialize local IP address (after reading config)
> >> 04/30 12:08:42 (fd:3) (pid:28706) NETWORK_INTERFACE not in config file, using existing value
> >> 04/30 12:08:42 (fd:3) (pid:28706) ABORT_ON_EXCEPTION is undefined, using default value of False
> >> 04/30 12:08:42 (fd:3) (pid:28706) Config 'HDFS_LOG': no prefix ==> '$(LOG)/HDFSLog'
> >> 04/30 12:08:42 (fd:3) (pid:28706) Config 'MAX_HDFS_LOG': no prefix ==> '1000000'
> >> 04/30 12:08:42 (fd:3) (pid:28706) PRIV_UNKNOWN --> PRIV_CONDOR at daemon_core_main.cpp:1835
> >> 04/30 12:08:42 (fd:3) (pid:28707) KEYCACHE: created: 0x84bf5b0
> >> 04/30 12:08:42 (fd:3) (pid:28707) WANT_UDP_COMMAND_SOCKET is undefined, using default value of True
> >> 04/30 12:08:42 (fd:3) (pid:28707) HDFS_MAX_FILE_DESCRIPTORS is undefined, using default value of 0
> >> 04/30 12:08:42 (fd:3) (pid:28707) MAX_FILE_DESCRIPTORS is undefined, using default value of 0
> >> 04/30 12:08:42 (fd:3) (pid:28707) ******************************************************
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** condor_hdfs (CONDOR_HDFS) STARTING UP
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** /lusr/opt/condor-7.4.2/sbin/condor_hdfs
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** SubsystemInfo: name=HDFS type=DAEMON(11) class=DAEMON(1)
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** Configuration: subsystem:HDFS local:<NONE> class:DAEMON
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** $CondorVersion: 7.4.2 Mar 29 2010 BuildID: 227044 $
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** $CondorPlatform: I386-LINUX_RHEL5 $
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** PID = 28707
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** Log last touched 4/30 11:53:36
> >> 04/30 12:08:42 (fd:3) (pid:28707) ** Running as root: Privilege switching in effect
> >> 04/30 12:08:42 (fd:3) (pid:28707) ******************************************************
> >> 04/30 12:08:42 (fd:3) (pid:28707) Using config source: /lusr/condor/etc/condor_config
> >> 04/30 12:08:42 (fd:3) (pid:28707) Using local config sources:
> >> 04/30 12:08:42 (fd:3) (pid:28707)    /lusr/condor/etc/local/carrion
> >> 04/30 12:08:42 (fd:3) (pid:28707) Config 'LOG': no prefix ==> '$(RELEASE_DIR)/log/$(HOSTNAME)'
> >> 04/30 12:08:42 (fd:3) (pid:28707) Running as root.  Enabling specialized core dump routines
> >> 04/30 12:08:42 (fd:5) (pid:28707) Setting up command socket
> >> 04/30 12:08:42 (fd:5) (pid:28707) CONDOR_INHERIT: "31595 <128.83.120.7:57510> 0 0"
> >> 04/30 12:08:42 (fd:5) (pid:28707) Parent PID = 31595
> >> 04/30 12:08:42 (fd:5) (pid:28707) Parent Command Sock = <128.83.120.7:57510>
> >> 04/30 12:08:42 (fd:7) (pid:28707) LISTEN <128.83.120.7:45693> fd=5
> >> 04/30 12:08:42 (fd:7) (pid:28707)
> >> _______________________________________________
> >> Condor-users mailing list
> >> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> >> subject: Unsubscribe
> >> You can also unsubscribe by visiting
> >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>
> >> The archives can be found at:
> >> https://lists.cs.wisc.edu/archive/condor-users/
> 