Streaming Oracle Database Logs to HBase with Flume

In the previous tutorial we discussed streaming Oracle logs to HDFS using Flume. Flume supports various types of sources and sinks, including the HBase database as a sink. In this tutorial we shall discuss streaming an Oracle log file to HBase.
 
 

Setting the Environment

 
We use the same environment as in the tutorial on streaming to HDFS: Oracle Database 11g installed on Oracle Linux 6.5 on VirtualBox 4.3. We need to download and install the following software.
 
  1. Oracle Database 11g
  2. HBase
  3. Java 7
  4. Flume 1.4
  5. Hadoop 2.0.0
 
First, create a directory to install the software and set its permissions.
 
mkdir /flume
chmod -R 777 /flume
cd /flume
 
Create the hadoop group and add the hbase user to the hadoop group.
 
>groupadd hadoop
>useradd -g hadoop hbase
 
Download and install Java 7.
 
>tar zxvf jdk-7u55-linux-i586.tar.gz
 
 Download and install CDH 4.6 Hadoop 2.0.0.
 
>wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0.tar.gz
>tar -xvf hadoop-2.0.0-cdh4.6.0.tar.gz
 
Create symlinks for Hadoop bin and conf files.
 
>ln -s /flume/hadoop-2.0.0-cdh4.6.0/bin-mapreduce1 /flume/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce1/bin
>ln -s /flume/hadoop-2.0.0-cdh4.6.0/etc/hadoop /flume/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce1/conf
 
Download and install CDH 4.6 Flume 1.4.0.
 
wget http://archive-primary.cloudera.com/cdh4/cdh/4/flume-ng-1.4.0-cdh4.6.0.tar.gz
tar -xvf flume-ng-1.4.0-cdh4.6.0.tar.gz
 
Download and install CDH 4.6 HBase 0.94.15.
 
wget http://archive.cloudera.com/cdh4/cdh/4/hbase-0.94.15-cdh4.6.0.tar.gz
tar -xvf hbase-0.94.15-cdh4.6.0.tar.gz
 
Set permissions of the Flume root directory to global.
 
chmod 777 -R /flume/apache-flume-1.4.0-cdh4.6.0-bin
 
Set the environment variables for Oracle Database, Java, HBase, Flume, and Hadoop in the bash shell file.
 
vi ~/.bashrc
 
export HADOOP_PREFIX=/flume/hadoop-2.0.0-cdh4.6.0
export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop
export FLUME_HOME=/flume/apache-flume-1.4.0-cdh4.6.0-bin
export FLUME_CONF=/flume/apache-flume-1.4.0-cdh4.6.0-bin/conf
export HBASE_HOME=/flume/hbase-0.94.15-cdh4.6.0
export HBASE_CONF=/flume/hbase-0.94.15-cdh4.6.0/conf
export JAVA_HOME=/flume/jdk1.7.0_55
export ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=ORCL
export HADOOP_MAPRED_HOME=/flume/hadoop-2.0.0-cdh4.6.0
export HADOOP_HOME=/flume/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce1
export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HBASE_CONF:$HBASE_HOME/lib/*
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME/bin:$ORACLE_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin
export CLASSPATH=$HADOOP_CLASSPATH
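 
Reload the shell configuration so the new variables take effect in the current session; the echo line is just an illustrative sanity check.
 
source ~/.bashrc
echo $FLUME_HOME $HBASE_HOME $HADOOP_HOME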
 

Starting HDFS

 
In this section we shall configure and start HDFS. Change to the Hadoop configuration directory.
 
cd /flume/hadoop-2.0.0-cdh4.6.0/etc/hadoop
 
Set the NameNode URI (fs.defaultFS) and the Hadoop temporary directory (hadoop.tmp.dir) configuration properties in the core-site.xml file.
 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.0.2.15:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///var/lib/hadoop-0.20/cache</value>
  </property>
</configuration>
 
Remove any previously created temporary directory, then recreate it and set its permissions to global (777).
 
rm -rf /var/lib/hadoop-0.20/cache
mkdir -p /var/lib/hadoop-0.20/cache
chmod -R 777  /var/lib/hadoop-0.20/cache
 
Set the NameNode storage directory (dfs.namenode.name.dir), the superuser group (dfs.permissions.superusergroup), the replication factor (dfs.replication), the upper bound on the number of files a DataNode serves concurrently (dfs.datanode.max.xcievers), and permission checking (dfs.permissions) configuration properties in the hdfs-site.xml file.
 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
 
Remove any previously created NameNode storage directory, then create a new directory and set its permissions to global (777).
 
rm -rf /data/1/dfs/nn
mkdir -p /data/1/dfs/nn
chmod -R 777 /data/1/dfs/nn
 
Format and start the NameNode. The hadoop namenode command runs in the foreground, so run it in its own terminal.
 
hadoop namenode -format
hadoop namenode
 
Start the DataNode, also a foreground process, in another terminal.
 
hadoop datanode
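 
To verify that the NameNode and DataNode are running, list the Java processes and, optionally, print the cluster report (output will vary by setup).
 
jps
hdfs dfsadmin -report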
 
We need to copy the jars in the Flume lib directory to HDFS so that they are available to the runtime. Create a directory in HDFS with the same directory structure as the Flume lib directory and set its permissions to global.
 
hadoop dfs -mkdir /flume/apache-flume-1.4.0-cdh4.6.0-bin/lib
hadoop dfs -chmod -R 777 /flume/apache-flume-1.4.0-cdh4.6.0-bin/lib
 
Put the Flume lib directory jars into HDFS.
 
hdfs dfs -put /flume/apache-flume-1.4.0-cdh4.6.0-bin/lib/* hdfs://10.0.2.15:8020/flume/apache-flume-1.4.0-cdh4.6.0-bin/lib
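 
To confirm the copy, list the HDFS directory (an illustrative check).
 
hdfs dfs -ls hdfs://10.0.2.15:8020/flume/apache-flume-1.4.0-cdh4.6.0-bin/lib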
 
Create the Flume configuration file flume.conf from the template. Also create the Flume env file flume-env.sh from the template.
 
cp $FLUME_HOME/conf/flume-conf.properties.template $FLUME_HOME/conf/flume.conf
cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
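 
Optionally, set JAVA_HOME in flume-env.sh so the flume-ng script picks up the intended JDK; a minimal sketch using the JDK installed earlier:
 
echo "export JAVA_HOME=/flume/jdk1.7.0_55" >> $FLUME_HOME/conf/flume-env.sh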
 
We shall set the configuration properties for Flume in a subsequent section, but first we shall install HBase.
 

Starting HBase

 
In this section we shall configure and start HBase. HBase configuration is discussed in detail in another tutorial (http://www.toadworld.com/platforms/oracle/w/wiki/10976.loading-hbase-table-data-into-an-oracle-database-with-oracle-loader-for-hadoop.aspx). Set the HBase configuration in the /flume/hbase-0.94.15-cdh4.6.0/conf/hbase-site.xml configuration file as follows.
 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://10.0.2.15:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2182</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.regionserver.port</name>
    <value>60020</value>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>60000</value>
  </property>
</configuration>
 
Create the Zookeeper data directory and set its permissions.
 
mkdir -p /zookeeper
chmod -R 700 /zookeeper
 
As the root user, create the HBase root directory /hbase in HDFS and set its permissions to global (777).
 
root>hdfs dfs -mkdir /hbase
hdfs dfs -chmod -R 777 /hbase
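 
The new directory can be verified with a listing (an illustrative check; the output should include /hbase).
 
hdfs dfs -ls /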
 
As the root user, increase the maximum number of file handles in the /etc/security/limits.conf file by setting the following ulimits for the hdfs and hbase users.
 
hdfs  -       nofile  32768
hbase -       nofile  32768
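 
The limits take effect on the next login; they can be checked with the ulimit shell builtin, for example:
 
su - hbase -c 'ulimit -n'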
 
Start the HBase daemons: ZooKeeper, Master, and RegionServer.
 
hbase-daemon.sh start zookeeper
hbase-daemon.sh start master
hbase-daemon.sh start regionserver
 
The jps command should list the HDFS and HBase nodes as started.
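 
With all of the daemons running, the jps output should look similar to the following (the process IDs shown are illustrative):
 
jps
1234 NameNode
2345 DataNode
3456 HQuorumPeer
4567 HMaster
5678 HRegionServer
6789 Jps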
 
Start the HBase shell with the following command.
 
hbase shell
 
Create a table (flume) with a column family (orcllog) using the following command.
 
create 'flume', 'orcllog'
 
The HBase table gets created.
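 
The new table can be verified from the HBase shell (an illustrative check):
 
list
describe 'flume'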
 
  
 

Configuring the Flume Agent for HBase

 
In this section we shall set the Flume agent configuration in the flume.conf file. We shall configure the following properties in flume.conf for a Flume agent called hbase-agent.
 
Each entry below gives the configuration property, its description, and the value used.
 
Configuration Property: hbase-agent.channels
Description: The Flume agent channels. We shall be using only one channel, called ch1 (the channel name is arbitrary).
Value: hbase-agent.channels=ch1
 
Configuration Property: hbase-agent.sources
Description: The Flume agent sources. We shall be using one source of type exec, called tail (the source name is arbitrary).
Value: hbase-agent.sources=tail
 
Configuration Property: hbase-agent.sinks
Description: The Flume agent sinks. We shall be using one sink of type HBaseSink, called sink1 (the sink name is arbitrary).
Value: hbase-agent.sinks=sink1
 
Configuration Property: hbase-agent.channels.ch1.type
Description: The channel type is memory.
Value: hbase-agent.channels.ch1.type=memory
 
Configuration Property: hbase-agent.sources.tail.channels
Description: Define the flow by binding the source to the channel.
Value: hbase-agent.sources.tail.channels=ch1
 
Configuration Property: hbase-agent.sources.tail.type
Description: Specify the source type as exec.
Value: hbase-agent.sources.tail.type=exec
 
Configuration Property: hbase-agent.sources.tail.command
Description: Runs the specified Unix command and produces data on stdout. Commonly used commands are cat, to copy a complete log file to stdout, and tail, to copy the last KB of a log file. We shall be demonstrating both of these commands.
Value: hbase-agent.sources.tail.command=tail -F /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log
or: hbase-agent.sources.tail.command=cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log
 
Configuration Property: hbase-agent.sinks.sink1.channel
Description: Define the flow by binding the sink to the channel.
Value: hbase-agent.sinks.sink1.channel=ch1
 
Configuration Property: hbase-agent.sinks.sink1.type
Description: Specify the sink type as HBaseSink or AsyncHBaseSink.
Value: hbase-agent.sinks.sink1.type=org.apache.flume.sink.hbase.HBaseSink
 
Configuration Property: hbase-agent.sinks.sink1.table
Description: Specify the HBase table name.
Value: hbase-agent.sinks.sink1.table=flume
 
Configuration Property: hbase-agent.sinks.sink1.columnFamily
Description: Specify the HBase table column family.
Value: hbase-agent.sinks.sink1.columnFamily=orcllog
 
Configuration Property: hbase-agent.sinks.sink1.column
Description: Specify the column within the HBase table column family.
Value: hbase-agent.sinks.sink1.column=c1
 
Configuration Property: hbase-agent.sinks.sink1.serializer
Description: Specify the HBase event serializer class. The serializer converts a Flume event into one or more puts and/or increments.
Value: hbase-agent.sinks.sink1.serializer=org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
 
Configuration Property: hbase-agent.sinks.sink1.serializer.payloadColumn
Description: A parameter to the serializer. Specifies the payload column, the column into which the payload data is stored.
Value: hbase-agent.sinks.sink1.serializer.payloadColumn=coll
 
Configuration Property: hbase-agent.sinks.sink1.serializer.keyType
Description: A parameter to the serializer. Specifies the row key type.
Value: hbase-agent.sinks.sink1.serializer.keyType=timestamp
 
Configuration Property: hbase-agent.sinks.sink1.serializer.incrementColumn
Description: A parameter to the serializer. Specifies the column to be incremented. The SimpleHbaseEventSerializer may optionally be set to increment a column in HBase.
Value: hbase-agent.sinks.sink1.serializer.incrementColumn=coll
 
Configuration Property: hbase-agent.sinks.sink1.serializer.rowPrefix
Description: A parameter to the serializer. Specifies the row key prefix to be used.
Value: hbase-agent.sinks.sink1.serializer.rowPrefix=1
 
Configuration Property: hbase-agent.sinks.sink1.serializer.suffix
Description: A parameter to the serializer. Specifies the row key suffix; one of the following values may be set: uuid, random, timestamp.
Value: hbase-agent.sinks.sink1.serializer.suffix=timestamp
 
The complete flume.conf file is listed below.
 
hbase-agent.sources=tail
hbase-agent.sinks=sink1
hbase-agent.channels=ch1
hbase-agent.sources.tail.type=exec
hbase-agent.sources.tail.command=tail -F /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log
hbase-agent.sources.tail.channels=ch1
hbase-agent.sinks.sink1.type=org.apache.flume.sink.hbase.HBaseSink
hbase-agent.sinks.sink1.channel=ch1
hbase-agent.sinks.sink1.table=flume
hbase-agent.sinks.sink1.columnFamily=orcllog
hbase-agent.sinks.sink1.column=c1
hbase-agent.sinks.sink1.serializer=org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
hbase-agent.sinks.sink1.serializer.payloadColumn=coll
hbase-agent.sinks.sink1.serializer.keyType=timestamp
hbase-agent.sinks.sink1.serializer.incrementColumn=coll
hbase-agent.sinks.sink1.serializer.rowPrefix=1
hbase-agent.sinks.sink1.serializer.suffix=timestamp
hbase-agent.channels.ch1.type=memory
 
The alternative source exec command is as follows.
 
hbase-agent.sources.tail.command=cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log
 

Running the Flume Agent

 
In this section we shall run the Flume agent to stream the last KB of the alert_ORCL.log file to HBase using the tail command. We shall also stream the complete alert log file, alert_ORCL.log, using the cat command. Run the Flume agent using the flume-ng shell script, specifying the agent name with the -n option, the configuration directory with the --conf option, and the configuration file with the -f option. Set the Flume root logger with -Dflume.root.logger=INFO,console to log at INFO level to the console. Run the following command to start the Flume agent hbase-agent.
 
flume-ng agent --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/flume.conf -n hbase-agent  -Dflume.root.logger=INFO,console
 
HBase libraries get included for HBase access.
 
 
The source and sink get started.
 
 
The flume log output provides more detail of the Flume agent command.
 
05 Dec 2014 22:20:57,147 INFO  [lifecycleSupervisor-1-0] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start:61)  - Configuration provider starting
05 Dec 2014 22:20:57,194 INFO  [conf-file-poller-0] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run:133)  - Reloading configuration file:/flume/apache-flume-1.4.0-cdh4.6.0-bin/conf/flume.conf
05 Dec 2014 22:20:57,214 INFO  [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016)  - Processing:sink1
(org.apache.flume.conf.FlumeConfiguration.validateConfiguration:140)  - Post-validation flume configuration contains configuration for agents: [hbase-agent]
05 Dec 2014 22:20:57,502 INFO  [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:150)  - Creating channels
05 Dec 2014 22:20:57,529 INFO  [conf-file-poller-0] (org.apache.flume.channel.DefaultChannelFactory.create:40)  - Creating instance of channel ch1 type memory
05 Dec 2014 22:20:57,543 INFO  [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:205)  - Created channel ch1
05 Dec 2014 22:20:57,545 INFO  [conf-file-poller-0] (org.apache.flume.source.DefaultSourceFactory.create:39)  - Creating instance of source tail, type exec
05 Dec 2014 22:20:57,570 INFO  [conf-file-poller-0] (org.apache.flume.sink.DefaultSinkFactory.create:40)  - Creating instance of sink: sink1, type: org.apache.flume.sink.hbase.HBaseSink
05 Dec 2014 22:20:58,218 INFO  [conf-file-poller-0] (org.apache.flume.sink.hbase.HBaseSink.configure:218)  - The write to WAL option is set to: true
05 Dec 2014 22:20:58,223 INFO  [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.getConfiguration:119)  - Channel ch1 connected to [tail, sink1]
05 Dec 2014 22:20:58,238 INFO  [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:138)  - Starting new configuration:{ sourceRunners:{tail=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource{name:tail,state:IDLE} }} sinkRunners:{sink1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@a21d88 counterGroup:{ name:null counters:{} } }} channels:{ch1=org.apache.flume.channel.MemoryChannel{name: ch1}} }
05 Dec 2014 22:20:58,240 INFO  [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:145)  - Starting Channel ch1
05 Dec 2014 22:20:58,372 INFO  [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119)  - Monitored counter group for type: CHANNEL, name: ch1: Successfully registered new MBean.
05 Dec 2014 22:20:58,373 INFO  [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95)  - Component type: CHANNEL, name: ch1 started
05 Dec 2014 22:20:58,373 INFO  [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:173)  - Starting Sink sink1
05 Dec 2014 22:20:58,375 INFO  [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:184)  - Starting Source tail
05 Dec 2014 22:20:58,376 INFO  [lifecycleSupervisor-1-3] (org.apache.flume.source.ExecSource.start:163)  - Exec source starting with command:tail -F /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace
05 Dec 2014 22:20:58,396 INFO  [lifecycleSupervisor-1-3] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119)  - Monitored counter group for type: SOURCE, name: tail: Successfully registered new MBean.
05 Dec 2014 22:20:58,397 INFO  [lifecycleSupervisor-1-3] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95)  - Component type: SOURCE, name: tail started
 

Scanning the HBase Table

 
In this section we shall scan the HBase table after each run of the Flume agent: first after running the tail -F command, and again after running the cat command. Run the following command in the HBase shell to scan the flume table.
 
scan 'flume'
 
The Oracle log file data streamed into HBase gets listed.
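 
Given the serializer settings (rowPrefix=1, suffix=timestamp, payloadColumn=coll), each row key is the prefix 1 followed by a timestamp, and each cell holds a chunk of log text in the orcllog:coll column. An illustrative row (the key, timestamp, and value will differ):
 
ROW                COLUMN+CELL
 11417846858376    column=orcllog:coll, timestamp=1417846858376, value=Starting ORACLE instance (normal)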
 
 
Run the scan 'flume' command again after running the Flume agent with the cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log command.
 
More rows get listed as the complete Oracle log file is streamed.

ChannelException

 
If the channel capacity gets exceeded while the Flume agent is streaming events, the following exception may be generated.
 
: java.lang.InterruptedException
org.apache.flume.ChannelException: Unable to put batch on required channel:
Caused by: org.apache.flume.ChannelException: Space for commit to queue couldn't be acquired Sinks are likely not keeping up with sources, or the buffer size is too tight
 
A subsequent scan of the HBase table will list fewer rows than would have been listed had the complete log file been streamed without an exception.
 
 
To avoid the exception, increase the default channel capacity with the following configuration property in flume.conf.
 
hbase-agent.channels.ch1.capacity = 100000
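 
The memory channel's transactionCapacity and the sink's batchSize (both standard Flume properties) can be tuned alongside capacity; a minimal sketch, with illustrative values:
 
hbase-agent.channels.ch1.capacity = 100000
hbase-agent.channels.ch1.transactionCapacity = 1000
hbase-agent.sinks.sink1.batchSize = 100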
 
In this tutorial we streamed Oracle Database logs to HBase using Flume.