After Camus lands the data on HDFS in our data pipeline, the next step is to expose it to Hive for analysis. Since the raw data on HDFS is all in Protocol Buffer format, Elephant-Bird's ready-made Hive Protocol Buffer deserializer can be used directly. The only catch is that Protocol Buffers has no native BigDecimal type, so you have to implement that yourself, which means hacking the source code.
Install Hive
brew install hive
(but you need to unlink apache-spark first, since the two formulas conflict)
(the version I installed at the time was 0.14.0)
Build Elephant-Bird jar
The Quickstart in the official repo (https://github.com/twitter/elephant-bird) is fairly clear and concise.
Items 3, 4, and 5 are straightforward, as long as the environment required by items 1 and 2 is in place,
but steps 1 and 2 cost me quite a bit of time.
If you run into trouble, you can refer to how I handled steps 1 and 2 below.
Environment: MacBook Air, OS X Yosemite 10.10.2
- Protocol Buffer 2.4.1
- brew install protobuf241
- Apache Thrift 0.7.0
- brew install boost
- brew install autoconf automake libtool pkg-config libevent
- download thrift 0.7.0 (https://archive.apache.org/dist/thrift/0.7.0/)
- unpack it
- tar -xvf thrift-0.7.0.tar.gz
- chmod -R 775 the whole thrift-0.7.0 folder
- go into the thrift-0.7.0 folder
- sudo bash configure --with-boost=/usr/local/Cellar --with-libevent=/usr/local/Cellar --without-lua --without-php --without-cpp --without-c_glib --without-python --without-ruby --without-perl
- (this works around: src/concurrency/ThreadManager.h:24:10: fatal error: 'tr1/functional' file not found)
- chmod -R 775 the whole thrift-0.7.0 folder again
- sudo make
- chmod -R 775 the whole thrift-0.7.0 folder again
- sudo make install
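Once make install finishes, it is worth confirming the toolchain versions before moving on, since Elephant-Bird is sensitive to exact Protocol Buffer and Thrift versions. A minimal shell helper for that check (the version strings are the ones this post installs; adjust them if yours differ):

```shell
# Assert a tool's version output contains the required version string.
# Usage: check_version "<actual version output>" "<required substring>"
check_version() {
  case "$1" in
    *"$2"*) echo "ok" ;;
    *) echo "mismatch: wanted $2, got: $1" ;;
  esac
}

# e.g. check_version "$(thrift -version)"  "0.7.0"
#      check_version "$(protoc --version)" "2.4.1"
```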
With the environment ready, clone Elephant-Bird:
- brew install maven (needed to build the mvn project)
- go into the elephant-bird project folder
- mvn package -DskipTests (skipping the unit tests makes it faster)
After building elephant-bird you get
- elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar
- elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
- elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar
these three jar files.
Put all three (using my local Mac Hive as an example) under
/usr/local/Cellar/hive/0.14.0/libexec/lib
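Copying the three jars can be scripted. A small sketch, assuming the elephant-bird checkout and Hive lib paths shown above (adjust them for your layout):

```shell
# Stage the three Elephant-Bird jars into Hive's lib directory.
# Usage: stage_eb_jars <elephant-bird checkout dir> <hive lib dir>
stage_eb_jars() {
  src="$1"; dest="$2"
  for jar in core/target/elephant-bird-core-4.6-SNAPSHOT.jar \
             hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar \
             hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar; do
    cp "$src/$jar" "$dest/" || return 1
  done
}

# e.g. stage_eb_jars ~/elephant-bird /usr/local/Cellar/hive/0.14.0/libexec/lib
```

Alternatively, Hive's ADD JAR statement can load the jars for a single session instead of copying them into the lib directory.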
With that in place, Hive can create tables from a Protocol Buffer schema and read the Protocol Buffer raw data on HDFS into them.
You can refer to my example of creating a Protocol Buffer table in Hive:
Example: building a protocol buffer jar and putting it into Hive
- Protocol Buffer definition used: addressbook.proto
package tutorial;
option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
- compile the proto file to java
- in the folder containing addressbook.proto, run
- protoc -I=. --java_out=. ./addressbook.proto
- compile the java file to class files
- first make sure protobuf-java-2.4.1.jar is on your environment's classpath
- (if it isn't, add export CLASSPATH="path to protobuf-java-2.4.1.jar")
- then go into com/example/tutorial
- javac AddressBookProtos.java
- this produces a pile of AddressBookProtos-related class files
- build the jar file
- go back to the top level (the folder above com)
- jar cf addressbook.jar com
- put addressbook.jar under Hive's
- /usr/local/Cellar/hive/0.14.0/libexec/lib
- you can check whether the jar contains the classes you need with jar tf addressbook.jar:
META-INF/
META-INF/MANIFEST.MF
com/
com/example/
com/example/tutorial/
com/example/tutorial/AddressBookProtos$1.class
com/example/tutorial/AddressBookProtos$AddressBook$Builder.class
com/example/tutorial/AddressBookProtos$AddressBook.class
com/example/tutorial/AddressBookProtos$AddressBookOrBuilder.class
com/example/tutorial/AddressBookProtos$Person$Builder.class
com/example/tutorial/AddressBookProtos$Person$PhoneNumber$Builder.class
com/example/tutorial/AddressBookProtos$Person$PhoneNumber.class
com/example/tutorial/AddressBookProtos$Person$PhoneNumberOrBuilder.class
com/example/tutorial/AddressBookProtos$Person$PhoneType$1.class
com/example/tutorial/AddressBookProtos$Person$PhoneType.class
com/example/tutorial/AddressBookProtos$Person.class
com/example/tutorial/AddressBookProtos$PersonOrBuilder.class
com/example/tutorial/AddressBookProtos.class
com/example/tutorial/AddressBookProtos.java
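That eyeball check can also be scripted. A small helper sketch that scans a jar tf listing on stdin for the outer class the serde will need (the class name matches the addressbook example above):

```shell
# Check a `jar tf` listing (read from stdin) for the serialization class.
# grep -F treats the pattern as a fixed string, so the $ in the inner-class
# name is matched literally.
jar_has_class() {
  grep -qF 'com/example/tutorial/AddressBookProtos$AddressBook.class'
}

# usage: jar tf addressbook.jar | jar_has_class && echo "packaged ok"
```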
Then go into the Hive console and you can create the Protocol Buffer table:
create external table addressbook
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
stored as
inputformat "org.apache.hadoop.mapred.SequenceFileInputFormat"
outputformat "org.apache.hadoop.mapred.SequenceFileOutputFormat" ;
(PS: note that the inputformat here is org.apache.hadoop.mapred.SequenceFileInputFormat;
as mentioned in the previous Camus post, the files were stored in SequenceFile format.
You can use your own file format here, as long as it matches the format the raw data on HDFS was originally written in.)
It should print:
OK
Time taken: 0.078 seconds
Then run the following command:
describe addressbook;
OK
person array<struct<name:string,id:int,email:string,phone:array<struct<number:string,type:string>>>> from deserializer
Time taken: 0.495 seconds, Fetched: 1 row(s)
You can see that the Hive table mirrors the nested field structure of the Protocol Buffer.
Then you can read the HDFS data in with the LOAD command:
LOAD DATA INPATH 'hdfs data path' OVERWRITE INTO TABLE addressbook;
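Once the data is loaded, the nested person array can be flattened for querying with Hive's LATERAL VIEW explode. A hypothetical follow-up query, written to a file so it can be run with hive -f:

```shell
# Write a HiveQL query that flattens the repeated `person` field into rows,
# then run it with: hive -f flatten_addressbook.hql
cat > flatten_addressbook.hql <<'EOF'
SELECT p.name, p.id, p.email
FROM addressbook
LATERAL VIEW explode(person) t AS p;
EOF
```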
That is roughly the basic Hive + Elephant-Bird setup process.
Handling BigDecimal fields in Protocol Buffers takes further work; that will be covered in a follow-up post.
Elephant-Bird also has some map-reduce issues to sort out; I will share the problems I ran into in another post as well.
To be continued.