Monday, February 23, 2015

Best Practices for Elephant-Bird and Hive with Protocol Buffers

Introduction
After Camus lands the data in HDFS in our data pipeline, the next step is to feed it into Hive for analysis. Since the raw data in HDFS is all in Protocol Buffer format, we can use the Hive-specific Protocol Buffer deserializer that Elephant-Bird implements directly. The only catch is that Protocol Buffers have no native BigDecimal type, so that part has to be implemented yourself, which means hacking into the source code.



Install Hive

brew install hive
(but you need to unlink apache-spark first; they conflict)
(the version I installed at the time was 0.14.0)

Build the Elephant-Bird jars

The Quickstart in the official repo (https://github.com/twitter/elephant-bird) is fairly clear and concise.

Steps 3, 4, and 5 are straightforward as long as the environment required by steps 1 and 2 is in place,
but steps 1 and 2 cost me quite a lot of time.
If you run into trouble, you can refer to how I handled steps 1 and 2 below.

Environment: MacBook Air, OS X Yosemite 10.10.2
  1. Protocol Buffer 2.4.1
    • brew install protobuf241 
  2. Apache Thrift 0.7.0
    1. brew install boost
    2. brew install autoconf automake libtool pkg-config libevent
    3. download thrift 0.7.0 (https://archive.apache.org/dist/thrift/0.7.0/)
    4. unpack it  
      • tar -xvf thrift-0.7.0.tar.gz
    5. chmod 775 the whole thrift-0.7.0 folder
    6. cd into the thrift-0.7.0 folder
    7. sudo bash configure --with-boost=/usr/local/Cellar --with-libevent=/usr/local/Cellar --without-lua --without-php --without-cpp --without-c_glib --without-python --without-ruby --without-perl
      • (this solves: src/concurrency/ThreadManager.h:24:10: fatal error: 'tr1/functional' file not found)
    8. chmod 775 the whole thrift-0.7.0 folder again
    9. sudo make
    10. chmod 775 the whole thrift-0.7.0 folder again
    11. sudo make install

With the environment ready, clone Elephant-Bird:
    1. brew install maven (needed to build the Maven project)
    2. cd into the elephant-bird project folder
    3. mvn package -DskipTests (skipping the unit tests makes it faster)
After building elephant-bird you get these three jars:
  • elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar
  • elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
  • elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar
Put these three jars (using my local Mac Hive as an example) under
/usr/local/Cellar/hive/0.14.0/libexec/lib
and Hive will be able to create tables based on the Protocol Buffer schema and read the Protocol Buffer raw data in HDFS into those tables.
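If you prefer not to copy jars into Hive's lib directory, they can also be registered per session from the Hive console with ADD JAR (a sketch; replace /path/to with the actual build output paths above):

```sql
-- Register the Elephant-Bird jars for the current Hive session only
ADD JAR /path/to/elephant-bird-core-4.6-SNAPSHOT.jar;
ADD JAR /path/to/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar;
ADD JAR /path/to/elephant-bird-hive-4.6-SNAPSHOT.jar;
```

Jars added this way last only for the session, whereas jars dropped into libexec/lib are picked up every time Hive starts.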

Here is an example of how I created a Protocol Buffer table in Hive:

Example of building a protocol buffer jar and putting it into Hive

  • Protocol Buffer schema used: addressbook.proto
 package tutorial;  
 option java_package = "com.example.tutorial";  
 option java_outer_classname = "AddressBookProtos";  
 message Person {  
  required string name = 1;  
  required int32 id = 2;  
  optional string email = 3;  
  enum PhoneType {  
   MOBILE = 0;  
   HOME = 1;  
   WORK = 2;  
  }  
  message PhoneNumber {  
   required string number = 1;  
   optional PhoneType type = 2 [default = HOME];  
  }  
  repeated PhoneNumber phone = 4;  
 }  
 message AddressBook {  
  repeated Person person = 1;  
 }  
  • compile the proto file to Java
    • run the following in the directory containing addressbook.proto
    • protoc -I=. --java_out=. ./addressbook.proto
  • compile the Java file to classes
    • first make sure protobuf-java-2.4.1.jar is in your environment
    • (if not, add export CLASSPATH="path to protobuf-java-2.4.1.jar")
    • then cd into com/example/tutorial
    • javac AddressBookProtos.java
    • this produces a pile of AddressBookProtos-related class files
  • build the jar
    • go back to the top level (the folder above com)
    • jar cf addressbook.jar com
  • put addressbook.jar into Hive's
    • /usr/local/Cellar/hive/0.14.0/libexec/lib
    • you can run jar tf addressbook.jar to check whether the jar contains the classes you need
Normally it shows:
 META-INF/  
 META-INF/MANIFEST.MF  
 com/  
 com/example/  
 com/example/tutorial/  
 com/example/tutorial/AddressBookProtos$1.class  
 com/example/tutorial/AddressBookProtos$AddressBook$Builder.class  
 com/example/tutorial/AddressBookProtos$AddressBook.class  
 com/example/tutorial/AddressBookProtos$AddressBookOrBuilder.class  
 com/example/tutorial/AddressBookProtos$Person$Builder.class  
 com/example/tutorial/AddressBookProtos$Person$PhoneNumber$Builder.class  
 com/example/tutorial/AddressBookProtos$Person$PhoneNumber.class  
 com/example/tutorial/AddressBookProtos$Person$PhoneNumberOrBuilder.class  
 com/example/tutorial/AddressBookProtos$Person$PhoneType$1.class  
 com/example/tutorial/AddressBookProtos$Person$PhoneType.class  
 com/example/tutorial/AddressBookProtos$Person.class  
 com/example/tutorial/AddressBookProtos$PersonOrBuilder.class  
 com/example/tutorial/AddressBookProtos.class  
 com/example/tutorial/AddressBookProtos.java  
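The compile-and-package steps above can also be sketched as one shell sequence (the protobuf jar and Hive lib paths below are examples for a Homebrew install; adjust them to your machine):

```shell
# Example paths -- adjust to your installation
PROTO_JAR="/usr/local/Cellar/protobuf241/2.4.1/libexec/protobuf-java-2.4.1.jar"
HIVE_LIB="/usr/local/Cellar/hive/0.14.0/libexec/lib"

protoc -I=. --java_out=. ./addressbook.proto                        # proto -> Java
javac -cp "$PROTO_JAR" com/example/tutorial/AddressBookProtos.java  # Java -> classes
jar cf addressbook.jar com                                          # classes -> jar
jar tf addressbook.jar                                              # verify contents
cp addressbook.jar "$HIVE_LIB"                                      # make it visible to Hive
```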

Then open the Hive console and you can create the Protocol Buffer table:
 create external table addressbook  
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"  
  with serdeproperties (  
   "serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")  
  stored as  
   inputformat "org.apache.hadoop.mapred.SequenceFileInputFormat"  
   outputformat "org.apache.hadoop.mapred.SequenceFileOutputFormat" ;  
(PS:
note that the inputformat is org.apache.hadoop.mapred.SequenceFileInputFormat,
because, as mentioned in the earlier Camus post, the files were stored in SequenceFile format;
you can use your own file format here, as long as it matches the format the raw data was originally stored in on HDFS)

It should show:
 OK  
 Time taken: 0.078 seconds  

Then you can run:
 describe addressbook;   
 OK   
 person      array<struct<name:string,id:int,email:string,phone:array<struct<number:string,type:string>>>>   from deserializer   
 Time taken: 0.495 seconds, Fetched: 1 row(s)  

You can see how the Hive table maps to the nested field structure of the Protocol Buffer schema.

Then you can read in the HDFS data with the LOAD command:
 LOAD DATA INPATH 'hdfs data path' OVERWRITE INTO TABLE addressbook;  
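Once the data is loaded, the nested person array can be queried by flattening it with LATERAL VIEW explode (a sketch; the column names follow the describe output above):

```sql
-- Flatten the repeated Person messages into one row each
SELECT p.name, p.id, p.email
FROM addressbook
LATERAL VIEW explode(person) t AS p;
```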

That is roughly the basic setup process for Hive + Elephant-Bird.
Handling BigDecimal fields in Protocol Buffers takes further work and will be covered in a follow-up post.
Elephant-Bird also has some MapReduce issues to work around; I will share the problems I ran into in yet another post.
To be continued.


