After Camus lands the data on HDFS in our data pipeline, the next step is to expose it to Hive for analysis. Since the raw data on HDFS is all in Protocol Buffer format, Elephant-Bird's ready-made Hive Protocol Buffer deserializer can be used directly. The only catch is that Protocol Buffers has no native BigDecimal type, so you have to implement that yourself, which means hacking the source code.
Install Hive
brew install hive
(but you need to unlink apache-spark first, since the two formulas conflict)
(the version I installed at the time was 0.14.0)
Build Elephant-Bird jar
The Quickstart in the official repo (https://github.com/twitter/elephant-bird) is fairly clear and concise.
Items 3, 4, and 5 are straightforward, as long as the environment required by items 1 and 2 is in place,
but steps 1 and 2 cost me quite a bit of time.
If you run into trouble, you can refer to how I handled steps 1 and 2 below.
Environment: MacBook Air, OS X Yosemite 10.10.2
- Protocol Buffer 2.4.1
- brew install protobuf241
- Apache Thrift 0.7.0
- brew install boost
- brew install autoconf automake libtool pkg-config libevent
- download thrift 0.7.0 (https://archive.apache.org/dist/thrift/0.7.0/)
- unpack it
- tar -xvf thrift-0.7.0.tar.gz
- chmod -R 775 the whole thrift-0.7.0 folder
- go into the thrift-0.7.0 folder
- sudo bash configure --with-boost=/usr/local/Cellar --with-libevent=/usr/local/Cellar --without-lua --without-php --without-cpp --without-c_glib --without-python --without-ruby --without-perl
- (this works around: src/concurrency/ThreadManager.h:24:10: fatal error: 'tr1/functional' file not found)
- chmod -R 775 the whole thrift-0.7.0 folder again
- sudo make
- chmod -R 775 the whole thrift-0.7.0 folder again
- sudo make install
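Once make install finishes, it is worth confirming the toolchain versions before moving on, since Elephant-Bird is sensitive to exact Protocol Buffer and Thrift versions. A minimal shell helper for that check (the version strings are the ones this post installs; adjust them if yours differ):

```shell
# Assert a tool's version output contains the required version string.
# Usage: check_version "<actual version output>" "<required substring>"
check_version() {
  case "$1" in
    *"$2"*) echo "ok" ;;
    *) echo "mismatch: wanted $2, got: $1" ;;
  esac
}

# e.g. check_version "$(thrift -version)"  "0.7.0"
#      check_version "$(protoc --version)" "2.4.1"
```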
With the environment ready, clone Elephant-Bird:
- brew install maven (needed to build the mvn project)
- go into the elephant-bird project folder
- mvn package -DskipTests (skipping the unit tests makes it faster)
After building elephant-bird you get
- elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar
- elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
- elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar
these three jar files.
Put all three (using my local Mac Hive as an example) under
/usr/local/Cellar/hive/0.14.0/libexec/lib
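Copying the three jars can be scripted. A small sketch, assuming the elephant-bird checkout and Hive lib paths shown above (adjust them for your layout):

```shell
# Stage the three Elephant-Bird jars into Hive's lib directory.
# Usage: stage_eb_jars <elephant-bird checkout dir> <hive lib dir>
stage_eb_jars() {
  src="$1"; dest="$2"
  for jar in core/target/elephant-bird-core-4.6-SNAPSHOT.jar \
             hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar \
             hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar; do
    cp "$src/$jar" "$dest/" || return 1
  done
}

# e.g. stage_eb_jars ~/elephant-bird /usr/local/Cellar/hive/0.14.0/libexec/lib
```

Alternatively, Hive's ADD JAR statement can load the jars for a single session instead of copying them into the lib directory.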
With that in place, Hive can create tables from a Protocol Buffer schema and read the Protocol Buffer raw data on HDFS into them.
You can refer to my example of creating a Protocol Buffer table in Hive:
Example: building a protocol buffer jar and putting it into Hive
- Protocol Buffer definition used: addressbook.proto
package tutorial;
option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
- compile the proto file to java
- in the folder containing addressbook.proto, run
- protoc -I=. --java_out=. ./addressbook.proto
- compile the java file to class files
- first make sure protobuf-java-2.4.1.jar is on your environment's classpath
- (if it isn't, add export CLASSPATH="path to protobuf-java-2.4.1.jar")
- then go into com/example/tutorial
- javac AddressBookProtos.java
- this produces a pile of AddressBookProtos-related class files
- build the jar file
- go back to the top level (the folder above com)
- jar cf addressbook.jar com
- put addressbook.jar under Hive's
- /usr/local/Cellar/hive/0.14.0/libexec/lib
- you can check whether the jar contains the classes you need with jar tf addressbook.jar:
META-INF/
META-INF/MANIFEST.MF
com/
com/example/
com/example/tutorial/
com/example/tutorial/AddressBookProtos$1.class
com/example/tutorial/AddressBookProtos$AddressBook$Builder.class
com/example/tutorial/AddressBookProtos$AddressBook.class
com/example/tutorial/AddressBookProtos$AddressBookOrBuilder.class
com/example/tutorial/AddressBookProtos$Person$Builder.class
com/example/tutorial/AddressBookProtos$Person$PhoneNumber$Builder.class
com/example/tutorial/AddressBookProtos$Person$PhoneNumber.class
com/example/tutorial/AddressBookProtos$Person$PhoneNumberOrBuilder.class
com/example/tutorial/AddressBookProtos$Person$PhoneType$1.class
com/example/tutorial/AddressBookProtos$Person$PhoneType.class
com/example/tutorial/AddressBookProtos$Person.class
com/example/tutorial/AddressBookProtos$PersonOrBuilder.class
com/example/tutorial/AddressBookProtos.class
com/example/tutorial/AddressBookProtos.java
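That eyeball check can also be scripted. A small helper sketch that scans a jar tf listing on stdin for the outer class the serde will need (the class name matches the addressbook example above):

```shell
# Check a `jar tf` listing (read from stdin) for the serialization class.
# grep -F treats the pattern as a fixed string, so the $ in the inner-class
# name is matched literally.
jar_has_class() {
  grep -qF 'com/example/tutorial/AddressBookProtos$AddressBook.class'
}

# usage: jar tf addressbook.jar | jar_has_class && echo "packaged ok"
```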
Then go into the Hive console and you can create the Protocol Buffer table:
create external table addressbook
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
stored as
inputformat "org.apache.hadoop.mapred.SequenceFileInputFormat"
outputformat "org.apache.hadoop.mapred.SequenceFileOutputFormat" ;
(PS: note that the inputformat here is org.apache.hadoop.mapred.SequenceFileInputFormat;
as mentioned in the previous Camus post, the files were stored in SequenceFile format.
You can use your own file format here, as long as it matches the format the raw data on HDFS was originally written in.)
It should print:
OK
Time taken: 0.078 seconds
Then run the following command:
describe addressbook;
OK
person array<struct<name:string,id:int,email:string,phone:array<struct<number:string,type:string>>>> from deserializer
Time taken: 0.495 seconds, Fetched: 1 row(s)
You can see that the Hive table mirrors the nested field structure of the Protocol Buffer.
Then you can read the HDFS data in with the LOAD command:
LOAD DATA INPATH 'hdfs data path' OVERWRITE INTO TABLE addressbook;
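Once the data is loaded, the nested person array can be flattened for querying with Hive's LATERAL VIEW explode. A hypothetical follow-up query, written to a file so it can be run with hive -f:

```shell
# Write a HiveQL query that flattens the repeated `person` field into rows,
# then run it with: hive -f flatten_addressbook.hql
cat > flatten_addressbook.hql <<'EOF'
SELECT p.name, p.id, p.email
FROM addressbook
LATERAL VIEW explode(person) t AS p;
EOF
```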
That is roughly the basic Hive + Elephant-Bird setup process.
Handling BigDecimal fields in Protocol Buffers takes further work; that will be covered in a follow-up post.
Elephant-Bird also has some map-reduce issues to sort out; I will share the problems I ran into in another post as well.
To be continued.