Saturday, August 12, 2017

Big Data (2): Hadoop Ecosystem 2

Last time [1] we saw that the mainstream toolset for big data analytics is the Hadoop Ecosystem, and that the core of Hadoop is MapReduce and HDFS.

This post introduces several more components commonly used in practice: HBase, Hive, Pig, and Mahout.


Fig.1. Hadoop Ecosystem, p. 10 [2].

Fig.2. A brief description of the Hadoop Ecosystem, p. 10 [2].


Figure 1, taken from [2], is another overview of the Hadoop Ecosystem. The accompanying text descriptions are reproduced at the end of this post.

The left side of Figure 1 shows how data enters Hadoop; the right side shows the output paths once the data has been analyzed.

In the middle are the ecosystem's main components; note that MapReduce and HDFS are highlighted in gray.


1. HBase

HBase follows the design of BigTable; both are NoSQL key-value databases. Key-value is like a dictionary: the key is like a word, and the value is its definition, example sentences, or even images. The content is very flexible.

"Simply put, HBase, like BigTable, differs from the row-oriented storage used by typical database systems: both are column-oriented stores. The advantage of column-oriented storage is that each record can hold a variable set of fields, whereas in the row-oriented approach, adding a new column takes extra effort." [3]

"Key-value databases are the largest category of NoSQL databases. Their defining characteristic is the key-value data model, which drops the fixed column schema of relational databases; each record stands on its own, enabling distribution and high scalability. Google's BigTable, Hadoop's HBase, Amazon's Dynamo, Cassandra, and Hypertable all belong to this category of key-value databases." [4]
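Purely as an illustration of the flexible-schema idea described above (this is not HBase's actual API, and real HBase additionally groups columns into column families with timestamps), a key-value store where each row carries its own set of columns can be sketched with nested dictionaries:

```python
# Toy sketch of a key-value store with per-row column sets.
# Illustrative only; not the HBase client API.
table = {}

def put(row_key, column, value):
    # Each row holds its own independent columns (no fixed schema).
    table.setdefault(row_key, {})[column] = value

def get(row_key, column=None):
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

put("user:1", "name", "Alice")
put("user:1", "email", "alice@example.com")
put("user:2", "name", "Bob")
put("user:2", "avatar", b"\x89PNG...")  # a completely different column set

print(get("user:1", "name"))     # Alice
print(sorted(get("user:2")))     # ['avatar', 'name']
```

The point is that "user:1" and "user:2" need not share any columns, which is exactly what the row-oriented, fixed-schema model makes awkward.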


2. Hive, Pig, and SQL

As described above, HBase is the NoSQL database on Hadoop. Hive and Pig, in turn, use two languages, HiveQL and Pig Latin, to "wrap" MapReduce / HDFS so that it can be used like a SQL database.
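To make "wrapping MapReduce" concrete: both Hive and Pig compile their statements into MapReduce jobs. The sketch below mimics the map, shuffle, and reduce phases of a word count in plain single-process Python; it is not actual Hadoop code, just the shape of what a HiveQL GROUP BY or a Pig script ultimately runs in parallel:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: each "mapper" emits (word, 1) pairs from its input split.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each "reducer" aggregates the values for its keys.
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop hive pig", "hive hive"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 1, 'hive': 3, 'pig': 1}
```

Writing even this small pipeline by hand is more work than a one-line query, which is the convenience Hive and Pig provide.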

For the differences between Hive, Pig, and SQL, see the excellent article [5]; the key points from its Chinese translation [6] are excerpted below.

2.1. Hive

When to use Apache Hive

"Sometimes we need to collect data over a period of time for analysis, and Hive is an excellent tool for analyzing historical data. Note that the data must have some structure for Hive to be used to full effect. Hive is not well suited to real-time analysis, since it cannot meet the speed requirements (for real-time analysis, use HBase, which is what Facebook uses)." [6]

2.2. Pig

When to use Apache Pig

"When you need to process unstructured, distributed data sets and want to make the most of your SQL background, Pig is a good choice. With Pig you do not have to build MapReduce jobs yourself; it is easy to pick up for those with a SQL background, and development is fast." [6]


3. Mahout

Mahout implements many machine-learning algorithms, such as classification, clustering, and recommender systems.
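Mahout itself is a Java library; purely as an illustration of one of its use cases (recommendation mining), here is a toy co-occurrence recommender in Python, with made-up purchase data, that suggests items owned by users with similar histories:

```python
from collections import Counter

# Hypothetical purchase histories (made-up data, for illustration only).
histories = {
    "alice": {"book", "pen"},
    "bob":   {"book", "pen", "ink"},
    "carol": {"book", "ink"},
}

def recommend(user, histories):
    # Score each candidate item by how strongly the users who own it
    # overlap with this user, then drop items the user already owns.
    owned = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(owned & items)
        for item in items - owned:
            scores[item] += overlap
    return [item for item, _ in scores.most_common()]

print(recommend("alice", histories))  # ['ink']
```

Mahout's actual recommenders use the same basic idea at scale, computed as MapReduce jobs over much larger data.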



This post has briefly introduced several big-data services built on top of MapReduce / HDFS: the NoSQL store HBase, the SQL-like Hive and Pig, and the machine-learning library Mahout. I hope it gives readers a first look at the Hadoop Ecosystem.


Hadoop Ecosystem

Hadoop common
Hadoop common is a collection of components and interfaces for the foundation of
Hadoop-based Big Data platforms. It provides the following components:
- Distributed filesystem and I/O operation interfaces
- General parallel computation interfaces
- Security management

Apache HBase
Apache HBase is an open source, distributed, versioned, and column-oriented data store.
It was built on top of Hadoop and HDFS. HBase supports random, real-time access to Big
Data. It can scale to host very large tables, containing billions of rows and millions of columns.

Apache Mahout
Apache Mahout is an open source scalable machine learning library based on Hadoop. It has
a very active community and is still under development. Currently, the library supports four use
cases: recommendation mining, clustering, classification, and frequent item set mining.

Apache Pig
Apache Pig is a high-level system for expressing Big Data analysis programs. It supports Big
Data by compiling the Pig statements into a sequence of MapReduce jobs. Pig uses Pig Latin
as the programming language, which is extensible and easy to use.

Apache Hive
Apache Hive is a high-level system for the management and analysis of Big Data stored in
Hadoop-based systems. It uses a SQL-like language called HiveQL. Similar to Apache Pig, the
Hive runtime engine translates HiveQL statements into a sequence of MapReduce jobs for
execution.

Apache ZooKeeper
Apache ZooKeeper is a centralized coordination service for large scale distributed systems. It
maintains the configuration and naming information and provides distributed synchronization
and group services for applications in distributed systems.

Apache Oozie
Apache Oozie is a scalable workflow management and coordination service for Hadoop
jobs. It is data aware and coordinates jobs based on their dependencies. In addition, Oozie
has been integrated with Hadoop and can support all types of Hadoop jobs.

Apache Sqoop
Apache Sqoop is a tool for moving data between Apache Hadoop and structured data stores
such as relational databases. It provides command-line suites to transfer data from relational
database to HDFS and vice versa. More information about Apache Sqoop can be found at

Apache Flume
Apache Flume is a tool for collecting log data in distributed systems. It has a flexible
yet robust and fault tolerant architecture that streams data from log servers to Hadoop.

Apache Avro
Apache Avro is a fast, feature rich data serialization system for Hadoop. The serialized data
is coupled with the data schema, which facilitates its processing with different programming
languages.



[1] Big Data (1): Hadoop Ecosystem 1

[2] Hadoop Operations and Cluster Management Cookbook, 2013, pp. 12-13

[3] HBase 介紹, Hadoop Taiwan User Group

[4] 快速認識4類主流NoSQL資料庫, iThome

[5] Pig vs Hive vs SQL – Difference between the Big Data Tools, Hadoop360

[6] 對比Pig、Hive和SQL,淺看大數據工具之間的差異
