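This talk was also given by Cloudera engineers, Kathleen Ting and Arvind Prabhakar, who took turns presenting (I have no idea why they alternated through a single slide deck). The title was "Mastering Sqoop for Data Transfer for Big Data", but most of the content covered the shortcomings of Sqoop 1 and how Sqoop 2, which they are currently developing, is expected to fix them.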
Sqoop's main job is SQL to Hadoop and vice versa. The characteristics of Sqoop 1 are:
- Based on connectors
- Responsible for metadata lookup and data transport
- Majority of connectors are JDBC based
- Connectors are responsible for all supported functionality
- HBase import, Avro support
The challenges Sqoop 1 currently faces:
- Only generic JDBC is used, not each platform's specific driver
  - For example, the optimizations provided by Oracle's driver
- Security problems
  - Account names and passwords are recorded in plain text, or even set in the config file
- Type mapping is not clearly defined
- Client needs access to Hadoop binaries, configuration, and the database
- JDBC model is enforced (which can make it hard to fit the Hadoop model)
- Non-uniform functionality
  - Different connectors support different capabilities
- Overlapping / duplicated functionality
  - Different connectors may implement the same capabilities differently
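To make the plain-text credential problem concrete, here is a minimal sketch of how a Sqoop 1 import is typically assembled on a client machine. The JDBC URL, user name, table, and paths are made-up placeholders, and the helper `build_sqoop1_import` is hypothetical; `--connect`, `--username`, `--password`, `--table`, and `--target-dir` are Sqoop 1 import options, and `--password-file` is available in later Sqoop 1 releases. The command is only built and printed, not executed.

```python
import shlex


def build_sqoop1_import(jdbc_url, user, table, target_dir,
                        password=None, password_file=None):
    """Build (but do not run) the argument list for a Sqoop 1 import."""
    argv = ["sqoop", "import",
            "--connect", jdbc_url,
            "--username", user,
            "--table", table,
            "--target-dir", target_dir]
    if password is not None:
        # Plain-text password: it ends up in shell history and process listings.
        argv += ["--password", password]
    elif password_file is not None:
        # Password kept in a file; whoever runs the job must still be able to read it.
        argv += ["--password-file", password_file]
    return argv


# Hypothetical values, for illustration only.
leaky = build_sqoop1_import("jdbc:mysql://db.example.com/sales", "etl_user",
                            "orders", "/data/orders", password="s3cret")
safer = build_sqoop1_import("jdbc:mysql://db.example.com/sales", "etl_user",
                            "orders", "/data/orders",
                            password_file="/user/etl/.db-password")

print(shlex.join(leaky))
print(shlex.join(safer))
```

Either way, the person running the transfer has to hold the database credentials themselves, which is the point the speakers were making.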
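To address these problems, Cloudera has led the start of Sqoop 2 (CDH4). The main change in this version is an architecture with a server acting as the intermediary, with a lot of state stored on the server side. The goal is to turn Sqoop from a tool into a service. The overall architecture is shown in the figure below:
Image source: ApacheCon 2013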
Design goals
- Ease of use
  - Uniform functionality (no specific table needed?)
  - Domain-specific interactions
- Ease of extension
  - No low-level Hadoop knowledge needed
  - No functional overlap between connectors
- Security and separation of concerns
  - Role-based access and use
Sqoop 2 separates two kinds of metadata (connection vs. job metadata):
- Connection (distinct per database)
- Job (distinct per table)
- Connectors register metadata
- Metadata enables creation of connections and jobs
- Connections and jobs are stored in the metadata repository
- Operator runs jobs that use the appropriate connections
- Admin sets policy for connection use
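I did not fully understand this at the time, so I checked with the speaker. She said the connection is created by the admin, and the operator then uses that connection to run jobs, so the account and password never have to be handed out.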
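The connection/job split can be sketched as a tiny in-memory metadata repository. This is only my reading of the slides, not Sqoop 2's actual API: the class and method names, the connector name `generic-jdbc`, and the input field names are all hypothetical. A connector registers which inputs it needs, and the repository uses that metadata to validate and store one connection per database and one job per table.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Connection:
    id: int
    connector: str
    inputs: dict  # per-database settings (URL, credentials, ...)


@dataclass
class Job:
    id: int
    connection_id: int
    inputs: dict  # per-table settings (table name, operation, ...)


class MetadataRepository:
    """Hypothetical stand-in for the server-side metadata repository."""

    def __init__(self):
        self._connectors = {}
        self.connections = {}
        self.jobs = {}
        self._ids = count(1)

    def register_connector(self, name, connection_inputs, job_inputs):
        # The connector declares which inputs connections and jobs must supply.
        self._connectors[name] = (set(connection_inputs), set(job_inputs))

    def create_connection(self, connector, **inputs):
        required, _ = self._connectors[connector]
        self._require(required, inputs)
        conn = Connection(next(self._ids), connector, inputs)
        self.connections[conn.id] = conn
        return conn

    def create_job(self, connection_id, **inputs):
        conn = self.connections[connection_id]
        _, required = self._connectors[conn.connector]
        self._require(required, inputs)
        job = Job(next(self._ids), connection_id, inputs)
        self.jobs[job.id] = job
        return job

    @staticmethod
    def _require(required, inputs):
        missing = required - set(inputs)
        if missing:
            raise ValueError(f"missing inputs: {sorted(missing)}")


repo = MetadataRepository()
repo.register_connector("generic-jdbc",
                        connection_inputs=["jdbc_url", "username", "password"],
                        job_inputs=["table", "operation"])
conn = repo.create_connection("generic-jdbc",
                              jdbc_url="jdbc:mysql://db.example.com/sales",
                              username="etl_user", password="s3cret")
orders = repo.create_job(conn.id, table="orders", operation="import")
customers = repo.create_job(conn.id, table="customers", operation="import")
```

Both jobs point at the same stored connection, so the database details are entered once and reused per table.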
Support for secure access to external systems via role-based access to connection objects:
- Administrators create/edit/delete connections
- Operators use connections (session based? This still needs to be confirmed)
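A rough sketch of how I picture this role split, assuming a simple permission table on the server. The `Role` enum, the `ConnectionStore` class, and the action names are my own invention for illustration and are not taken from Sqoop 2; the transfer itself is simulated with a status string.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    OPERATOR = "operator"


# Which actions each role may perform on connection objects (per the slide).
ALLOWED = {
    Role.ADMIN: {"create", "edit", "delete"},
    Role.OPERATOR: {"use"},
}


class ConnectionStore:
    """Server-side store; credentials stay inside this object."""

    def __init__(self):
        self._connections = {}

    def _check(self, role, action):
        if action not in ALLOWED[role]:
            raise PermissionError(f"{role.value} may not {action} connections")

    def create(self, role, name, jdbc_url, username, password):
        self._check(role, "create")
        self._connections[name] = {"jdbc_url": jdbc_url,
                                   "username": username,
                                   "password": password}

    def delete(self, role, name):
        self._check(role, "delete")
        del self._connections[name]

    def use(self, role, name, table):
        """Simulate a transfer; only a status string goes back to the caller."""
        self._check(role, "use")
        conn = self._connections[name]
        return f"imported {table} via {conn['jdbc_url']} as {conn['username']}"


store = ConnectionStore()
store.create(Role.ADMIN, "sales-db", "jdbc:mysql://db.example.com/sales",
             "etl_user", "s3cret")
result = store.use(Role.OPERATOR, "sales-db", "orders")
print(result)
```

The operator only ever refers to the connection by name, so the password stays on the server side.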
Usability & Extensibility:
- Connections and jobs use domain-specific inputs (tables, operations, etc.)
- Domain isolation, and thus easy to understand and use
- Connectors work with an intermediate data format
- Any downstream functionality needed is provided by the Sqoop framework
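The last two points can be illustrated with a small sketch: the connector only turns rows into an intermediate representation, and the framework takes care of whatever output is needed downstream. Using a CSV line as the intermediate format is my assumption for this example, and every function name here is hypothetical.

```python
import csv
import io
import json


def to_intermediate(row):
    """Connector side: encode one record as a CSV line (assumed format)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(row)
    return buf.getvalue()


def from_intermediate(line):
    """Framework side: decode one CSV line back into a list of strings."""
    return next(csv.reader([line]))


def fake_extract():
    """Stand-in for a connector reading rows from a database."""
    yield (1, "alice")
    yield (2, "bob, jr.")


def framework_write(lines, columns, fmt):
    """Framework side: produce the requested output from intermediate lines."""
    if fmt == "text":
        return "\n".join(lines)
    if fmt == "json":
        return "\n".join(
            json.dumps(dict(zip(columns, from_intermediate(line))))
            for line in lines
        )
    raise ValueError(f"unsupported format: {fmt}")


lines = [to_intermediate(row) for row in fake_extract()]
print(framework_write(lines, ["id", "name"], "text"))
print(framework_write(lines, ["id", "name"], "json"))
```

The connector never has to know about the output formats, which is how overlap between connectors is avoided in this picture.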
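Finally, a little plug for them: the Sqoop 2 project is still short on people, so if you join now you have a pretty good chance of becoming a committer~ XD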
[1] Apache Sqoop: Highlights of Sqoop 2
[2] Sqoop 2 Apache wiki
