Best Practices for Amazon EMR - S3 vs HDFS

2013年12月17日星期二

Best Practices for Amazon EMR - S3 vs HDFS

前陣子在看Amazon 出的 white paper：Best Practices for Amazon EMR 的文章時，剛好看到這段 - Data Aggregation Best Practice ，剛好講到在跑MapReduce 使用S3 和HDFS會有怎樣本質上的差異，剛好拿來當做我之前寫的 Object based storage or block device storage 的最明確的實際應用案例。

Map Reduce 在運行的時候，會起幾個Map Task 是根據檔案會被切分成多少個Split來決定，而理論上被切分成多少個Split又是由HDFS的 Block Size來決定，會把一個大檔案分成幾個Chunk，這一切之所以能這樣運行是因為HDFS本身設計就是Block based File system，而理論上一個chunk就剛好是一個HDFS的Block，下圖就是把一個512MB file ，以128MB為一個Chunk來切割，產生相對硬的512/128 =4 個Map Task。

Source:Best Practices for Amazon EMR p14

同樣的MapRecue Job場景搬到EMR，FileSystem換成S3會有怎樣的不一樣呢？首先S3是Object Storage，之前有提過Object Storage有幾個特性：

Object Storage 存取是針對整個物件做動作，像是create Object、 delete object....等
另一個重點是要提供HTTP API (Restful)

所以我們不知道S3底層是怎麼儲存物件，但是我們存取物件都必須是以一整個物件來操作，因此S3FS底層的原理就是把讀取某個Block轉換成HTTP Range Request(一次要求一個Renge 的片段)

Source:Best Practices for Amazon EMR p14

而當要運算一個大檔案的時候，就會變成每個MapTask 個自取讀取他們相對應範圍(or 片段)的物件。(例：GET FILE X Range: byte=0-10000)

Source:Best Practices for Amazon EMR p15

因此使用S3取代HDFS來運算在實務上仍有許多需要注意的地方：

1. Hadoop provides two filesystems that use S3.

S3 Native FileSystem (URI scheme: s3n)

A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

S3 Block FileSystem (URI scheme: s3)

A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

2. 使用S3的原理就是從S3"拉資料"來運算，在某種程度上就失去了data locality optimization 的優勢。

3. 目前仍有許多問題如：

Block loss in S3FS due to S3 inconsistency on file rename

The symptom of this problem is a file in the filesystem that is an exact multiple of the FS block size - exactly 32MB, 64MB, 96MB, etc. in length. The issue appears to be caused by renaming a file that has recently been written, and getting a stale INode read from S3. When a reducer is writing job output to the S3FS, the normal series of S3 key writes for a 3-block file looks something like this

因此在這篇White Paper有做了以下建議，有興趣可以去看white paper詳細內容：