Apache Spark Learning Trilogy: Learn It, Debug It, Write It Well

Tuesday, April 24, 2018

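I have been writing Spark programs intensively lately, and it feels like it is finally time to move on to the next stage. As with any programming language or framework, learning Spark takes three steps: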

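First, learn how to write it. There are plenty of examples online, but most are in Scala or Python, so porting them to Java takes some extra effort. Once you start submitting programs to run on Spark, you begin to hit a string of odd error messages, such as: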

Futures timed out after [300 seconds]
This timeout is controlled by spark.executor.heartbeatInterval
cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.UDFRegistration$$anonfun$27.f
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
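
As a rough illustration of where these messages tend to lead, here is a minimal sketch in plain Java (no Spark dependency) that assembles `spark-submit --conf` flags for the settings they usually point at. The class name `SparkTuningFlags`, the jar name `app.jar`, and every value are my own placeholders, not recommendations; the mapping from message to setting is a common cause, not the only one. The `SerializedLambda` error is left out because, as far as I know, it is usually a classpath problem (the application jar not reaching the executors) rather than something a single property fixes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class SparkTuningFlags {

    // Settings commonly associated with the errors listed above.
    // All values are illustrative placeholders; tune them for your own cluster.
    static Map<String, String> suggestedSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // "Futures timed out after [300 seconds]": 300 s is the default of
        // spark.sql.broadcastTimeout, so a slow broadcast join is one common cause.
        conf.put("spark.sql.broadcastTimeout", "600");
        // Heartbeat timeouts: the interval should stay well below spark.network.timeout.
        conf.put("spark.executor.heartbeatInterval", "10s");
        conf.put("spark.network.timeout", "300s");
        // "Initial job has not accepted any resources": ask for no more memory
        // and cores than the registered workers can actually offer.
        conf.put("spark.executor.memory", "2g");
        conf.put("spark.cores.max", "4"); // standalone / Mesos deployments
        return conf;
    }

    // Renders the settings as "--conf key=value" pairs, in insertion order.
    static String toSubmitFlags(Map<String, String> conf) {
        StringJoiner flags = new StringJoiner(" ");
        for (Map.Entry<String, String> e : conf.entrySet()) {
            flags.add("--conf " + e.getKey() + "=" + e.getValue());
        }
        return flags.toString();
    }

    public static void main(String[] args) {
        // "app.jar" is a hypothetical application jar.
        System.out.println("spark-submit " + toSubmitFlags(suggestedSettings()) + " app.jar");
    }
}
```

The same keys can also go into `spark-defaults.conf` or be set on a `SparkConf` in code; passing them as flags just makes it easy to try different values between runs.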

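This is when you need to start learning how to tune system parameters and troubleshoot errors. The site below organizes this quite well, dividing Spark performance tuning into several major areas: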