Do you know how many ways there are to launch Spark on demand on Azure?

Monday, April 20, 2020

Source: Provision on-demand Spark clusters on Docker using Azure

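I recently needed to start analyzing some logs, and the most intuitive approach was to use the tool I know best: Spark. So I started looking into what convenient ways exist today to launch Spark on Azure. For AWS and GCP, I had already looked into their dedicated PaaS services: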

On AWS it is EMR (see "About AWS Elastic MapReduce")
On GCP it is Dataproc (see "Google Cloud Dataproc: how to build a custom image to speed up PySpark environment deployment")

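I know Azure has HDInsight, but when I used it before I found it less convenient than GCP Dataproc. Now that it is 2020, I wondered whether any new solutions had appeared. After all, Azure's "strength" is growing its own services through extensive integration with third-party ISVs.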

One search was enough to trigger my information anxiety and choice paralysis...

Just from this one article, "Dear Spark developers: Welcome to Azure Cognitive Services", you can see how many options there are:

Cognitive Services on Spark enable working with Azure’s Intelligent Services at massive scales with the Apache Spark™ distributed computing ecosystem. The Cognitive Services on Spark are compatible with any Spark 2.4 cluster such as Azure Databricks, Azure Distributed Data Engineering Toolkit (AZTK) on Azure Batch, Spark in SQL Server, and Spark clusters on Azure Kubernetes Service. Furthermore, we provide idiomatic bindings in PySpark, Scala, Java, and R (Beta).

Besides HDInsight, the following approaches are currently available:

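Azure Databricks - similar to HDInsight

Azure has partnered with Databricks, the company founded by Spark's creators, so that a fully integrated, preconfigured Databricks Spark cluster can be launched on Azure directly through Azure Data Factory.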

In addition, as with HDInsight, Azure has integrated Livy here as well, so if you keep a long-running Spark cluster you can submit jobs to it directly through the REST API.
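To make the "submit a job over REST" idea concrete, here is a minimal sketch of a batch submission against Apache Livy's `POST /batches` endpoint. The endpoint URL, the script path, and the helper name `build_livy_batch_request` are all made up for illustration. Authentication is left out because it depends on how the cluster is exposed. The code only builds the request; the actual send is left as a comment.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, args=None, conf=None):
    """Build (but do not send) a POST /batches request for Apache Livy.

    `file`, `args` and `conf` are fields of Livy's batch API; the helper
    itself is hypothetical and not part of any SDK.
    """
    payload = {"file": app_file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_livy_batch_request(
        "http://livy.example.internal:8998",  # hypothetical Livy endpoint
        "wasbs:///jobs/analyze_logs.py",      # hypothetical script location
        args=["2020-04-20"],
        conf={"spark.executor.instances": "2"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit, add whatever authentication your cluster needs, then:
    # urllib.request.urlopen(request)
```

Livy answers a batch submission with a JSON description of the new batch, which can then be polled for its state.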

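Deploying Spark on AKS

If you already have an AKS cluster, this is another option: deploy Spark to AKS as Docker containers, which saves the time of provisioning and managing machines.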
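As a rough sketch of what this looks like in practice, the snippet below assembles a `spark-submit` command that targets a Kubernetes API server in cluster mode, which is how Spark's native Kubernetes support is driven. The API server address, container image name, and application path are placeholders, and `build_k8s_spark_submit` is a hypothetical helper. It only prints the command and assumes you have already built and pushed a Spark image that your AKS cluster can pull.

```python
import shlex


def build_k8s_spark_submit(api_server, image, app_path, executors=2,
                           name="log-analysis"):
    """Return the spark-submit argv for cluster mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


if __name__ == "__main__":
    cmd = build_k8s_spark_submit(
        api_server="https://my-aks.example.azmk8s.io:443",  # placeholder address
        image="myregistry.azurecr.io/spark-py:2.4.5",       # placeholder image
        # The local:// scheme means the file is already inside the image.
        app_path="local:///opt/spark/work-dir/analyze_logs.py",
        executors=3,
    )
    print(shlex.join(cmd))
    # To run it for real: subprocess.run(cmd, check=True)
```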

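Using the Azure Distributed Data Engineering Toolkit (AZTK) to deploy to Azure Batch

This approach is the closest to how GCP Dataproc works. You download the Azure Distributed Data Engineering Toolkit (AZTK) command-line tool; whenever you need a cluster you start one from the command line, submit your program to it, and the cluster is torn down once the run finishes. This should be the cheapest option.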
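The create, submit, delete cycle described above can be sketched as three command lines. The `aztk spark cluster` subcommands and flags below are written from memory of the AZTK README and should be treated as assumptions, so check them against `aztk spark cluster --help`. The cluster id, VM size, and script name are placeholders, and `aztk_lifecycle` is a hypothetical helper that only builds the commands.

```python
import shlex


def aztk_lifecycle(cluster_id, size, vm_size, app_path, app_args=()):
    """Return the create / submit / delete commands for one on-demand run.

    Flags are assumed from the AZTK README; verify them before use.
    """
    base = ["aztk", "spark", "cluster"]
    create = base + ["create", "--id", cluster_id,
                     "--size", str(size), "--vm-size", vm_size]
    submit = base + ["submit", "--id", cluster_id,
                     "--name", f"{cluster_id}-job", app_path, *app_args]
    delete = base + ["delete", "--id", cluster_id]
    return [create, submit, delete]


if __name__ == "__main__":
    steps = aztk_lifecycle(
        cluster_id="log-analysis",   # placeholder cluster id
        size=3,
        vm_size="standard_d2_v2",    # placeholder VM size
        app_path="analyze_logs.py",  # placeholder PySpark script
        app_args=["2020-04-20"],
    )
    for step in steps:
        print(shlex.join(step))
        # To run for real: subprocess.run(step, check=True)
```

Running the delete step even when the submit step fails (for example in a `try`/`finally`) is what keeps an on-demand cluster from lingering and accruing cost.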

Reference: