Monday, November 11, 2019

Apache Spark 3.0 Release Preview






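Before I knew it, Apache Spark 3.0-preview was already out, and I'm about to fall behind the times again... 🙈
The 3.0 release is packed with changes, so before digging into the exciting ones, let's review what the Spark+AI Summit 2019 keynote previewed and what we should keep an eye on.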

  • More vector/matrix computation and GPU support
  • Tighter integration with K8s
  • Koalas - the pandas DataFrame API on Spark



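With that review done, let's dig through the release notes and see what stands out. After going through the Jira issues and the mailing list thread "Spark 3.0 preview release feature list and major changes", I put together the following highlights.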


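More GPU support (including the scheduler)

Before 3.0, if you wanted to do ML on Spark (the JVM world), GPU support was where it got beaten most badly by the Python world. This release starts to close that gap with a lot of new work.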

  • [SPARK-27373] - Design: Kubernetes support for GPU-aware scheduling
  • [SPARK-27378] - spark-submit requests GPUs in YARN mode
  • [SPARK-27024] - Executor interface for cluster managers to support GPU resources
  • [SPARK-27360] - Standalone cluster mode support for GPU-aware scheduling
  • [SPARK-27361] - YARN support for GPU-aware scheduling
  • [SPARK-27362] - Kubernetes support for GPU-aware scheduling
  • [SPARK-27364] - User-facing APIs for GPU-aware scheduling
  • [SPARK-27366] - Spark scheduler internal changes to support GPU scheduling
  • [SPARK-27488] - Driver interface to support GPU resources 
  • [SPARK-27492] - GPU scheduling - High level user documentation
  • [SPARK-27897] - GPU Scheduling - move example discovery Script to scripts directory
  • [SPARK-20327] - Add CLI support for YARN custom resources, like GPUs
  • [SPARK-27948] - GPU Scheduling - Use ResourceName to represent resource names
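
To make the tickets above a bit more concrete, here is a minimal sketch of how a GPU request might be assembled for spark-submit. The three configuration keys follow the names in the Spark 3.0 configuration docs, and a preview build may differ. The helper names (`gpu_conf`, `to_submit_args`, `discovery_output`) and the script path in the usage note are hypothetical, made up for illustration.

```python
import json


def gpu_conf(gpus_per_executor, gpus_per_task, discovery_script):
    """Return Spark settings that request GPUs for executors and tasks.

    Key names are assumed from the Spark 3.0 docs; verify against your build.
    """
    if gpus_per_task > gpus_per_executor:
        raise ValueError("a task cannot request more GPUs than its executor has")
    return {
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(gpus_per_task),
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }


def to_submit_args(conf):
    """Flatten a settings dict into `--conf key=value` pairs for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def discovery_output(addresses):
    """Build the JSON a discovery script prints: resource name plus addresses."""
    return json.dumps({"name": "gpu", "addresses": [str(a) for a in addresses]})
```

For example, `to_submit_args(gpu_conf(2, 1, "/opt/spark/getGpus.sh"))` yields the flags for two GPUs per executor and one per task. The discovery script is what lets an executor report which GPU addresses it owns, which is why its output is a small JSON document rather than a plain count.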

More Kubernetes integration and support


  • [SPARK-25815] - Kerberos Support in Kubernetes resource manager (Client Mode)
  • [SPARK-27362] - Kubernetes support for GPU-aware scheduling
  • [SPARK-25960] - Support subpath mounting with Kubernetes
  • [SPARK-25222] - Spark on Kubernetes Pod Watcher dumps raw container status
  • [SPARK-25828] - Bumping Version of kubernetes.client to latest version
  • [SPARK-26194] - Support automatic spark.authenticate secret in Kubernetes backend
  • [SPARK-28938] - Move to supported OpenJDK docker image for Kubernetes

ML-related support

  • [SPARK-13677] - Support Tree-Based Feature Transformation for ML
  • [SPARK-23674] - Add Spark ML Listener for Tracking ML Pipeline Status
  • [SPARK-16692] - multilabel classification to DataFrame, ML

Pandas DataFrame

  • [SPARK-28226] - Document Pandas UDF mapParitionsInPandas
  • [SPARK-28041] - Increase the minimum pandas version to 0.23.2
  • [SPARK-29126] - Add usage guide for cogroup Pandas UDF


 DataSource V2

  • More query syntax, e.g. (implement USE CATALOG/NAMESPACE for Data Source V2)

Other highlights

  • [SPARK-25560] - Allow Function Injection in SparkSessionExtensions

  • [SPARK-11150] - Dynamic partition pruning
  • [SPARK-28739] - Add a simple cost check for Adaptive Query Execution

Major dependency upgrades




Further reading:
