Source: Apache Spark JIRA
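In the blink of an eye, Apache Spark 3.0-preview has been released, and I am about to fall behind the times again...🙈
This 3.0 release contains an enormous number of changes, so before digging into the exciting ones, let's first review what the Spark+AI Summit 2019 keynote previewed that deserves our attention.
- More vector and matrix operations, plus GPU support
- Tighter integration with K8s
- Koalas - pandas DataFrame
With that review done, we can dig through the release notes to see what stands out. After going through the Jira issues and the mailing list thread "Spark 3.0 preview release feature list and major changes", I pulled out the following highlights.
More GPU support (including the scheduler)
Before 3.0, if you wanted to do ML on Spark (the JVM world), GPU support was the area where it was most easily outclassed by the Python world. This release starts to close much of that gap.
- [SPARK-27373] - Design: Kubernetes support for GPU-aware scheduling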
- [SPARK-27378] - spark-submit requests GPUs in YARN mode
- [SPARK-27024] - Executor interface for cluster managers to support GPU resources
- [SPARK-27360] - Standalone cluster mode support for GPU-aware scheduling
- [SPARK-27361] - YARN support for GPU-aware scheduling
- [SPARK-27362] - Kubernetes support for GPU-aware scheduling
- [SPARK-27364] - User-facing APIs for GPU-aware scheduling
- [SPARK-27366] - Spark scheduler internal changes to support GPU scheduling
- [SPARK-27488] - Driver interface to support GPU resources
- [SPARK-27492] - GPU scheduling - High level user documentation
- [SPARK-27897] - GPU Scheduling - move example discovery Script to scripts directory
- [SPARK-20327] - Add CLI support for YARN custom resources, like GPUs
- [SPARK-27948] - GPU Scheduling - Use ResourceName to represent resource names
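To make the GPU-aware scheduling items above more concrete, here is a minimal Python sketch of the two pieces a user touches: the JSON that a GPU discovery script prints, and the `--conf` flags passed to spark-submit. The config keys follow the Spark 3.0 configuration docs as I understand them and may differ in the preview build. The helper names and the script path `/opt/spark/getGpus.sh` are hypothetical, made up for illustration.

```python
import json


def gpu_resource_json(addresses):
    """Build the JSON a GPU discovery script is expected to print to stdout."""
    # One string address per visible device, under the resource name "gpu".
    return json.dumps({"name": "gpu", "addresses": [str(a) for a in addresses]})


def gpu_submit_conf(executor_gpus, task_gpus, discovery_script):
    """Return spark-submit --conf flags requesting GPUs for executors and tasks."""
    conf = {
        "spark.executor.resource.gpu.amount": executor_gpus,
        "spark.task.resource.gpu.amount": task_gpus,
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical values: two GPUs per executor, one GPU per task.
    print(gpu_resource_json([0, 1]))
    print(" ".join(gpu_submit_conf(2, 1, "/opt/spark/getGpus.sh")))
```

With two GPUs per executor and one per task, the scheduler could run at most two such tasks concurrently on each executor, which is the kind of accounting the scheduler changes listed above introduce.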
More Kubernetes integration and support
- [SPARK-25815] - Kerberos Support in Kubernetes resource manager (Client Mode)
- [SPARK-27362] - Kubernetes support for GPU-aware scheduling
- [SPARK-25960] - Support subpath mounting with Kubernetes
- [SPARK-25222] - Spark on Kubernetes Pod Watcher dumps raw container status
- [SPARK-25828] - Bumping Version of kubernetes.client to latest version
- [SPARK-26194] - Support automatic spark.authenticate secret in Kubernetes backend
- [SPARK-28938] - Move to supported OpenJDK docker image for Kubernetes
ML-related support
- [SPARK-13677] - Support Tree-Based Feature Transformation for ML
- [SPARK-23674] - Add Spark ML Listener for Tracking ML Pipeline Status
- [SPARK-16692] - multilabel classification to DataFrame, ML
About pandas DataFrame
- [SPARK-28226] - Document Pandas UDF mapPartitionsInPandas
- [SPARK-28041] - Increase the minimum pandas version to 0.23.2
- [SPARK-29126] - Add usage guide for cogroup Pandas UDF
DataSource V2
- More query syntax, e.g. (implement USE CATALOG/NAMESPACE for Data Source V2)
Other highlights
- [SPARK-25560] - Allow Function Injection in SparkSessionExtensions
- [SPARK-11150] - Dynamic partition pruning
- [SPARK-28739] - Add a simple cost check for Adaptive Query Execution
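Two of the items above, dynamic partition pruning and Adaptive Query Execution, are switched on and off through SQL configs. The sketch below renders such settings as `spark-defaults.conf` lines. The two keys are the ones I know from the Spark 3.0 docs, and their names and defaults in the preview build are an assumption. `render_spark_defaults` is a hypothetical helper, not part of Spark.

```python
def render_spark_defaults(conf):
    """Render a dict of settings as spark-defaults.conf lines (key, space, value)."""
    lines = []
    for key in sorted(conf):
        value = conf[key]
        # Write booleans in lowercase, the form used in Spark's documentation.
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f"{key} {value}")
    return "\n".join(lines)


# Assumed config keys; verify against the docs of the build you actually run.
optimizer_conf = {
    "spark.sql.adaptive.enabled": True,
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": True,
}

if __name__ == "__main__":
    print(render_spark_defaults(optimizer_conf))
```

The same key and value pairs can also be passed one at a time with `--conf` on spark-submit instead of going through a defaults file.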
Major dependency upgrades
- [SPARK-24417] - Build and Run Spark on JDK11
- [SPARK-25079] - [PYTHON] upgrade python 3.4 -> 3.6
- [SPARK-24516] - PySpark Bindings for K8S - make Python 3 the default
- [SPARK-24360] - Support Hive 3.1 metastore
- [SPARK-26014] - Deprecate R < 3.4 support
- [SPARK-26132] - Remove support for Scala 2.11 in Spark 3.0.0