This guide helps you quickly explore the main features of Delta Lake.
It provides code snippets that show how to read from and write to Delta tables with Amazon EMR.
For more details, check this video.
- Create an S3 bucket for Delta Lake (e.g. `learn-deltalake-2022`)
- Create an EMR cluster using the AWS CDK (check the details in the instructions)
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/
- Create a Jupyter Notebook
- Upload `deltalake-with-emr-demo.ipynb` into the Jupyter Notebook
- Set the kernel to PySpark, and run each cell
- To run Amazon Athena queries on Delta Lake, check this
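The notebook steps above boil down to a short read/write round trip. Below is a minimal sketch, assuming the PySpark kernel's built-in `spark` session on a cluster configured with the Delta Lake packages shown in this guide; the bucket name and table path are placeholders, not taken from the demo notebook:

```python
# Placeholder path -- replace with your own bucket/table location
delta_path = "s3://learn-deltalake-2022/tables/events"

# Write a small DataFrame out in Delta format
df = spark.createDataFrame(
    [(1, "created"), (2, "updated")],
    ["id", "event_type"],
)
df.write.format("delta").mode("overwrite").save(delta_path)

# Read the Delta table back
spark.read.format("delta").load(delta_path).show()

# To query the table from Presto/Athena, generate a symlink manifest
from delta.tables import DeltaTable
DeltaTable.forPath(spark, delta_path).generate("symlink_format_manifest")
```

This sketch requires a live EMR cluster with the PySpark kernel, so it cannot be run standalone; see the demo notebook for the full walkthrough.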
- Amazon EMR Applications
  - Hadoop
  - Hive
  - JupyterHub
  - JupyterEnterpriseGateway
  - Livy
  - Apache Spark (>= 3.0)
- Apache Spark (PySpark)
```json
{
  "conf": {
    "spark.jars.packages": "io.delta:delta-core_2.12:{version}",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
  }
}
```
- YOU MUST REPLACE `{version}` with the appropriate Delta Lake version
- For more details, check this:
| Delta Lake version | Apache Spark version |
| --- | --- |
| 1.1.x | 3.2.x |
| 1.0.x | 3.1.x |
| 0.7.x and 0.8.x | 3.0.x |
| Below 0.7.x | 2.4.2 - 2.4.&lt;latest&gt; |
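As an example of applying the compatibility table above: on an EMR release that ships Spark 3.1.x, a matching Delta Lake 1.0.x package could be pinned in the notebook's first cell with the sparkmagic `%%configure` magic. The version number here is illustrative; check the table and your cluster's EMR release notes:

```
%%configure -f
{
  "conf": {
    "spark.jars.packages": "io.delta:delta-core_2.12:1.0.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
  }
}
```

The `-f` flag forces the Spark session to restart so the new configuration takes effect.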
- (video) Incremental Data Processing using Delta Lake with EMR
- (video) DBT + Spark/EMR + Delta Lake/S3
- Compatibility with Apache Spark
- Application versions in Amazon EMR 6.x releases
- Application versions in Amazon EMR 5.x releases
- Delta Core Maven Repository
- Set up Apache Spark with Delta Lake
- Presto and Athena to Delta Lake integration
- Redshift Spectrum to Delta Lake integration
- Support for automatic and incremental Presto/Athena manifest generation (#453)
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.