3 Key techniques to optimize your Apache Spark code
- Intro
- Distributed Systems
- Setup
- Optimizing your Spark code
- Technique 1: reduce data shuffle
- Technique 2: use caching when necessary - ...
Similar Articles (10 found)
🔍 65.2% similar
How to submit Spark jobs to EMR cluster from Airflow
Table of Contents
Introduction
I have been asked, and have seen questions about,
how others are automating...
🔍 62.9% similar
How to trigger a spark job from AWS Lambda
- Event driven pipelines
- Lambda function to trigger spark jobs
- Setup and run
- Monitoring and logging
-...
🔍 62.2% similar
What do Snowflake, Databricks, Redshift, BigQuery actually do?
- 1. Introduction
- 2. Analytical databases aggregate large amounts of data
- 3. Most p...
🔍 61.3% similar
Building Cost Efficient Data Pipelines with Python & DuckDB
- 1. Introduction
- 2. Project demo
- 3. TL;DR
- 4. Considerations when building pipelines...
🔍 59.8% similar
Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster
Exploring the Hadoop ecosystem — key tools to maximize ...
🔍 59.6% similar
How to quickly set up a local Spark development environment?
- 1. Introduction
- 2. Setup
- 3. Use VSCode devcontainers to set up Spark environment
- ...
🔍 59.1% similar
How to improve at SQL as a data engineer
- 1. Introduction
- 2. SQL skills
- 3. Practice
- 4. Conclusion
- 5. Further reading
- 6. References
1. Intro...
🔍 58.5% similar
How to quickly deliver data to business users? #1. Adv Data types & Schema evolution
- 1. Introduction
- 2. Use Schema evolution & advanced data types...
🔍 58.3% similar
Data Engineering Best Practices - #1. Data flow & Code
- 1. Introduction
- 2. Sample project
- 3. Best practices
- 3.1. Use standard patterns that pro...
🔍 57.9% similar
Data Engineering Project for Beginners - Batch edition
- 1. Introduction
- 2. Objective
- 3. Run Data Pipeline
- 4. Architecture
- 5. Code walkthrough...