Data Pipeline Design Patterns - #2. Coding patterns in Python
- Introduction
- Sample project
- Code design patterns
- Python helpers
- Misc
- Conclusion
- Further reading
- References
Introduction
Using the appropriate code design pattern can make your code easy to read, extensible, and seamless to...
Change Data Capture, with Debezium
Introduction
Change data capture is a pattern where every change to a row in a table is captured and sent to downstream systems. If you have wondered
How to ingest data from multiple databases into your data warehouse?
How to make data available for analytical quer...
How to become a valuable data engineer
1. Introduction
So you are a new data engineer (or looking for a DE job) and want to better yourself as a data engineer. However, when you look at job postings or a company's tech stack, you are overwhelmed by the sheer number of tools you have to learn! You feel o...
Data Engineering Project: Stream Edition
- 1. Introduction
- 2. Sample project
- 3. Streaming concepts
- 4. Future work
- 5. Conclusion
- 6. Further reading
- 7. References
1. Introduction
Stream processing differs from batch; one needs to be mindful of the system's memory, event order, and system r...
Data Engineering Best Practices - #1. Data flow & Code
- 1. Introduction
- 2. Sample project
- 3. Best practices
- 3.1. Use standard patterns that progressively transform your data
- 3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
- 3.3. Avoid data duplicates ...
What is a self-serve data platform & how to build one
- 1. Introduction
- 2. What is self-serve?
- 3. Building a self-serve data platform
- 4. Conclusion
- 5. Further reading
- 6. References
1. Introduction
Most companies want to build a self-serve data platform. But what does a self-serve data plat...
What is an Open Table Format? & Why to use one?
- 1. Introduction
- 2. What is an Open Table Format (OTF)
- 3. Why use an Open Table Format (OTF)
- 4. Conclusion
- 5. Further reading
- 6. References
1. Introduction
If you are in the data space, you might have heard of open table formats such as Apac...
6 Steps to Avoid Messy Data in Your Warehouse
- 1. Introduction
- 2. Six Steps for a Clean Data Warehouse
- 2.1. Understand the business
- 2.2. Make data easy to use with the appropriate data model
- 2.3. Good input data is necessary for a good data warehouse
- 2.4. Define Source of Truth (SOT) and ...
Uplevel your dbt workflow with these tools and techniques
- 1. Introduction
- 2. Setup
- 3. Ways to uplevel your dbt workflow
- 3.1. Reproducible environment
- 3.2. Reduce feedback loop time when developing locally
- 3.3. Reduce the amount of code to write using dbt packages
- 3.4. Validate data bef...
Data Engineering Best Practices - #2. Metadata & Logging
- 1. Introduction
- 2. Setup & Logging architecture
- 3. Data Pipeline Logging Best Practices
- 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline
- 3.2. Obtain visibility into the code's execution sequence us...
How to test PySpark code with pytest
- 1. Introduction
- 2. Ensure the code's logic is working as expected with tests
- 3. Conclusion
- 4. Further Reading
- 5. References
1. Introduction
Have you worked, or are you working, with a code base that "moved fast" but had little to no tests? Every minor feat...
Docker Fundamentals for Data Engineers
1. Introduction
Docker can be overwhelming to start with. Most data projects use Docker to set up the data infra locally (and often in production). Setting up data tools locally without Docker is (usually) a nightmare! The official Docker documentation, while ex...
How to reduce your Snowflake cost
- 1. Introduction
- 2. Snowflake pricing and settings inheritance model
- 3. Strategies to reduce Snowflake cost
- 4. Conclusion
- 5. Read more about using Snowflake
- 6. References
1. Introduction
Most data engineers love Snowflake: it is easy to get started, there...
Building Cost Efficient Data Pipelines with Python & DuckDB
- 1. Introduction
- 2. Project demo
- 3. TL;DR
- 4. Considerations when building pipelines with DuckDB
- 4.1. Use DuckDB to process data, not for multiple users to access data
- 4.2. Cost calculation: DuckDB + Ephemeral VMs = dirt cheap...
Enable stakeholder data access with Text-to-SQL RAGs
- 1. Introduction
- 2. TL;DR
- 3. Enabling Stakeholder data access with RAGs
- 3.1. Set up
- 3.2. Loading: Read raw data and convert them into LlamaIndex data structures
- 3.3. Indexing: Generate & store numerical representation of your data
- 3.4...
dbt(Data Build Tool) Tutorial
1. Introduction
If you are a student, analyst, engineer, or anyone in the data space and are curious about what dbt is and how to use it, then this post is for you.
If you are keen to understand why dbt is widely used, please read this article.
2. dbt, the T in ELT
In ...
Build Data Engineering Projects, with Free Template
- 1. Introduction
- 2. Run Data Pipeline
- 3. Architecture and services in this template
- 4. CI/CD setup
- 5. Putting it all together with a Makefile
- 6. Data projects using other tools and services
- 7. Conclusion
- 8. Further reading
- 9. Refer...
Python Essentials for Data Engineers
- Introduction
- Data is stored on disk and processed in memory
- Practicing Python
- Python basics
- Python is used for extracting data from sources, transforming it, & loading it into a destination
- [Extract & Load] Read and write data to any system
- [Transfo...
Data Engineering Projects
1. Introduction
Whether you are new to data engineering or have been in the data field for a few years, one of the most challenging parts of learning new frameworks is setting them up! Data infra is notoriously hard to set up. You want to improve your skills on a specific t...
Data Engineering Project for Beginners - Batch edition
- 1. Introduction
- 2. Objective
- 3. Run Data Pipeline
- 4. Architecture
- 5. Code walkthrough
- 6. Design considerations
- 7. Next steps
- 8. Conclusion
- 9. Further reading
- 10. References
1. Introduction
An actual data engineering project u...