Data Engineering Project: Stream Edition
- 1. Introduction
- 2. Sample project
- 3. Streaming concepts
- 4. Future work
- 5. Conclusion
- 6. Further reading
- 7. References
1. Introduction
Stream processing differs from batch; one needs to be mindful of the system's memory, event order, and system r...
Data Engineering Best Practices - #1. Data flow & Code
- 1. Introduction
- 2. Sample project
- 3. Best practices
- 3.1. Use standard patterns that progressively transform your data
- 3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
- 3.3. Avoid data duplicates ...
What is a self-serve data platform & how to build one
- 1. Introduction
- 2. What is self-serve?
- 3. Building a self-serve data platform
- 4. Conclusion
- 5. Further reading
- 6. References
1. Introduction
Most companies want to build a self-serve data platform. But what does a self-serve data plat...
What is an Open Table Format? & Why to use one?
- 1. Introduction
- 2. What is an Open Table Format (OTF)
- 3. Why use an Open Table Format (OTF)
- 4. Conclusion
- 5. Further reading
- 6. References
1. Introduction
If you are in the data space, you might have heard of open table formats such as Apac...
6 Steps to Avoid Messy Data in Your Warehouse
- 1. Introduction
- 2. Six Steps for a Clean Data Warehouse
- 2.1. Understand the business
- 2.2. Make data easy to use with the appropriate data model
- 2.3. Good input data is necessary for a good data warehouse
- 2.4. Define Source of Truth (SOT) and ...
Uplevel your dbt workflow with these tools and techniques
- 1. Introduction
- 2. Setup
- 3. Ways to uplevel your dbt workflow
- 3.1. Reproducible environment
- 3.2. Reduce feedback loop time when developing locally
- 3.3. Reduce the amount of code to write using dbt packages
- 3.4. Validate data bef...
Data Engineering Best Practices - #2. Metadata & Logging
- 1. Introduction
- 2. Setup & Logging architecture
- 3. Data Pipeline Logging Best Practices
- 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline
- 3.2. Obtain visibility into the code's execution sequence us...
How to test PySpark code with pytest
- 1. Introduction
- 2. Ensure the code's logic is working as expected with tests
- 3. Conclusion
- 4. Further Reading
- 5. References
1. Introduction
Have you worked, or are you working, with a code base that "moved fast" but had zero to no tests? Every minor feat...
Docker Fundamentals for Data Engineers
1. Introduction
Docker can be overwhelming to start with. Most data projects use Docker to set up the data infra locally (and often in production). Setting up data tools locally without Docker is (usually) a nightmare! The official Docker documentation, while ex...
How to reduce your Snowflake cost
- 1. Introduction
- 2. Snowflake pricing and settings inheritance model
- 3. Strategies to reduce Snowflake cost
- 4. Conclusion
- 5. Read more about using Snowflake
- 6. References
1. Introduction
Most data engineers love Snowflake; it is easy to get started, there...
Building Cost Efficient Data Pipelines with Python & DuckDB
- 1. Introduction
- 2. Project demo
- 3. TL;DR
- 4. Considerations when building pipelines with DuckDB
- 4.1. Use DuckDB to process data, not for multiple users to access data
- 4.2. Cost calculation: DuckDB + Ephemeral VMs = dirt cheap...
Enable stakeholder data access with Text-to-SQL RAGs
- 1. Introduction
- 2. TL;DR
- 3. Enabling Stakeholder data access with RAGs
- 3.1. Set up
- 3.2. Loading: Read raw data and convert them into LlamaIndex data structures
- 3.3. Indexing: Generate & store numerical representation of your data
- 3.4...
dbt(Data Build Tool) Tutorial
1. Introduction
If you are a student, analyst, engineer, or anyone in the data space and are curious about what dbt is and how to use it, then this post is for you.
If you are keen to understand why dbt is widely used, please read this article.
2. dbt, the T in ELT
In ...
Build Data Engineering Projects, with Free Template
- 1. Introduction
- 2. Run Data Pipeline
- 3. Architecture and services in this template
- 4. CI/CD setup
- 5. Putting it all together with a Makefile
- 6. Data projects using other tools and services
- 7. Conclusion
- 8. Further reading
- 9. Refer...
Python Essentials for Data Engineers
- Introduction
- Data is stored on disk and processed in memory
- Practicing Python
- Python basics
- Python is used for extracting data from sources, transforming it, & loading it into a destination
- [Extract & Load] Read and write data to any system
- [Transfo...
Data Engineering Projects
1. Introduction
Whether you are new to data engineering or have been in the data field for a few years, one of the most challenging parts of learning new frameworks is setting them up! Data infra is notoriously hard to set up. You want to improve your skills on a specific t...
Data Engineering Project for Beginners - Batch edition
- 1. Introduction
- 2. Objective
- 3. Run Data Pipeline
- 4. Architecture
- 5. Code walkthrough
- 6. Design considerations
- 7. Next steps
- 8. Conclusion
- 9. Further reading
- 10. References
1. Introduction
An actual data engineering project u...
SQL or Python for Data Transformations?
- 1. Introduction
- 2. Code is an interface to the execution engine
- 3. How to choose the execution engine and the coding interface
- 4. Conclusion
- 5. Further reading
- 6. References
1. Introduction
If you follow the data space, you would have noticed two c...
Why use Apache Airflow (or any orchestrator)?
- 1. Introduction
- 2. Features crucial to building and maintaining data pipelines
- 3. Conclusion
- 4. Further reading
1. Introduction
Are you trying to understand why someone would use a system like Airflow (or Dagster) to run simple scripts? If you ar...
How to implement data quality checks with greatexpectations
- 1. Introduction
- 2. Project overview
- 3. Check your data before making it available to end-users; Write-Audit-Publish(WAP) pattern
- 4. TL;DR: How the greatexpectations library works
- 5. From an implementation perspective, there are fo...