What, why, when to use Apache Kafka, with an example
I have seen, heard, and been asked questions and comments like
What is Kafka and when should I use it?
I don't understand why we have to use Kafka
The objective of this post is to get you up to speed with what Apache Kafka is, when to use it and ...
Ensuring Data Quality, With Great Expectations
What is data quality
As the name suggests, it refers to the quality of our data. Quality should be defined based on your project requirements. It can be as simple as ensuring a certain column has only the allowed values present or falls within a given ra...
Designing a "low-effort" ELT system, using stitch and dbt
Intro
A very common use case in data engineering is building an ETL system for a data warehouse: loading data from multiple separate databases so that data analysts/scientists can run queries on this data, since the sourc...
How to Pull Data from an API, Using AWS Lambda
Introduction
If you are looking for a simple, cheap data pipeline to pull small amounts of data from a stable API and store it in cloud storage, then serverless functions are a good choice. This post aims to answer questions like the ones shown below
...
How to do Change Data Capture (CDC), using Singer
Introduction
Change data capture is a software design pattern used to track every change (update, insert, delete) to the data in a database. In most databases these types of changes are added to an append-only log (Binlog in MySQL, Write Ahead Log in ...
How to unit test sql transforms in dbt
Introduction
With the recent advancements in data warehouses and tools like dbt, most transformations (the T of ELT) are being done directly in the data warehouse. While this provides a lot of functionality out of the box, it gets tricky when you want to test your sq...
How to Join a fact and a type 2 dimension (SCD2) table
- Introduction
- What is an SCD2 table and why use it?
- Setup
- Joining fact and SCD2 tables
- Conclusion
- Further reading
Introduction
If you are using a data warehouse, you would have heard of fact and dimension tables. Simply put, fact tabl...
How to update millions of records in MySQL?
- Introduction
- Setup
- Problems with a single large update
- Updating in batches
- Conclusion
- Further reading
Introduction
When updating a large number of records in an OLTP database, such as MySQL, you have to be mindful about locking the records. If ...
How to set up a dbt data-ops workflow, using dbt cloud and Snowflake
- Introduction
- Pre-requisites
- Setting up the data-ops pipeline
- Conclusion and next steps
- Further reading
- References
Introduction
With companies realizing the importance of having correct data, there has been a lot of atte...
Apache Superset Tutorial
- Why data exploration
- Apache Superset architecture
- Setup
- Using Apache Superset
- Pros and Cons
- Conclusion
Why data exploration
In most companies the end users of a data warehouse are analysts, data scientists and business people. Visualizing data is a powerful tool ...
How to trigger a spark job from AWS Lambda
- Event driven pipelines
- Lambda function to trigger spark jobs
- Setup and run
- Monitoring and logging
- Teardown
- Conclusion
- Further reading
- References
Event driven pipelines
Event-driven systems represent a software design pattern where logic is...
Writing memory efficient data pipelines in Python
- Introduction
- 1. Using generators
- 2. Using distributed frameworks
- Conclusion
- Further reading
- References
Introduction
If you are
- Wondering how to write memory efficient data pipelines in Python
- Working with a dataset that is too large to fi...
How to gather requirements to re-engineer a legacy data pipeline
Introduction
As a data engineer, you will have to re-engineer legacy data pipelines. While re-engineering data pipelines, if you have struggled with
- a lack of clarity of deliverables among the project's stakeholders
- constantly being qu...
Designing a Data Project to Impress Hiring Managers
- Introduction
- Objective
- Setup
- Project
- Future Work
- Tear down infra
- Conclusion
- Further Reading
- References
Introduction
Building a data project for your portfolio is hard. Getting hiring managers to read through your GitHub code is ev...
How to make data pipelines idempotent
- What is an idempotent function
- Pre-requisites
- Why idempotency matters
- Making your data pipeline idempotent
- Conclusion
- Further reading
- References
What is an idempotent function
“Idempotence is the property of certain operations in mathematics and co...
4 Key Patterns to Load Data Into A Data Warehouse
Introduction
Loading data into a data warehouse is a key component of most data pipelines. If you are wondering
- How to handle SQL loads
- What are the patterns used to load data into a data warehouse?
Then this post is for you. In this post, we go over...
How to Validate Datatypes in Python
- Introduction
- Using Native Python
- Using Pydantic
- Pydantic Caveats
- Conclusion
- Further reading
- References
Introduction
Data type issues are one of the biggest concerns when processing data in Python. If you are wondering how to
- Make sure that a column i...
How to Scale Your Data Pipelines
- 1. Introduction
- 2. What is scaling & why do we need it?
- 3. Types of scaling
- 4. Choose your scaling strategy
- 5. Conclusion
- 6. Further reading
- 7. References
1. Introduction
Choosing tools/frameworks to scale your data pipelines can be confusing. If you ha...
Understand & Deliver on Your Data Engineering Task
- 1. Introduction
- 2. Understanding your data engineering task
- 3. Delivering your data engineering task
- 4. Conclusion
- 5. Further reading
1. Introduction
Congratulations! You are given a quick overview of the business and data architecture and...
What is a staging area?
- 1. Introduction
- 2. What is a staging area
- 3. The advantages of having a staging area
- 5. Conclusion
- 6. Further reading
1. Introduction
Working with data pipelines, you might have noticed a staging area in most data pipelines. If you work in the data space and have qu...