3 Key Points to Help You Partition Late Arriving Events
One of the most common issues when ingesting and processing user-generated events is how to deal with late-arriving events. Yet this topic is not extensively discussed. Some of the general issues that data engineers usually have are
“What shou...
A proven approach to land a Data Engineering job
I have seen and been asked the following questions by students, backend engineers and analysts who want to get into the data engineering industry.
What approach should I take to land a Data Engineering job?
I really want to get into DE. What can I do ...
What Does It Mean for a Column to Be Indexed
When optimizing queries on a database table, most developers tend to just create an index on the field to be queried. They have questions like
I don’t really understand what it means for a column to be “indexed”
in addition to simply boosting the efficien...
What, why, when to use Apache Kafka, with an example
I have seen, heard and been asked questions and comments like
What is Kafka and when should I use it?
I don’t understand why we have to use Kafka
The objective of this post is to get you up to speed with what Apache Kafka is, when to use it and ...
Ensuring Data Quality, With Great Expectations
What is data quality
As the name suggests, it refers to the quality of our data. Quality should be defined based on your project requirements. It can be as simple as ensuring a certain column has only the allowed values present or falls within a given ra...
Designing a "low-effort" ELT system, using stitch and dbt
Intro
A very common use case in data engineering is to build an ETL system for a data warehouse, loading data in from multiple separate databases so that data analysts/scientists can run queries on this data, since the sourc...
How to Pull Data from an API, Using AWS Lambda
Introduction
If you are looking for a simple, cheap data pipeline to pull small amounts of data from a stable API and store it in cloud storage, then serverless functions are a good choice. This post aims to answer questions like the ones shown below
...
How to do Change Data Capture (CDC), using Singer
Introduction
Change data capture is a software design pattern used to track every change (update, insert, delete) to the data in a database. In most databases these types of changes are added to an append-only log (Binlog in MySQL, Write Ahead Log in ...
How to unit test sql transforms in dbt
Introduction
With the recent advancements in data warehouses and tools like dbt, most transformations (the T of ELT) are being done directly in the data warehouse. While this provides a lot of functionality out of the box, it gets tricky when you want to test your sq...
How to Join a fact and a type 2 dimension (SCD2) table
- Introduction
- What is an SCD2 table and why use it?
- Setup
- Joining fact and SCD2 tables
- Conclusion
- Further reading
Introduction
If you are using a data warehouse, you have probably heard of fact and dimension tables. Simply put, fact tabl...
How to update millions of records in MySQL?
- Introduction
- Setup
- Problems with a single large update
- Updating in batches
- Conclusion
- Further reading
Introduction
When updating a large number of records in an OLTP database, such as MySQL, you have to be mindful of locking the records. If ...
How to set up a dbt data-ops workflow, using dbt cloud and Snowflake
- Introduction
- Pre-requisites
- Setting up the data-ops pipeline
- Conclusion and next steps
- Further reading
- References
Introduction
With companies realizing the importance of having correct data, there has been a lot of atte...
Apache Superset Tutorial
- Why data exploration
- Apache Superset architecture
- Setup
- Using Apache Superset
- Pros and Cons
- Conclusion
Why data exploration
In most companies the end users of a data warehouse are analysts, data scientists and business people. Visualizing data is a powerful tool ...
How to trigger a spark job from AWS Lambda
- Event driven pipelines
- Lambda function to trigger spark jobs
- Setup and run
- Monitoring and logging
- Teardown
- Conclusion
- Further reading
- References
Event driven pipelines
Event driven systems represent a software design pattern where logic is...
Writing memory efficient data pipelines in Python
- Introduction
- 1. Using generators
- 2. Using distributed frameworks
- Conclusion
- Further reading
- References
Introduction
If you are
- Wondering how to write memory efficient data pipelines in Python
- Working with a dataset that is too large to fi...
How to gather requirements to re-engineer a legacy data pipeline
Introduction
As a data engineer, you will have to re-engineer legacy data pipelines. While re-engineering data pipelines, if you have struggled with
a lack of clarity of deliverables among the project’s stakeholders.
constantly being qu...
Designing a Data Project to Impress Hiring Managers
- Introduction
- Objective
- Setup
- Project
- Future Work
- Tear down infra
- Conclusion
- Further Reading
- References
Introduction
Building a data project for your portfolio is hard. Getting hiring managers to read through your GitHub code is ev...
How to make data pipelines idempotent
- What is an idempotent function
- Pre-requisites
- Why idempotency matters
- Making your data pipeline idempotent
- Conclusion
- Further reading
- References
What is an idempotent function
“Idempotence is the property of certain operations in mathematics and co...
4 Key Patterns to Load Data Into A Data Warehouse
Introduction
Loading data into a data warehouse is a key component of most data pipelines. If you are wondering
How to handle SQL loads
What are the patterns used to load data into a data warehouse?
Then this post is for you. In this post, we go over...
How to Validate Datatypes in Python
- Introduction
- Using Native Python
- Using Pydantic
- Pydantic Caveats
- Conclusion
- Further reading
- References
Introduction
Data type issues are one of the biggest concerns when processing data in Python. If you are wondering how to
Make sure that a column i...