Showing 20 of 636 URL(s)
(Page 17 of 32)
SQL or Python for Data Transformations?
- 1. Introduction
- 2. Code is an interface to the execution engine
- 3. How to choose the execution engine and the coding interface
- 4. Conclusion
- 5. Further reading
- 6. References
1. Introduction
If you follow the data space, you would have noticed two c...
Why use Apache Airflow (or any orchestrator)?
- 1. Introduction
- 2. Features crucial to building and maintaining data pipelines
- 3. Conclusion
- 4. Further reading
1. Introduction
Are you trying to understand why someone would use a system like Airflow (or Dagster) to run simple scripts? If you ar...
How to implement data quality checks with greatexpectations
- 1. Introduction
- 2. Project overview
- 3. Check your data before making it available to end-users: the Write-Audit-Publish (WAP) pattern
- 4. TL;DR: How the greatexpectations library works
- 5. From an implementation perspective, there are fo...
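The Write-Audit-Publish pattern named in this outline can be sketched without the greatexpectations API itself. This is a minimal, library-agnostic illustration; the `write`/`audit`/`publish` functions and the in-memory dicts are stand-ins for staging and production warehouse tables.

```python
# Minimal, library-agnostic sketch of the Write-Audit-Publish (WAP) pattern.
# Names (write, audit, publish) are illustrative; a real pipeline would run
# a greatexpectations suite in the audit step against real staging tables.

def write(staging: dict, rows: list) -> None:
    # 1. Write: land new data in a staging area end-users cannot see.
    staging["orders"] = rows

def audit(staging: dict) -> bool:
    # 2. Audit: run data quality checks against the staged data only.
    rows = staging["orders"]
    return len(rows) > 0 and all(r["amount"] >= 0 for r in rows)

def publish(staging: dict, prod: dict) -> None:
    # 3. Publish: promote staged data to the production table.
    prod["orders"] = staging.pop("orders")

staging, prod = {}, {}
write(staging, [{"amount": 10}, {"amount": 25}])
if audit(staging):
    publish(staging, prod)

print(prod["orders"])  # staged rows reach prod only after the audit passes
```

The key design point is that end-users only ever query `prod`; a failed audit leaves bad data stranded in staging.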
What are the types of data quality checks?
- 1. Introduction
- 2. Data Quality (DQ) checks are run as part of your pipeline
- 3. Run a background data monitoring job
- 4. Not all DQ failures require you to stop the pipeline
- 5. Cost of DQ checks
- 6. Data quality tools
- 7. Conclusion
- 8. Further r...
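The point that not all DQ failures require stopping the pipeline can be sketched with a severity flag per check: only `"error"`-level failures raise, while `"warn"`-level failures are recorded and the pipeline continues. The check names and predicates here are made up for illustration.

```python
# Sketch: not every data quality failure should stop the pipeline.
# Each check carries a severity; "error" failures raise, "warn" just logs.

def run_checks(rows, checks):
    warnings = []
    for name, predicate, severity in checks:
        if not predicate(rows):
            if severity == "error":
                raise ValueError(f"DQ check failed: {name}")
            warnings.append(name)  # warn-level: record and keep going
    return warnings

rows = [{"id": 1, "country": None}, {"id": 2, "country": "US"}]
checks = [
    ("ids are unique", lambda rs: len({r["id"] for r in rs}) == len(rs), "error"),
    ("country is filled", lambda rs: all(r["country"] for r in rs), "warn"),
]

warnings = run_checks(rows, checks)
print(warnings)  # ['country is filled'] — pipeline continues despite the warning
```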
Data Engineering Interview Preparation Series #1: Data Structures and Algorithms
- 1. Introduction
- 2. Data structures and algorithms to know
- 3. Common DSA questions asked during DE interviews
- 4. Company specific research
- 5. Conclusion
- 6. Further reading
1. Introduction
Preparing for data e...
How to build a data project with step-by-step instructions
- 1. Introduction
- 2. Setup
- 3. Parts of data engineering
- 3.1. Requirements
- 3.2. Identify what tool to use to process data
- 3.3. Data flow architecture
- 3.4. Data quality implementation
- 3.5. Code organization
- 3.6. Code testing
- ...
What are the Key Parts of Data Engineering?
1. Introduction
If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. The list of tools/frameworks to know can be overwhelming. If you are wondering
What are the parts of data ...
How to use nested data types effectively in SQL
- 1. Introduction
- 2. Code & Data
- 3. Using nested data types effectively
- 4. Conclusion
- 5. Continue reading
1. Introduction
If you have worked in the data space, you've inevitably come across tables with so many columns that it gets difficult to r...
How to decide on a data project for your portfolio
1. Introduction
Whether you are looking to improve your data skills or building portfolio projects to land a job, you would have faced the issue of deciding what and how to build data projects. If you are
Struggling to decide what tools/frameworks t...
25 SQL tips to level up your data engineering skills
- Introduction
- Setup
- SQL tips
- 1. Handy functions for common data processing scenarios
- 1.1. Need to filter on a WINDOW function without a CTE/subquery? Use QUALIFY
- 1.2. Need the first/last row in a partition? Use DISTINCT ON
- 1.3. STRUCT data...
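The QUALIFY and DISTINCT ON tips in this outline are warehouse-specific shorthand (QUALIFY exists in Snowflake and DuckDB; DISTINCT ON is a Postgres feature). A portable sketch of the pattern they replace — filtering on a window function via a subquery — runs even on Python's built-in sqlite3, assuming a SQLite build with window-function support (3.25+):

```python
# Portable form of the "filter on a window function" pattern: ROW_NUMBER in a
# subquery. With QUALIFY this collapses to a single SELECT ... QUALIFY rn = 1.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_id INT, amount INT);
    INSERT INTO orders VALUES ('a', 1, 10), ('a', 2, 30), ('b', 3, 20);
""")
# Latest order per customer.
rows = conn.execute("""
    SELECT customer, order_id, amount FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer ORDER BY order_id DESC
        ) AS rn
        FROM orders
    ) WHERE rn = 1
    ORDER BY customer
""").fetchall()
print(rows)  # [('a', 2, 30), ('b', 3, 20)]
```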
How to reference a seed from a different dbt project?
- 1. Introduction
- 2. Ways to reuse seed data across multiple dbt projects
- 3. dbt deps = download all dependency packages to your local dbt_packages folder
- 4. Conclusion
- 5. Further reading
1. Introduction
If your company has multiple dbt p...
What do Snowflake, Databricks, Redshift, BigQuery actually do?
- 1. Introduction
- 2. Analytical databases aggregate large amounts of data
- 3. Most platforms enable you to do the same thing but have different strengths
- 3.1. Understand how the platforms process data
- 3.1.1. A compute engine is a ...
Data Engineering Interview Preparation Series #2: System Design
- 1. Introduction
- 2. Guide the interviewer through the process
- 2.1. [Requirements gathering] Make sure you clearly understand the requirements & business use case
- 2.2. [Understand source data] Know what you have to work with
- 2.3...
How to turn a 1000-line messy SQL into a modular, & easy-to-maintain data pipeline?
1. Introduction
If you’ve been in the data space long enough, you would have come across really long SQL scripts that someone had written years ago. However, no one dares to touch them, as they may be powering some i...
How to ensure consistent metrics in your warehouse
1. Introduction
If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist.
The root cause of these metric deviations often st...
Should Data Pipelines in Python be Function based or Object-Oriented (OOP)?
- 1. Introduction
- 2. Data transformations as functions lead to maintainable code
- 3. Objects help track things (aka state)
- 4. Class lets you define reusable code and pipeline patterns
- 5. Functional code uses objects v...
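The outline's claim that transformations-as-functions lead to maintainable code can be sketched in a few lines: each step is a pure function from rows to rows, and the pipeline is just composition. Function names here are illustrative, not from the article.

```python
# Function-based pipeline sketch: pure transformations composed in order.

def drop_nulls(rows):
    # Each step takes rows and returns new rows; no shared state.
    return [r for r in rows if r["amount"] is not None]

def to_dollars(rows):
    return [{**r, "amount": r["amount"] / 100} for r in rows]

def run_pipeline(rows, steps):
    for step in steps:  # composition is the whole orchestration
        rows = step(rows)
    return rows

raw = [{"id": 1, "amount": 1050}, {"id": 2, "amount": None}]
out = run_pipeline(raw, [drop_nulls, to_dollars])
print(out)  # [{'id': 1, 'amount': 10.5}]
```

Because each step is independent of the others, steps can be unit-tested and reordered without touching a class hierarchy.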
How to quickly deliver data to business users? #1. Adv Data types & Schema evolution
- 1. Introduction
- 2. Use Schema evolution & advanced data types to quickly deliver new columns to the end-user
- 3. Create systems to effectively leverage schema evolution
- 3.1. Auto schema evolution is high-risk...
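One way to read the "auto schema evolution is high-risk" point is that only additive changes (new columns) should evolve automatically, while drops and type changes — which silently break downstream readers — should fail loudly. A sketch under that assumption, with illustrative column names and types:

```python
# Guarded schema evolution sketch: accept new columns, reject drops and
# type changes. Column names/types are illustrative.

def evolve_schema(current: dict, incoming: dict) -> dict:
    for col, dtype in current.items():
        if col not in incoming:
            raise ValueError(f"column dropped upstream: {col}")
        if incoming[col] != dtype:
            raise ValueError(f"type changed for {col}: {dtype} -> {incoming[col]}")
    # Only net-new columns survive the checks above; add them to the table.
    return {**current, **incoming}

current = {"id": "bigint", "amount": "int"}
incoming = {"id": "bigint", "amount": "int", "discount": "int"}

evolved = evolve_schema(current, incoming)
print(evolved)  # {'id': 'bigint', 'amount': 'int', 'discount': 'int'}
```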
How to Manage Upstream Schema Changes in Data Driven Fast Moving Company
- 1. Introduction
- 2. Strategies for data teams to handle changing schemas
- 3. Conclusion
- 4. Recommended reading
1. Introduction
If you have worked at a company that moves fast (or claims to), you’ve inevitably had to deal w...
Visual Studio Code (VSCode) extensions for data engineers
- 1. Introduction
- 2. Python environment setup
- 3. VSCode Primer
- 4. Extensions overview
- 5. Privacy, Performance, and Cognitive Overload
- 6. Conclusion
- 7. Recommended reading
1. Introduction
Whether you are setting up visual studio co...
How to create an SCD2 Table using MERGE INTO with Spark & Iceberg
- 1. Introduction
- 2. MERGE INTO is used to UPDATE/DELETE/INSERT rows into a target table based on data in the source table
- 3. SCD2 table pipeline: INSERT new data, UPDATE existing data, and DELETE stale data
- 4. Conclusion
- 5. R...
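Spark + Iceberg's MERGE INTO can't run here, so the SCD2 behavior the outline describes — expire the current row when a tracked attribute changes, then insert a new current version — is sketched below in plain Python. The `id`/`state`/`is_current` columns are illustrative stand-ins for a real dimension table.

```python
# SCD2 merge semantics in plain Python: on attribute change, UPDATE the old
# version (is_current = False) and INSERT a new current row — the same
# UPDATE/INSERT pair a MERGE INTO statement expresses declaratively.

def scd2_merge(target: list, source: dict) -> None:
    for row in target:
        changed = (
            row["id"] == source["id"]
            and row["is_current"]
            and row["state"] != source["state"]
        )
        if changed:
            row["is_current"] = False  # UPDATE: expire the old version
    if not any(r["id"] == source["id"] and r["is_current"] for r in target):
        # INSERT: brand-new id, or an id whose current row was just expired
        target.append({**source, "is_current": True})

dim = [{"id": 1, "state": "NY", "is_current": True}]
scd2_merge(dim, {"id": 1, "state": "CA"})

print(dim)
# [{'id': 1, 'state': 'NY', 'is_current': False},
#  {'id': 1, 'state': 'CA', 'is_current': True}]
```

A real SCD2 row would also carry effective-from/to timestamps; they are omitted here to keep the expire-and-insert mechanics visible.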