Content Recommender

Data Engineering Project: Stream Edition

https://www.startdataengineering.com/post/data-engineering-project-for-beginners-stream-edition/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:29

Status: ✓ Success

Text length: 10000 characters

Data Engineering Project: Stream Edition - 1. Introduction - 2. Sample project - 3. Streaming concepts - 4. Future work - 5. Conclusion - 6. Further reading - 7. References 1. Introduction Stream processing differs from batch; one needs to be mindful of the system’s memory, event order, and system r...

Data Engineering Best Practices - #1. Data flow & Code

https://www.startdataengineering.com/post/de_best_practices/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:28

Status: ✓ Success

Text length: 10000 characters

Data Engineering Best Practices - #1. Data flow & Code - 1. Introduction - 2. Sample project - 3. Best practices - 3.1. Use standard patterns that progressively transform your data - 3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks) - 3.3. Avoid data duplicates ...

What is a self-serve data platform & how to build one

https://www.startdataengineering.com/post/self-serve-data-platform/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:28

Status: ✓ Success

Text length: 7423 characters

What is a self-serve data platform & how to build one - 1. Introduction - 2. What is self-serve? - 3. Building a self-serve data platform - 4. Conclusion - 5. Further reading - 6. References 1. Introduction Most companies want to build a self-serve data platform. But what does a self-serve data plat...

What is an Open Table Format? & Why to use one?

https://www.startdataengineering.com/post/what_why_table_format/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:27

Status: ✓ Success

Text length: 10000 characters

What is an Open Table Format? & Why to use one? - 1. Introduction - 2. What is an Open Table Format (OTF) - 3. Why use an Open Table Format (OTF) - 4. Conclusion - 5. Further reading - 6. References 1. Introduction If you are in the data space, you might have heard of open table formats such as Apac...

6 Steps to Avoid Messy Data in Your Warehouse

https://www.startdataengineering.com/post/n-steps-avoid-messy-dw/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:27

Status: ✓ Success

Text length: 10000 characters

6 Steps to Avoid Messy Data in Your Warehouse - 1. Introduction - 2. Six Steps for a Clean Data Warehouse - 2.1. Understand the business - 2.2. Make data easy to use with the appropriate data model - 2.3. Good input data is necessary for a good data warehouse - 2.4. Define Source of Truth (SOT) and ...

Uplevel your dbt workflow with these tools and techniques

https://www.startdataengineering.com/post/uplevel-dbt-workflow/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:26

Status: ✓ Success

Text length: 10000 characters

Uplevel your dbt workflow with these tools and techniques - 1. Introduction - 2. Setup - 3. Ways to uplevel your dbt workflow - 3.1. Reproducible environment - 3.2. Reduce feedback loop time when developing locally - 3.3. Reduce the amount of code to write using dbt packages - 3.4. Validate data bef...

Data Engineering Best Practices - #2. Metadata & Logging

https://www.startdataengineering.com/post/de_best_practices_log/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:25

Status: ✓ Success

Text length: 10000 characters

Data Engineering Best Practices - #2. Metadata & Logging - 1. Introduction - 2. Setup & Logging architecture - 3. Data Pipeline Logging Best Practices - 3.1. Metadata: Information about pipeline runs, & data flowing through your pipeline - 3.2. Obtain visibility into the code’s execution sequence us...

How to test PySpark code with pytest

https://www.startdataengineering.com/post/test-pyspark/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:24

Status: ✓ Success

Text length: 9537 characters

How to test PySpark code with pytest - 1. Introduction - 2. Ensure the code’s logic is working as expected with tests - 3. Conclusion - 4. Further Reading - 5. References 1. Introduction Have you worked, or are you working with a code base that “moved fast” but had zero to no tests? Every minor feat...

Docker Fundamentals for Data Engineers

https://www.startdataengineering.com/post/docker-for-de/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:24

Status: ✓ Success

Text length: 8439 characters

Docker Fundamentals for Data Engineers 1. Introduction Docker can be overwhelming to start with. Most data projects use Docker to set up the data infra locally (and often in production). Setting up data tools locally without Docker is (usually) a nightmare! The official docker documentation, while ex...

How to reduce your Snowflake cost

https://www.startdataengineering.com/post/optimize-snowflake-cost/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:23

Status: ✓ Success

Text length: 10000 characters

How to reduce your Snowflake cost - 1. Introduction - 2. Snowflake pricing and settings inheritance model - 3. Strategies to reduce Snowflake cost - 4. Conclusion - 5. Read more about using Snowflake - 6. References 1. Introduction Most data engineers love Snowflake, it is easy to get started, there...

Building Cost Efficient Data Pipelines with Python & DuckDB

https://www.startdataengineering.com/post/cost-effective-pipelines/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:22

Status: ✓ Success

Text length: 10000 characters

Building Cost Efficient Data Pipelines with Python & DuckDB - 1. Introduction - 2. Project demo - 3. TL;DR - 4. Considerations when building pipelines with DuckDB - 4.1. ⭐ Use DuckDB to process data, not for multiple users to access data - 4.2. ✅ Cost calculation: DuckDB + Ephemeral VMs = dirt cheap...

Enable stakeholder data access with Text-to-SQL RAGs

https://www.startdataengineering.com/post/data-democratize-llm/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:22

Status: ✓ Success

Text length: 10000 characters

Enable stakeholder data access with Text-to-SQL RAGs - 1. Introduction - 2. TL;DR - 3. Enabling Stakeholder data access with RAGs - 3.1. Set up - 3.2. Loading: Read raw data and convert them into LlamaIndex data structures - 3.3. Indexing: Generate & store numerical representation of your data - 3.4...

dbt(Data Build Tool) Tutorial

https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:21

Status: ✓ Success

Text length: 10000 characters

dbt(Data Build Tool) Tutorial 1. Introduction If you are a student, analyst, engineer, or anyone in the data space and are curious about what dbt is and how to use it, then this post is for you. If you are keen to understand why dbt is widely used, please read this article. 2. Dbt, the T in ELT In ...

Build Data Engineering Projects, with Free Template

https://www.startdataengineering.com/post/data-engineering-projects-with-free-template/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:20

Status: ✓ Success

Text length: 6809 characters

Build Data Engineering Projects, with Free Template - 1. Introduction - 2. Run Data Pipeline - 3. Architecture and services in this template - 4. CI/CD setup - 5. Putting it all together with a Makefile - 6. Data projects using other tools and services - 7. Conclusion - 8. Further reading - 9. Refer...

Python Essentials for Data Engineers

https://www.startdataengineering.com/post/python-for-de/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:20

Status: ✓ Success

Text length: 10000 characters

Python Essentials for Data Engineers - Introduction - Data is stored on disk and processed in memory - Practicing Python - Python basics - Python is used for extracting data from sources, transforming it, & loading it into a destination - [Extract & Load] Read and write data to any system - [Transfo...

Data Engineering Projects

https://www.startdataengineering.com/post/data-engineering-projects/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:19

Status: ✓ Success

Text length: 6702 characters

Data Engineering Projects 1. Introduction Whether you are new to data engineering or have been in the data field for a few years, one of the most challenging parts of learning new frameworks is setting them up! Data infra is notoriously hard to set up. You want to improve your skills on a specific t...

Data Engineering Project for Beginners - Batch edition

https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:19

Status: ✓ Success

Hacker News: 🟠 12 points, 3 comments

Text length: 10000 characters

Data Engineering Project for Beginners - Batch edition - 1. Introduction - 2. Objective - 3. Run Data Pipeline - 4. Architecture - 5. Code walkthrough - 6. Design considerations - 7. Next steps - 8. Conclusion - 9. Further reading - 10. References 1. Introduction An actual data engineering project u...

SQL or Python for Data Transformations?

https://www.startdataengineering.com/post/sql-v-python/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:18

Status: ✓ Success

Text length: 10000 characters

SQL or Python for Data Transformations? - 1. Introduction - 2. Code is an interface to the execution engine - 3. How to choose the execution engine and the coding interface - 4. Conclusion - 5. Further reading - 6. References 1. Introduction If you follow the data space, you would have noticed two c...

Why use Apache Airflow (or any orchestrator)?

https://www.startdataengineering.com/post/why-to-use-orchestrators/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:18

Status: ✓ Success

Text length: 8628 characters

Why use Apache Airflow (or any orchestrator)? - 1. Introduction - 2. Features crucial to building and maintaining data pipelines - 3. Conclusion - 4. Further reading 1. Introduction Are you trying to understand why someone would use a system like Airflow (or Dagster) to run simple scripts? If you ar...

How to implement data quality checks with greatexpectations

https://www.startdataengineering.com/post/implement_data_quality_with_great_expectations/

Domain: www.startdataengineering.com

Added: 2025-08-13 20:55:17

Status: ✓ Success

Text length: 8642 characters

How to implement data quality checks with greatexpectations - 1. Introduction - 2. Project overview - 3. Check your data before making it available to end-users; Write-Audit-Publish(WAP) pattern - 4. TL;DR: How the greatexpectations library works - 5. From an implementation perspective, there are fo...
