As a data engineer/architect/director, I aim to create simple, efficient, and scalable data pipelines and architectures. “Simple is Better.” However, with the abundance of tools and abstractions in today's data ecosystem, achieving true simplicity can be challenging.
For instance, when designing a data lake architecture, it is tempting to reach for the latest table formats like Delta Lake, Iceberg (my current favourite), or Hudi for every use case. However, for many datasets a simple partitioned Parquet layout is enough, reducing complexity and maintenance overhead. Paired with a well-thought-out AWS S3 structure, it can be a pretty solid combo:
s3://your-data-lake-bucket/
├── raw/
│   ├── source1/
│   │   ├── YYYY/
│   │   │   ├── MM/
│   │   │   │   ├── DD/
│   │   │   │   │   └── data_files...
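As a rough illustration of how data could land in that layout, here is a minimal Python sketch. The bucket and source names are the hypothetical ones from the tree above, and it assumes pandas with pyarrow and s3fs installed, plus AWS credentials available in the environment:

```python
from datetime import date, datetime, timezone
import pandas as pd

# Hypothetical names matching the tree above.
BUCKET = "your-data-lake-bucket"
SOURCE = "source1"

def write_daily_partition(df: pd.DataFrame, run_date: date) -> str:
    """Write one day's data as Parquet under raw/<source>/YYYY/MM/DD/."""
    prefix = (
        f"s3://{BUCKET}/raw/{SOURCE}/"
        f"{run_date:%Y}/{run_date:%m}/{run_date:%d}"
    )
    # Timestamped file name so repeated runs for the same day don't collide.
    path = f"{prefix}/data_{datetime.now(timezone.utc):%H%M%S}.parquet"
    df.to_parquet(path, index=False)  # pandas delegates to pyarrow + s3fs
    return path

# Example usage with a tiny synthetic DataFrame.
df = pd.DataFrame({"event_id": [1, 2, 3], "value": [10.0, 20.5, 7.3]})
print(write_daily_partition(df, date.today()))
```

If you later want query engines like Athena or Spark to prune partitions automatically, the same idea works with Hive-style prefixes (year=YYYY/month=MM/day=DD) instead of the plain date folders shown here.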
We have powerful tools like Airflow, Spark, and dbt that abstract away much of the complexity. While these tools are good, it is worth understanding their underlying mechanisms to avoid creating a 'black box' architecture that is difficult to debug and optimize. Resist adding this complexity until there is a real need; Simple is Better.
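One lightweight habit that keeps the black box open: look at the plan the engine actually runs, not just the code you wrote. A small sketch with PySpark (assuming Spark 3.x is available; the data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

# Tiny in-memory dataset standing in for a real source table.
events = spark.createDataFrame(
    [(1, "click", 10.0), (2, "view", 3.5), (1, "click", 7.2)],
    ["user_id", "event_type", "value"],
)

totals = (
    events
    .filter(F.col("event_type") == "click")
    .groupBy("user_id")
    .agg(F.sum("value").alias("total_value"))
)

# explain() prints the parsed, optimized, and physical plans Spark will execute,
# making filters, shuffles, and aggregations visible instead of hidden.
totals.explain(mode="extended")
```

Reading the physical plan before scaling a job up is often enough to spot an unnecessary shuffle or a missed filter pushdown, without adding any new tooling.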
Elegant design and nimble data structures prevent future headaches.
Quality work brings rewards, such as the satisfaction of a job well done.
That is all for today -