Data Engineer

Data Engineering Lifecycle

Source: AWS Skill Builder

Data lifecycle

Source: AWS Skill Builder

Five Vs of data

  • Variety: many types of data can be stored, including structured, semi-structured, and unstructured data (images, video, ...)

    • RDS

    • S3

    • Glue

    • Comprehend

  • Volume

  • Velocity (speed)

    • Kinesis

    • Lambda

  • Veracity/Validity

  • Value
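
The Velocity item pairs Kinesis with Lambda. As a rough illustration of that pairing, here is a minimal sketch of a Lambda handler that reads a batch of Kinesis records. It assumes the payloads are JSON; the `make_event` helper and the sensor fields are hypothetical and exist only so the handler can be tried locally.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and report how many were read."""
    readings = []
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded under record["kinesis"]["data"].
        payload = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers send JSON; other encodings need a different parser.
        readings.append(json.loads(payload))
    # A real pipeline would write the readings to a store here.
    return {"processed": len(readings), "readings": readings}


def make_event(items):
    """Hypothetical helper: build a fake Kinesis event with only the fields used above."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(item).encode()).decode()}}
            for item in items
        ]
    }


result = handler(make_event([{"sensor": "a", "temp": 21.5}, {"sensor": "b", "temp": 19.0}]), None)
```

Processing records in small batches as they arrive, rather than in a nightly job, is what lets this kind of setup keep up with fast-moving data.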

Data modeling

OLTP (Online Transaction Processing)

  • Latest state of data

  • Normalization, typically to third normal form (3NF)

  • Optimize for point queries

  • Query latency matters

  • Common Table Expressions (CTEs) can cause latency

OLAP (Online Analytical Processing)

  • Latest state and history of data

  • Normalization can cause slowness

  • Latency is not as important

  • Optimize for GROUP BY

  • Use CTEs instead of sub-queries
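
To make the contrast between point queries and GROUP BY queries concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for a real OLTP or OLAP engine, and the `customers` and `orders` tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A small normalized schema: customer details live in one place only.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Binh');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# OLTP-style point query: fetch a single row by its primary key.
point_row = conn.execute(
    "SELECT id, customer_id, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style query: a CTE names the aggregated result, which is then joined.
totals = conn.execute("""
    WITH order_totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT c.name, t.total
    FROM order_totals AS t
    JOIN customers AS c ON c.id = t.customer_id
    ORDER BY c.name
""").fetchall()

print(point_row)  # (2, 1, 5.0)
print(totals)     # [('Ana', 25.0), ('Binh', 12.5)]
```

The point query touches one row through the primary key, while the aggregate scans the whole table. The CTE keeps the aggregation readable compared with nesting the same SELECT as a sub-query.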

Services for transformation

Computing
Distributed computing

EMR

EMR

Lambda

Batch

Glue

Step Functions

Redshift

Glue vs EMR

EMR

  • Full-featured, distributed Hadoop environment

  • Additional frameworks & hardware

Glue

  • Fully managed ETL

  • Data cleaning

  • Enrichment

  • Movement
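
The three tasks listed for Glue (data cleaning, enrichment, movement) can be sketched in plain Python. This is not the Glue API: the function names, the sample rows, and the list standing in for a target data store are all hypothetical, and the sketch only shows what each step does to the data.

```python
def clean(rows):
    """Data cleaning: drop rows without an id and trim whitespace from names."""
    return [
        {**row, "name": row["name"].strip()}
        for row in rows
        if row.get("id") is not None
    ]


def enrich(rows, country_by_id):
    """Enrichment: attach a country looked up from a reference table."""
    return [{**row, "country": country_by_id.get(row["id"], "unknown")} for row in rows]


def move(rows, target):
    """Movement: load the rows into the target (a list standing in for a data store)."""
    target.extend(rows)
    return len(rows)


raw = [{"id": 1, "name": " Ana "}, {"id": None, "name": "ghost"}, {"id": 2, "name": "Binh"}]
warehouse = []
loaded = move(enrich(clean(raw), {1: "PT"}), warehouse)
```

In a managed ETL service the same steps would run as a job against real sources and targets; keeping each step a separate function makes them easy to test on their own.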
