Data Engineer
Data Engineer Lifecycle

Data lifecycle

Five Vs of data
Variety: many types of data can be stored: structured, semi-structured, or unstructured (image, video...)
RDS
S3
Glue
Comprehend
Volume
Velocity (speed)
Kinesis
Lambda
Veracity/Validity
Value
Data modeling
OLTP (Online Transaction Processing)
Latest state of data
Normalization & 3rd normal form (3NF)
Optimize for point queries
Query latency matters
Common Table Expressions (CTEs) can cause latency

OLAP (Online Analytical Processing)
Latest state of historical data
Normalization can cause slowness
Latency not as important
Optimize for GROUP BY
Use CTEs instead of sub-queries
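
To make the OLTP and OLAP notes above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their rows are hypothetical and not part of these notes. It shows a point query by primary key (the OLTP pattern), then the same GROUP BY aggregate (the OLAP pattern) written once with a sub-query and once with a CTE.

```python
import sqlite3

# In-memory database with a small normalized schema (hypothetical example data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# OLTP-style point query: fetch a single row by its primary key.
point_row = conn.execute(
    "SELECT id, customer_id, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style aggregate, written with a sub-query in the FROM clause.
totals_subquery = conn.execute("""
    SELECT c.name, t.total
    FROM customers AS c
    JOIN (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) AS t ON t.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# The same aggregate, with the sub-query pulled out into a named CTE.
totals_cte = conn.execute("""
    WITH totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT c.name, totals.total
    FROM customers AS c
    JOIN totals ON totals.customer_id = c.id
    ORDER BY c.name
""").fetchall()

print(point_row)
print(totals_subquery)
print(totals_cte)
```

Both aggregate queries return the same rows. The CTE form mainly helps readability, and whether it costs extra latency depends on the database engine, so treat this as an illustration rather than a benchmark.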
Services for transformation
Computing
Distributed computing
EMR
EMR
Lambda
Batch
Glue
Step Functions
Redshift
Glue vs EMR
EMR
full-featured, distributed Hadoop environment
additional framework & hardware

Glue
fully managed ETL
Data cleaning
Enrichment
Movement
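
The three ETL tasks listed above (data cleaning, enrichment, movement) can be sketched in plain Python. This is not the Glue API. The function names, the lookup table, and the sample rows are all hypothetical and only illustrate what each step does.

```python
import csv
import io

# Hypothetical lookup table used for the enrichment step.
COUNTRY_BY_CODE = {"VN": "Vietnam", "US": "United States"}


def clean(rows):
    """Data cleaning: strip whitespace and drop rows with an empty id."""
    cleaned = []
    for row in rows:
        stripped = {key: value.strip() for key, value in row.items()}
        if stripped["id"]:
            cleaned.append(stripped)
    return cleaned


def enrich(rows):
    """Enrichment: add a country name looked up from the country code."""
    return [
        {**row, "country": COUNTRY_BY_CODE.get(row["country_code"], "Unknown")}
        for row in rows
    ]


def move(rows, sink):
    """Movement: write the rows as CSV to a destination file-like object."""
    writer = csv.DictWriter(sink, fieldnames=["id", "country_code", "country"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical raw input: one padded row, one row without an id, one unknown code.
raw = [
    {"id": " 1 ", "country_code": "VN "},
    {"id": "", "country_code": "US"},
    {"id": "2", "country_code": "XX"},
]

sink = io.StringIO()
move(enrich(clean(raw)), sink)
result = sink.getvalue()
print(result)
```

In a managed service the sink would typically be a store such as S3 or Redshift instead of an in-memory buffer. The buffer is used here only to keep the sketch self-contained.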