AWS
DevOps
  • knowledge
    • glossary
    • network knowledge
      • CIDR Block
      • OSI
      • List of Ports
      • Network model
    • AWS best practices
      • Least privilege principle
      • Support Plan
      • Well-architected framework
        • Well-architected framework
        • Cost optimization
        • Operational Excellence
        • Performance efficiency
        • Reliability
        • Security
    • Exams
      • DOP-C02
        • DOP-C02 topics
        • DOP-C02 Labs
      • DVA-C02
      • SOA-C02
  • services
    • access management
      • Directory Service
      • IAM
        • PassRole
      • IAM Identity Center (SSO)
      • Organizations
        • Organizational Unit
        • Control Tower
      • AD Domain Service
    • analytics
      • data analytic
        • Athena
        • QuickSight
        • Redshift
      • data collection
        • Data Lake
        • Lake Formation
      • data processing
        • EMR
        • Kinesis
        • Glue
          • Glue Data Catalog
      • OpenSearch
    • compute
      • Batch
      • EC2
        • Auto Scaling
        • AMI
        • ELB
          • Global accelerator
        • Security Group
        • EBS
        • EC2 Instance Store
        • Spot Fleet
      • Elastic Beanstalk
      • Lambda
        • Layer
        • Lambda API
      • Outposts
      • Wavelength
      • SAM
      • VMWare Cloud
    • container
      • Copilot
      • ECR
      • ECS
        • ECS Anywhere
      • EKS
        • EKS Anywhere
        • EKS Distro
      • Fargate
    • cost management
      • Budgets
      • Cost Explorer
      • Saving Plans
      • Compute Optimizer
    • database
      • Data Engineer
      • Document DB
      • DynamoDB
        • DynamoDB API
        • Scan
      • ElastiCache
      • Keyspaces
      • MemoryDB for Redis
      • Neptune
      • Quantum Ledger Database
      • RDS
        • Aurora
          • Aurora Global Database
          • Aurora Serverless
      • Timestream
    • devTools
      • CICD
        • CodeArtifact
        • CodeCommit
        • CodeBuild
        • CodeDeploy
        • CodePipeline
      • CloudFormation
      • CodeGuru
      • CodeStar
      • CodeWhisperer
      • X-Ray
      • Deployment strategies
    • finance
      • Cost explorer
    • integration
      • AppFlow
      • AppSync
      • EventBridge
      • MQ
      • SNS
      • SQS
      • Step Functions
      • SWF
    • management
      • AppConfig
      • AWS Backup
      • AWS CDK
      • Config
      • Grafana
      • Health Dashboard
      • Proton
      • Service Catalog
      • System Manager
      • SSM
      • Resource Group
      • OpsWorks (discontinued)
    • media
      • Elemental MediaConvert
      • Transcoder
    • messaging
      • SES
    • migration
      • Application Migration Service
      • DataSync
      • DMS
      • Migration Evaluator
      • Migration Hub
      • Server Migration Service
      • Snow Family
      • Transfer Family
    • ML
      • Comprehend
      • Forecast
      • Kendra
      • Lex
      • Rekognition
      • SageMaker
        • SageMaker Data Wrangler
        • SageMaker ML Lineage Tracking
    • monitoring
      • CloudTrail
      • CloudWatch
      • TrustedAdvisor
    • networking
      • CloudFront
      • Customer gateway
      • Edge Location
      • hybrid networking
        • Direct Connect
          • Direct Connect Gateway
        • Site-to-site VPN
      • PrivateLink
      • Region
        • AZ
      • Route 53
      • Transit Gateway
      • VPC
        • VPC Lattice
        • Subnet
          • NACL
        • Internet Gateway
        • Network Firewall
        • VPN
        • NAT Gateway
      • Virtual Private Gateway
    • security
      • Artifact
      • ACM
      • CloudHSM
      • Cognito
      • Detective
      • Firewall Manager
      • GuardDuty
      • Inspector
      • KMS
      • Macie
      • Network Firewall
      • Resource Access Manager
      • Security Hub
      • Secret Manager
      • Secret Hub
      • Shield
      • STS
      • Trusted Advisor
      • WAF
    • storage
      • Backup
      • EBS
      • EFS
      • FSx
      • S3
        • S3 Glacier
        • S3 Snippet
        • S3 Mountpoint
      • Snow family
      • Storage gateway
      • WorkDocs
    • web & mobile
      • Amplify
      • API Gateway
      • Device Farm
      • Pinpoint
Powered by GitBook
On this page
  • Data Engineer Lifecycle
  • Data lifecycle
  • Five V of data
  • Data modeling
  • Services for transformation
  • Glue vs EMR
  1. services
  2. database

Data Engineer

PreviousdatabaseNextDocument DB

Last updated 8 months ago

Data Engineer Lifecycle

Data lifecycle

Five V of data

  • Variety: can store many types of data structured or unstructured, semi-structured, image, video...

    • RDS

    • S3

    • Glue

    • Comprehend

  • Volume

  • Velocity (vận tốc)

    • Kinesis

    • Lambda

  • Veracity/Validity

  • Value

Data modeling

OLTP (Online Transaction Processing)
OLAP (Online Analytical Processing)

Latest state of data

Latest state of historical data

Normalization & 3rd normal

Normalization can cause lowness

Optimize for point queries

Query latency matter

Latency not as important

Optimize for GROUP BY

Common Table Expression (CTEs) can cause latency

Use CTEs instead of sub-queries

Services for transformation

Computing
Distributed computing

EMR

EMR

Lambda

Batch

Glue

Steps function

Redshift

Glue vs EMR

EMR
Glue

full feature, distributed Hadoop environment

additional framework & hardware

  • Data cleaning

  • Enrichment

  • Movement

fully managed

ETL
source: AWS skillbuilder
source: AWS skillbuilder