Glue

serverless ETL service

Overview

  • A fully managed ETL (Extract, Transform, Load) service that can extract data from various sources, transform it into the required format, and load it into a target data store.

Use cases

  • Prepare & transform data for analytics.

ETL workload using Glue

Features

Glue - Convert data into Parquet format

convert to Parquet format

Glue - Data Crawler: Catalog a dataset

Catalog using Glue Data Crawler
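
To make the cataloging step concrete, here is a minimal sketch of the request a crawler definition might use. It only builds the keyword arguments in the shape that boto3's `create_crawler` accepts and does not call AWS. The crawler name, role ARN, database name, and S3 path are all hypothetical placeholders.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments shaped for boto3's glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# All values below are hypothetical placeholders.
request = crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```

Once the crawler has run, the inferred table schemas appear in the Data Catalog database named in the request.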

Glue - Job bookmark

Prevents reprocessing old data: Glue persists state from previous runs, so a job resumes from where it left off and processes only new data.
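
The idea behind a bookmark can be sketched in plain Python. This is a conceptual illustration only, not how Glue implements bookmarks: the `Bookmark` class, the `run_job` function, and the file names are hypothetical. In a real Glue job, bookmarks are switched on with the job argument `--job-bookmark-option` set to `job-bookmark-enable`.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    """Hypothetical stand-in for the state persisted between job runs."""
    processed: set = field(default_factory=set)


def run_job(available_files, bookmark):
    """Process only files not seen in earlier runs, then record them."""
    new_files = [f for f in available_files if f not in bookmark.processed]
    # A real job would transform and load new_files here.
    bookmark.processed.update(new_files)
    return new_files


bookmark = Bookmark()
first_run = run_job(["day1.csv", "day2.csv"], bookmark)
second_run = run_job(["day1.csv", "day2.csv", "day3.csv"], bookmark)

print(first_run)   # ['day1.csv', 'day2.csv']
print(second_run)  # ['day3.csv']
```

The second run skips the two files that the first run already handled, which is the behavior the bookmark exists to provide.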

Glue - Studio

GUI for creating, running, and monitoring ETL jobs.

Glue - DataBrew

  • Glue DataBrew is a visual data preparation tool that allows you to clean, transform, and enrich data without writing code.

  • Features:

    • Data profiling: automatically analyzes datasets to provide insights like missing values, outliers, and data distributions.

    • Pre-built transformations: offers over 250 transformations, such as filtering, normalizing, and deduplication.

    • Custom rules: allows you to define custom data quality rules to enforce specific standards (e.g., detecting PII or ensuring data completeness).

    • Integration: works with S3, Redshift, and Glue ETL.

  • Use cases:

    • Cleaning raw data from sources like S3 or databases.

    • Detecting and handling PII (Personally Identifiable Information).

    • Preparing data for machine learning or analytics in tools like Amazon Athena or Redshift.
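
The kind of summary that data profiling produces (missing values, outliers, basic distribution statistics) can be approximated in a few lines of Python. This is a rough sketch of the concept, not DataBrew's implementation: the `profile_column` function, the 1.5 × IQR outlier rule, and the sample data are assumptions chosen for illustration.

```python
import statistics


def profile_column(values):
    """Summarize one numeric column: missing count, range, mean, IQR outliers."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical sample column with two missing values and one implausible age.
ages = [34, 29, None, 41, 38, 250, 31, None, 36]
profile = profile_column(ages)

print(profile["missing"])   # 2
print(profile["outliers"])  # [250]
```

A profile like this is what makes the cleaning use cases actionable: the missing count and the outlier list point at the rows that need a transformation.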

Glue - Streaming ETL

  • Built on Apache Spark Structured Streaming

  • Compatible with Kinesis Data Streams, Apache Kafka, and Amazon MSK (Managed Streaming for Apache Kafka)

Best practices

Trivia

  • Data transformation = AWS Glue.

Concepts

  • Parquet format: a columnar storage file format optimized for use with big data processing frameworks.
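
To illustrate what "columnar" means, the sketch below pivots row-oriented records into one list per column using plain Python. It shows the layout idea only and does not read or write real Parquet files. The `to_columns` function and the sample records are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict holding one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.0},
    {"id": 3, "country": "DE", "amount": 30.0},
]

# Column-oriented layout: each field's values are stored together.
columns = to_columns(rows)

# An aggregate over one field touches only that column's values.
total = sum(columns["amount"])
print(total)  # 60.0
```

Because an analytic query usually reads a few columns out of many, storing each column contiguously lets the engine skip the rest, and runs of similar values compress well.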
