Glue
serverless ETL service
Overview
A fully managed ETL (Extract, Transform, Load) service that can extract data from various sources, transform it into the required format, and load it into a target data store.
Use cases
Prepare and transform data for analytics.

Features
Glue - Convert data into Parquet format

Glue - Data Crawler: Catalog a dataset
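As a rough illustration of cataloging a dataset, the sketch below builds the parameters for a crawler that scans an S3 prefix and then creates and starts it with boto3. The crawler name, role ARN, database name, and bucket path are hypothetical placeholders, and the helper functions are illustrative names, not part of any AWS library.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: Dict[str, Any]) -> None:
    """Create the crawler, then start it (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `create_and_start_crawler(request)` against a real account would register the discovered tables in the given Data Catalog database once the crawl finishes.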

Glue - Job bookmark
Prevents reprocessing of old data: Glue tracks what a job has already processed, so a later run can resume from where the previous one left off.
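A minimal sketch of turning bookmarks on for a single run, assuming a Glue job already exists. The job name is a hypothetical placeholder and the helper functions are illustrative. The `--job-bookmark-option` argument accepts `job-bookmark-enable`, `job-bookmark-disable`, or `job-bookmark-pause`.

```python
from typing import Any, Dict


def build_job_run_request(job_name: str, use_bookmark: bool = True) -> Dict[str, Any]:
    """Build keyword arguments for the Glue StartJobRun API call."""
    option = "job-bookmark-enable" if use_bookmark else "job-bookmark-disable"
    return {"JobName": job_name, "Arguments": {"--job-bookmark-option": option}}


def start_job(request: Dict[str, Any]) -> str:
    """Start the job run and return its run id (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue")
    response = glue.start_job_run(**request)
    return response["JobRunId"]


# Hypothetical job name, for illustration only.
run_request = build_job_run_request("nightly-sales-etl")
```

For the bookmark to advance, the job script itself also has to cooperate: it should pass a `transformation_ctx` to its sources and call `job.commit()` at the end of the run.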
Glue - Studio
GUI for creating, running, and monitoring ETL jobs.
Glue - DataBrew
Glue DataBrew is a visual data preparation tool that allows you to clean, transform, and enrich data without writing code.
Features:
Data profiling: automatically analyzes datasets to provide insights such as missing values, outliers, and data distributions.
Pre-built transformations: offers over 250 transformations, such as filtering, normalizing, and deduplication.
Custom rules: lets you define custom data quality rules to enforce specific standards (e.g., detecting PII or ensuring data completeness).
Integration: works with S3, Redshift, and Glue ETL.
Use cases:
Cleaning raw data from sources like S3 or databases.
Detecting and handling PII (Personally Identifiable Information).
Preparing data for machine learning or analytics in tools like Amazon Athena or Redshift.
Glue - Streaming ETL
Built on Apache Spark Structured Streaming
Compatible with Amazon Kinesis Data Streams, Apache Kafka, and Amazon MSK (Managed Streaming for Apache Kafka)
Best practices
Trivia
Data transformation = AWS Glue.
Concepts
Parquet format: a columnar storage file format optimized for use with big data processing frameworks.
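To make "columnar" concrete, the toy sketch below pivots row-oriented records into one list per column. It is not the Parquet format itself (no encoding, compression, or file layout), only an illustration of why an analytic query that touches one column can skip the others. The sample records are made up, and the helper assumes every record has the same keys.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record keeps all of its fields together.
rows: List[Dict[str, Any]] = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.0},
    {"id": 3, "country": "DE", "amount": 5.5},
]


def to_columnar(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot records into one list of values per column name."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columnar(rows)

# An aggregate over one column only needs that column's values.
total_amount = sum(columns["amount"])
```

Here `columns["amount"]` holds only the three amounts, so summing them never reads `id` or `country`.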