Redshift

10x faster, simple, cost-effective serverless data warehouse service


Overview

  • Fast, serverless, petabyte-scale data warehouse service.

  • Supports standard SQL and Business Intelligence (BI) tools, allowing complex analytic queries against terabytes to petabytes of structured and semi-structured data with massively parallel query execution.

Features

Redshift SQL

Redshift Spectrum

  • Query structured and semi-structured data directly in S3 without loading it into Redshift tables.

  • Limitless concurrency.

  • Supports Gzip and Snappy compression.

  • Best practices:

    ✅ Use Parquet or ORC format (not raw CSV or JSON).

    ✅ Compress with Snappy, not Gzip.

    ✅ Partition data based on common query predicates (e.g., date, region).

    ✅ Keep file sizes between 100 MB and 500 MB for efficient parallelism.
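As a rough sketch of these practices: the external schema, table, S3 bucket, and IAM role names below are placeholders, not resources from this document, and the Glue Data Catalog is assumed as the metastore.

```sql
-- Hypothetical external schema backed by the AWS Glue Data Catalog;
-- database, bucket, and IAM role names are made up.
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Parquet files partitioned by sale_date, following the partitioning advice above.
CREATE EXTERNAL TABLE spectrum_demo.sales (
    order_id BIGINT,
    region   VARCHAR(16),
    amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';

-- Register a partition, then query it; the partition predicate limits how much S3 data is scanned.
ALTER TABLE spectrum_demo.sales
ADD PARTITION (sale_date = '2024-01-01')
LOCATION 's3://my-example-bucket/sales/sale_date=2024-01-01/';

SELECT region, SUM(amount) AS total
FROM spectrum_demo.sales
WHERE sale_date = '2024-01-01'
GROUP BY region;
```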

Redshift Data Sharing

Redshift data sharing allows you to share live data across Redshift clusters without having to duplicate the data.
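A minimal sketch of the producer/consumer flow, assuming RA3 or serverless namespaces; the share name, schema, table, and namespace GUIDs are invented for illustration.

```sql
-- On the producer cluster: create a datashare and add objects to it.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;

-- Grant the consumer namespace (placeholder GUID) access to the share.
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';

-- On the consumer cluster: create a database from the share (placeholder GUID is the
-- producer's namespace) and query the live data, with no copy or ETL involved.
CREATE DATABASE sales_from_producer
FROM DATASHARE sales_share OF NAMESPACE 'ffffffff-1111-2222-3333-444444444444';

SELECT COUNT(*) FROM sales_from_producer.public.sales;
```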

Distribution types

When you create a table, you can designate one of the following distribution styles: AUTO, EVEN, KEY, or ALL.

  • AUTO: Redshift assigns an optimal distribution style based on the size of the table data. Redshift initially assigns the ALL distribution style to a small table. When the table grows larger, Amazon Redshift might change the distribution style to KEY. If the table grows larger and none of the columns are suitable to be the distribution key, Amazon Redshift changes the distribution style to EVEN.

  • Rule of thumb: ALL → KEY → EVEN as a table grows from small to large (see the sketch below).
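To make the styles concrete, a minimal sketch with invented table and column names (AUTO is what you get when no DISTSTYLE is declared):

```sql
-- Small dimension table: ALL keeps a full copy on every node, so joins never move it.
CREATE TABLE dim_region (
    region_id   INT,
    region_name VARCHAR(32)
)
DISTSTYLE ALL;

-- Large fact table: KEY co-locates rows that share the join/aggregation column.
CREATE TABLE fact_sales (
    order_id  BIGINT,
    region_id INT,
    amount    DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (region_id);

-- Large table with no dominant join key: EVEN spreads rows round-robin across slices.
CREATE TABLE raw_events (
    event_id BIGINT,
    payload  VARCHAR(65535)
)
DISTSTYLE EVEN;
```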

Best practices

  1. Choose the appropriate distribution style: Selecting the right distribution style for your tables (AUTO, EVEN, KEY, or ALL) can optimize performance by balancing data distribution and minimizing data movement across nodes.

  2. Use sort keys strategically: Implementing sort keys effectively can enhance query performance by helping Redshift minimize the amount of data it needs to scan (see the sketch after this list).

  3. Monitor and manage workload: Regularly review and adjust your queries and workloads. Utilize Redshift's workload management features to ensure optimal performance.

  4. Regular vacuum and analyze operations: To maintain efficient storage and query performance, periodically run VACUUM and ANALYZE operations to reorder and clean up tables.

  5. Leverage compression encodings: Use appropriate compression encodings to reduce the size of your data storage, improving I/O efficiency and query speed.
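A minimal sketch tying together items 2, 4, and 5; the table, columns, and encoding choices are illustrative, not prescriptive.

```sql
-- Sort on the column most queries filter by, and choose per-column compression
-- encodings (AZ64 suits numeric/date types, ZSTD suits free-form text).
CREATE TABLE web_requests (
    request_time TIMESTAMP    ENCODE az64,
    status_code  SMALLINT     ENCODE az64,
    user_agent   VARCHAR(512) ENCODE zstd
)
SORTKEY (request_time);

-- Range filters on the sort key let Redshift skip whole blocks during the scan.
SELECT status_code, COUNT(*) AS requests
FROM web_requests
WHERE request_time >= '2024-01-01' AND request_time < '2024-02-01'
GROUP BY status_code;

-- Periodic maintenance: reclaim space and restore sort order, then refresh planner statistics.
VACUUM web_requests;
ANALYZE web_requests;
```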

Trivia

  • Keyword association: "petabyte-scale" → Redshift.

  • Redshift offers two forms of semistructured data support: the SUPER data type and Amazon Redshift Spectrum.

    • Use the SUPER data type if you need to insert or update small batches of JSON data with low latency (see the sketch after this list).

    • Use Redshift Spectrum with an open file format if your data query requires integration with other AWS services and with data mainly stored in Amazon S3 for archival purposes.
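A minimal sketch of the SUPER route, using a made-up table and JSON payload:

```sql
-- SUPER column stores semi-structured JSON alongside regular columns.
CREATE TABLE event_log (
    event_id BIGINT,
    payload  SUPER
);

-- JSON_PARSE turns a JSON string into a SUPER value at insert time.
INSERT INTO event_log VALUES
    (1, JSON_PARSE('{"customer": {"id": 42, "region": "eu-west-1"}, "action": "login"}'));

-- PartiQL dot notation navigates the nested structure directly in SQL.
SELECT event_id, payload.customer.region, payload.action
FROM event_log
WHERE payload.customer.id = 42;
```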

Concepts

  • RA3: node type that uses a separate compute and storage architecture, which allows you to scale compute and storage independently. RA3 clusters seamlessly integrate with Amazon S3 for data storage, enabling you to leverage S3’s durability, scalability, and cost-effectiveness.

  • Open File Formats in Redshift Spectrum

    Amazon Redshift Spectrum extends the analytic capabilities of Amazon Redshift beyond data stored on local disks, allowing you to run queries against exabytes of data in Amazon S3 without loading or transforming the data. Redshift Spectrum supports a variety of open file formats, providing flexibility and efficiency when querying large datasets.

    Key Open File Formats Supported:

    • Apache Parquet: A columnar storage file format optimized for use with complex data. It provides efficient data compression and encoding schemes.

    • Apache ORC: Another columnar format that is highly optimized for storing Hive data but also works well with Redshift Spectrum.

    • Apache Avro: A row-based storage format suitable for data exchange, providing rich data structures and schemas.

    • Text formats (CSV, TSV): Simplified formats for smaller, less complex datasets, commonly used for data export and import.

    These formats are integral to enhancing performance and cost efficiency when analyzing large-scale datasets stored in S3 using Redshift Spectrum, as they help minimize data movement and optimize query execution.
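    As a rough sketch of how the file format is declared per external table (reusing the hypothetical spectrum_demo schema and bucket from the Spectrum example above):

```sql
-- Text format: row-based CSV, declared with a delimiter.
CREATE EXTERNAL TABLE spectrum_demo.clicks_csv (
    click_id BIGINT,
    url      VARCHAR(2048)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/clicks_csv/';

-- Columnar format: same logical columns, stored as ORC for compression and column pruning.
CREATE EXTERNAL TABLE spectrum_demo.clicks_orc (
    click_id BIGINT,
    url      VARCHAR(2048)
)
STORED AS ORC
LOCATION 's3://my-example-bucket/clicks_orc/';
```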
