AWS Data & Analytics

Amazon Athena – Overview

  • Serverless, interactive query service to analyze data directly in Amazon S3 using SQL (Presto engine).
  • No ETL needed – Athena reads data in-place.
  • Pay-per-query: Charges based on data scanned (per TB).
  • Works well with:
    • AWS Glue Data Catalog
    • S3 data lakes
    • Structured and semi-structured data (CSV, JSON, ORC, Parquet, Avro)
    • Amazon CloudTrail logs, VPC Flow Logs, Application Logs

Amazon Athena – Performance Improvement Techniques

Athena performance is highly dependent on data format, partitioning, and query design.

✅ Best Practices

Technique | Benefit
1. Use Columnar Formats (Parquet, ORC) | Reduces data scanned; faster queries
2. Partition Your Data | Queries read only the relevant partitions (e.g., partition by year, month, region)
3. Compress Data (Snappy, Gzip) | Reduces I/O cost and scanning
4. Optimize File Size | Use files between 128 MB and 1 GB for parallelism
5. Use Glue Data Catalog | Predefine schemas; avoid schema inference overhead
6. Avoid SELECT * | Scan only required columns to minimize data read
7. Optimize Joins | Use broadcast joins for small datasets
8. Use CTAS (Create Table As Select) | Store query results in an optimized format for future queries
9. Use Workgroups | Set query limits, track performance and cost
10. Tune Query Filters | Push down predicates to avoid full-table scans

💡 Example:
Instead of:

SELECT * FROM sales_data;

Use:

SELECT region, total_sales FROM sales_data WHERE year=2025 AND region='APAC';
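Technique 8 (CTAS) is worth a concrete sketch: the query result is written back to S3 as partitioned Parquet, so later queries scan far less data. A minimal example that submits the CTAS statement through boto3; the table, database, and bucket names are made up:

import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and bucket names.
ctas_sql = """
CREATE TABLE sales_data_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/curated/sales_parquet/',
    partitioned_by = ARRAY['year']
)
AS SELECT region, total_sales, year FROM sales_data;
"""

athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)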

Amazon Athena – Federated Query

Athena can query data across multiple sources, not just S3, using Athena Federated Query connectors.

✅ Supported Data Sources:

  • Amazon RDS (MySQL, PostgreSQL)
  • Amazon Redshift
  • Amazon DynamoDB
  • On-premises databases (via JDBC)
  • Cloud applications (e.g., Salesforce)

🔹 How It Works:

  1. Deploy a Data Source Connector (Lambda function) from Athena Query Federation SDK.
  2. Connector translates SQL into source-specific query.
  3. Athena combines results with S3-based datasets.

🔹 Example Use Case:

  • You have:
    • Historical data in S3
    • Real-time transactions in Amazon RDS
  • Athena federates query:
SELECT s.customer_id, s.amount, r.last_login
FROM s3_sales s
JOIN rds_customers r
ON s.customer_id = r.customer_id
WHERE s.date > '2025-01-01';

✅ Result: A single query joins multiple data sources, serverless, no ETL.


📊 Athena Key Advantages

  • Serverless: No infrastructure to manage
  • Scalable: Auto-scales based on query size
  • Integrated with AWS ecosystem (Glue, QuickSight, Redshift Spectrum, Lake Formation)
  • Cost-effective for ad-hoc analytics

🔥 TL;DR:

Feature | Benefit
Athena | Query S3 data serverlessly using SQL
Performance Optimization | Use partitioning, Parquet/ORC, compression, selective queries
Federated Query | Query across S3, RDS, Redshift, DynamoDB, JDBC sources

 


1️⃣ Amazon Redshift – Overview

  • Amazon Redshift is a fully managed, petabyte-scale data warehouse service on AWS.
  • Designed for Online Analytical Processing (OLAP) workloads, i.e., complex queries on large datasets for analytics and reporting.
  • Based on PostgreSQL, optimized for columnar storage and parallel query execution.

2️⃣ OLTP vs OLAP

Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing – Redshift)
Purpose | Handles real-time transactions | Handles complex analytical queries
Operations | Insert, update, delete, read (CRUD) | Aggregate, filter, join, analyze large datasets
Data Volume | GBs | TBs to PBs
Normalization | Highly normalized | Denormalized for faster reads
Example Services | Amazon RDS (MySQL, PostgreSQL, Aurora) | Amazon Redshift, Athena, BigQuery
Latency | Millisecond transactions | Seconds to minutes for analytics queries

Redshift is OLAP → Ideal for BI, reporting, dashboards, and analytics.


3️⃣ Redshift Cluster

  • A cluster = One leader node + multiple compute nodes.
    • Leader Node: Receives queries, optimizes execution plan, distributes work.
    • Compute Nodes: Store data, process queries in parallel using Massively Parallel Processing (MPP).

Cluster Sizing:

  • Choose:
    • Node type: Dense Storage (HDD) or Dense Compute (SSD).
    • Number of nodes: More nodes = more compute & storage.
  • Scaling is not automatic: you resize the cluster (elastic or classic resize), or use RA3 nodes with managed storage and concurrency scaling.

4️⃣ Snapshots & Disaster Recovery (DR)

  • Snapshots:
    • Point-in-time backups stored in Amazon S3.
    • Automatic snapshots taken periodically.
    • Manual snapshots can be retained indefinitely.
  • Disaster Recovery (DR):
    • Restore snapshot to any Redshift cluster in any AWS region.
    • Enables cross-region replication for DR and data migration.
    • Snapshots are incremental, saving only changes since the last backup.

5️⃣ Loading Data into Redshift

  • Best Practice: Use bulk inserts (COPY command) from:
    • Amazon S3
    • Amazon EMR
    • Amazon DynamoDB
    • Remote hosts via SSH

Avoid many small inserts (INSERT INTO) because:

  • Redshift is optimized for batch loading.
  • Small inserts → performance degradation due to:
    • High transaction overhead
    • Skewed distribution
    • Need for frequent VACUUM and ANALYZE operations

💡 Preferred approach:

COPY sales
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV IGNOREHEADER 1;

  • Load large files (1 MB to 1 GB each) for parallel ingestion.
  • Compress data (gzip, bzip2) to speed up load time and reduce costs.

6️⃣ Redshift Spectrum

  • Redshift Spectrum allows querying exabytes of data in S3 without loading it into Redshift.
  • Uses the same SQL syntax as Redshift.
  • Data is stored in open formats (Parquet, ORC, CSV, JSON).
  • How it works:
    • Redshift queries data in cluster tables and S3 external tables (via Glue Data Catalog).
    • Data processed in parallel across thousands of Redshift-managed query nodes.
    • Combine hot data (in Redshift) with cold data (in S3) in a single query.

Example:

SELECT region, SUM(sales)
FROM spectrum.sales_s3
WHERE year = 2025
GROUP BY region;
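For the query above to run, an external schema (here called spectrum) must first be mapped to a Glue Data Catalog database. A hedged sketch using the Redshift Data API via boto3; the cluster, database, user, and IAM role names are placeholders:

import boto3

redshift_data = boto3.client("redshift-data")

# Placeholder cluster, database, user, and IAM role values.
redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE EXTERNAL SCHEMA spectrum
        FROM DATA CATALOG
        DATABASE 'sales_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
)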

✅ Benefits:

  • Cost-effective: Pay only for data scanned in S3.
  • Seamless integration: Avoids ETL for historical or infrequently accessed data.
  • Scalable: Query PB-scale datasets.

✅ Summary Table

Feature | Details
Workload type | OLAP – analytical queries on large datasets
Cluster | Leader + compute nodes, MPP architecture
Snapshots & DR | Automated backups, cross-region restore
Data Loading | Use COPY with large files; avoid small inserts
Redshift Spectrum | Query S3 data directly without loading it into the cluster

 


1️⃣ Amazon OpenSearch Service – Overview

  • Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed search and analytics engine.
  • Supports:
    • Full-text search
    • Log analytics
    • Real-time application monitoring
    • Interactive data visualization (via OpenSearch Dashboards)

✅ Key Capabilities:

  • Index and search structured & unstructured data.
  • Near real-time data ingestion.
  • Scalable and highly available (multi-AZ support).
  • Integrates easily with AWS data sources (DynamoDB, CloudWatch, S3, Kinesis, Lambda).

2️⃣ OpenSearch Pattern – DynamoDB → OpenSearch

Use Case:

Enable fast search on non-key attributes of DynamoDB data.

Architecture:

  1. DynamoDB Streams → Captures changes (INSERT, MODIFY, REMOVE).
  2. AWS Lambda function → Triggers on DynamoDB stream events.
  3. Lambda transforms and pushes data to Amazon OpenSearch Service index.

Flow:

DynamoDB → Streams → Lambda → OpenSearch Index → Search Query API

✅ Benefits:

  • DynamoDB is great for key-value lookups, but lacks complex text search.
  • OpenSearch provides:
    • Full-text search
    • Fuzzy matching
    • Aggregations & filtering

Example Lambda Pseudocode:

from opensearchpy import OpenSearch

# For a real domain, use SigV4 request signing (e.g. opensearchpy.AWSV4SignerAuth)
# and fine-grained access control instead of an open endpoint.
client = OpenSearch(
    hosts=[{'host': 'my-domain.region.es.amazonaws.com', 'port': 443}],
    use_ssl=True
)

def lambda_handler(event, context):
    for record in event['Records']:
        doc_id = record['dynamodb']['Keys']['id']['S']
        if record['eventName'] == 'REMOVE':
            # Keep the index in sync when items are deleted from the table
            client.delete(index="products", id=doc_id)
        else:
            # NewImage is in DynamoDB JSON ({'name': {'S': 'value'}, ...});
            # unmarshal it first if you want plain JSON documents
            document = record['dynamodb']['NewImage']
            client.index(index="products", id=doc_id, body=document)

3️⃣ OpenSearch Pattern – CloudWatch Logs → OpenSearch

Use Case:

Enable log analytics and visualization (similar to ELK stack) for operational monitoring.

Architecture:

  1. CloudWatch Logs capture application/system logs.
  2. CloudWatch Logs Subscriptions send data to:
    1. AWS Lambda → transforms data → OpenSearch
    2. OR Kinesis Firehose → delivers data to OpenSearch in bulk.
  3. OpenSearch Dashboards visualize logs in near real-time.

Flow:

CloudWatch Logs → (Lambda/Kinesis) → OpenSearch Index → Dashboards

✅ Benefits:

  • Centralized log search for multi-service apps.
  • Pattern matching, anomaly detection on logs.
  • Low-latency insights for troubleshooting.

Example:

  • A subscription filter:

aws logs put-subscription-filter \
    --log-group-name my-app-logs \
    --filter-name myfilter \
    --filter-pattern "" \
    --destination-arn arn:aws:firehose:region:account-id:deliverystream/myStream \
    --role-arn arn:aws:iam::account-id:role/CWLtoFirehoseRole
  • Kinesis Firehose → OpenSearch index → Real-time log search.


4️⃣ General OpenSearch Patterns

Pattern | Description | Example Services
Searchable Data Lake | Index S3 data for quick search | S3 → Lambda → OpenSearch
Centralized Logging | Collect logs from multiple services for analysis | CloudWatch, Lambda, ECS logs → OpenSearch
E-commerce Search | Product catalog search with typo tolerance, autocomplete | DynamoDB → OpenSearch → API Gateway
Security Analytics (SIEM) | Analyze VPC Flow Logs, GuardDuty, WAF logs | CloudWatch Logs → OpenSearch → OpenSearch Dashboards
Application Monitoring (APM) | Ingest app metrics, errors, traces | X-Ray or CloudWatch Metrics → OpenSearch
Real-time Recommendation Engine | Query large datasets fast | DynamoDB + OpenSearch for non-key searches

✅ Summary Table

Integration Pattern | Key Benefit
DynamoDB → OpenSearch | Advanced search on NoSQL data
CloudWatch Logs → OpenSearch | Real-time log analytics & monitoring
Data Lake → OpenSearch | Search over structured/unstructured S3 data
Security Event Logs → OpenSearch | Threat detection, compliance checks

 


1️⃣ Amazon Kinesis Overview

Amazon Kinesis is a real-time, fully managed data streaming platform that enables you to collect, process, and analyze large streams of data.

It has multiple services:

  • Kinesis Data Streams (KDS) → Real-time custom data processing
  • Kinesis Data Firehose (KDF) → Load streaming data into AWS destinations (no code needed)
  • Kinesis Data Analytics → Run SQL queries on streaming data
  • Kinesis Video Streams → Stream video data

2️⃣ Kinesis Data Streams (KDS)

Definition:

  • A real-time streaming data ingestion and processing service.
  • Lets you build custom consumers (apps) to process data within milliseconds of arrival.

Architecture:

  1. Producers: Send data records to the Kinesis stream (IoT devices, apps, services); see the producer sketch after this list.
  2. Kinesis Stream: Organizes data into shards (units of capacity).
  3. Consumers: Applications (AWS Lambda, EC2, KCL apps) process and store data.
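For step 1 above, a minimal producer sketch with boto3; the stream name and payload are made up:

import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream and payload; the partition key decides which shard receives the record.
kinesis.put_record(
    StreamName="market-events",
    Data=json.dumps({"symbol": "AMZN", "price": 181.25}).encode("utf-8"),
    PartitionKey="AMZN",
)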

Key Features:

  • Real-time latency: Data available for processing in <1 second.
  • Custom processing: Flexible consumer apps.
  • Data retention: 24 hours to 365 days.
  • Ordering: Guaranteed order per shard.
  • Scaling: Add/remove shards for throughput scaling.

Example Use Case:

  • Stock price tracking: Capture millions of market events per second.
  • Fraud detection: Real-time pattern analysis of transactions.


3️⃣ Kinesis Data Firehose (KDF)

Definition:

  • A fully managed service for loading streaming data to AWS destinations without building consumers.
  • No code required, handles scaling, buffering, and retries automatically.

Architecture:

  1. Producers: Send data (apps, IoT, CloudWatch Logs, etc.); see the sketch after this list.
  2. Firehose delivery stream: Buffers and optionally transforms data.
  3. Destinations: Amazon S3, Amazon Redshift, Amazon OpenSearch Service, or custom HTTP endpoints.
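A minimal producer sketch for step 1, again with boto3; the delivery stream name and record are made up (Firehose buffers and batches records before writing them to the destination):

import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream; Firehose buffers records before delivery to S3.
firehose.put_record(
    DeliveryStreamName="app-logs-to-s3",
    Record={"Data": (json.dumps({"level": "INFO", "msg": "user login"}) + "\n").encode("utf-8")},
)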

Key Features:

  • Near real-time: Latency is ~1 min (buffer interval).
  • Fully managed: No infrastructure to manage.
  • Automatic scaling: Handles any throughput.
  • Optional transformation: Via AWS Lambda before delivery.
  • Compression & encryption: To reduce costs and improve security.

Example Use Case:

  • Centralized log storage: Automatically collect app logs and store them in S3 for later analysis.
  • ETL pipelines: Stream events directly into Redshift for analytics.


4️⃣ Kinesis Data Streams vs. Firehose

Feature | Kinesis Data Streams (KDS) | Kinesis Data Firehose (KDF)
Processing | Build custom apps or Lambda for real-time processing | Automatic delivery, minimal processing
Latency | < 1 sec | ~1 min (buffered)
Data Retention | 24 hours – 365 days | No retention (data is delivered immediately)
Scaling | Manual (add shards) | Automatic
Destinations | Anywhere (custom consumers) | S3, Redshift, OpenSearch, HTTP endpoint
Use Case | Real-time analytics, event-driven apps | Data lake, log aggregation, archiving
Cost | Pay per shard + PUT payload | Pay per data volume ingested


5️⃣ When to Use Each?

Use KDS when:

  • You need real-time, low-latency processing.
  • You want multiple consumers processing the same data stream.
  • You need custom logic before storing or forwarding data.

Use KDF when:

  • You need easy ingestion → storage with minimal coding.
  • You want automatic, reliable delivery to S3, Redshift, or OpenSearch.
  • You are okay with slight delay (buffering) instead of millisecond latency.

6️⃣ Integration Example

  • IoT sensors → KDS → AWS Lambda → Real-time dashboard.
  • Same data → Firehose → S3 for historical data lake → Athena/Redshift for BI.

1️⃣ Amazon EMR – Overview

Amazon EMR (Elastic MapReduce) is a managed big data platform that allows you to process vast amounts of data using open-source frameworks like:

  • Apache Hadoop
  • Apache Spark
  • Apache Hive
  • Apache HBase
  • Presto
  • Flink

Key Features:

  • Easily run big data frameworks without manual cluster setup.
  • Scalable clusters with automatic resizing.
  • Cost-effective via Spot instances and auto-termination.
  • Integrates with S3, DynamoDB, Redshift, RDS, and Kinesis for data storage and analytics.

Amazon EMR – Node Types

An EMR cluster consists of 3 types of nodes:


(a) Master Node

  • Role: Manages the cluster, coordinates tasks, and monitors health.
  • Runs:
    • Cluster resource manager (YARN)
    • Hadoop NameNode
  • High availability: Optional multi-master for fault tolerance.

(b) Core Node

  • Role: Runs data processing tasks and stores data in HDFS.
  • Mandatory: At least one core node is required.
  • Persistent storage: Data in core nodes is replicated across nodes.

(c) Task Node (Optional)

  • Role: Used only for processing tasks, does not store data.
  • Elastic scaling: You can add/remove task nodes based on demand (e.g., Spot instances).

Node Summary Table

Node Type | Function | Data Storage | Mandatory
Master | Cluster coordination, task scheduling | No | Yes
Core | Task execution + HDFS storage | Yes | Yes
Task | Task execution only (scaling compute) | No | Optional

Amazon EMR – Node Purchasing Options

EMR runs on Amazon EC2 instances. You can choose from multiple pricing models:


(a) On-Demand Instances

  • Pay for EC2 capacity by the hour or second.
  • No long-term commitment.
  • Best for: Short-lived, unpredictable workloads.

(b) Reserved Instances

  • Commit to 1-year or 3-year term for a lower hourly rate.
  • Best for: Steady-state big data workloads with predictable usage.

(c) Spot Instances

  • Purchase unused EC2 capacity at up to 90% discount.
  • Best for: Task nodes or fault-tolerant workloads (non-critical jobs).
  • Risk: Instance may be terminated when AWS reclaims capacity.

(d) Instance Fleets (Mixed purchasing)

  • Combine On-Demand + Reserved + Spot instances for cost optimization.
  • Flexible across multiple EC2 instance types for maximum cluster availability.
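The node types and purchasing options above can be combined in a single cluster definition. A hedged sketch using the EMR API (boto3 run_job_flow) with On-Demand master/core nodes and Spot task nodes; cluster name, instance sizes, and roles are placeholders:

import boto3

emr = boto3.client("emr")

# Placeholder names; master/core run On-Demand, task nodes run on Spot capacity.
emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "ON_DEMAND"},
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)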

✅ Summary Cheat Sheet

Node Type | Role | Storage | Mandatory
Master | Manages cluster & tasks | No | Yes
Core | Runs tasks + stores data in HDFS | Yes | Yes
Task | Runs tasks only | No | Optional

Pricing Option | Use Case
On-Demand | Short, unpredictable jobs
Reserved | Long-term, predictable usage
Spot | Flexible, fault-tolerant jobs
Instance Fleets | Mix of all for cost savings



1️⃣ Amazon QuickSight – Overview

Amazon QuickSight is a serverless, cloud-based business intelligence (BI) service that allows users to:

  • Connect to multiple data sources.
  • Perform data analysis and visualization.
  • Build interactive dashboards accessible via browsers and mobile devices.
  • Use machine learning insights for anomaly detection and forecasting.

Key Benefits:

  • Fully managed, no infrastructure to provision.
  • Pay-per-session pricing for readers (you only pay when users access dashboards).
  • Scales to thousands of users instantly.

2️⃣ QuickSight – Integrations

QuickSight can connect to a wide range of AWS services, on-premises, and third-party data sources, including:

a) AWS Native Integrations

  • Amazon S3 → CSV, JSON, Parquet files.
  • Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server).
  • Amazon Aurora.
  • Amazon Redshift.
  • Amazon Athena → Query data in S3 using SQL.
  • Amazon OpenSearch Service.
  • AWS IoT Analytics.
  • AWS Glue Data Catalog.

b) Other Cloud and External Integrations

  • Salesforce.
  • ServiceNow.
  • Twitter.
  • Jira.
  • Adobe Analytics.

c) On-Premises Databases

  • MySQL, PostgreSQL, Microsoft SQL Server, Oracle, MariaDB via JDBC/ODBC connection or AWS Direct Connect/VPN.

d) File-based Sources

  • CSV, XLSX files uploaded directly to QuickSight.


3️⃣ QuickSight – Dashboard & Analysis

a) Analysis in QuickSight

An Analysis is where you:

  • Prepare and explore datasets.
  • Build charts, graphs, and tables.
  • Apply filters, calculated fields, and aggregations.
  • Use ML-powered insights (forecasting, anomaly detection).

✅ You can create:

  • Bar charts, line charts, pie charts.
  • Heat maps, pivot tables.
  • KPI widgets.
  • Geospatial maps.

Once the analysis is ready, you can publish it as a dashboard.


b) Dashboards in QuickSight

A Dashboard is:

  • A read-only, interactive view shared with other users.
  • Based on an analysis you created.
  • Auto-updated when data changes.

Key Features:

  • Drill-down and filter options.
  • Embedded dashboards in applications using the QuickSight SDK (see the sketch after this list).
  • Accessible via web browser or mobile app.
  • Supports row-level security to restrict data per user.
  • Can be scheduled for email reports.
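For the embedding item above, a hedged sketch using the QuickSight API via boto3; the account ID, registered user ARN, and dashboard ID are placeholders:

import boto3

quicksight = boto3.client("quicksight")

# Placeholder account ID, registered user ARN, and dashboard ID.
response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/sales-analyst",
    ExperienceConfiguration={"Dashboard": {"InitialDashboardId": "sales-performance-dashboard"}},
    SessionLifetimeInMinutes=60,
)
embed_url = response["EmbedUrl"]  # embed this URL in an iframe in your application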

Example Use Case:

  1. Data in S3 → Athena → QuickSight analysis.
  2. Build a sales performance dashboard.
  3. Share with the sales team with role-based access.
  4. Enable ML-powered forecasting for upcoming sales trends.

✅ QuickSight Cheat Sheet

Feature | Description
Integrations | S3, Athena, RDS, Redshift, OpenSearch, Salesforce, on-prem DBs
Analysis | Data exploration, visualization building, ML insights
Dashboard | Published interactive, read-only reports for users
Access Control | Row-level security and IAM integration
Pricing | Pay-per-user or pay-per-session

 

1️⃣ AWS Glue – Overview

AWS Glue is a serverless ETL (Extract, Transform, Load) and data integration service that helps you:

  • Discover, prepare, and combine data from multiple sources.
  • Transform raw data into usable formats for analytics and machine learning.
  • Build data lakes, data warehouses, and analytics pipelines.

2️⃣ AWS Glue – Convert Data into Parquet Format

Parquet is a columnar storage format that provides:

  • Better query performance than row-based formats (like CSV or JSON).
  • Efficient storage, with built-in compression.

How AWS Glue Converts Data to Parquet:

  1. Glue reads source data (CSV, JSON, relational DB, etc.).
  2. Uses a Glue ETL Job (PySpark or Scala) to process and transform the data.
  3. Writes the output to Amazon S3 in Parquet format.

Example ETL Flow

Source Data → Glue Crawler → Glue Job → Convert to Parquet → Store in S3 → Query via Athena/Redshift Spectrum
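A minimal Glue ETL job sketch for this flow (PySpark); the database, table, and bucket names are hypothetical:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV table that a crawler registered in the Data Catalog
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales_csv")

# Write it to S3 as Parquet (columnar, compressed)
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/sales_parquet/"},
    format="parquet")

job.commit()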

Benefits of Parquet conversion in Glue:

  • Optimized for big data analytics.
  • Works well with Athena, Redshift Spectrum, EMR, and QuickSight.
  • Reduces storage cost and query execution time.

3️⃣ AWS Glue Data Catalog

The Glue Data Catalog is a central metadata repository that stores:

  • Table definitions
  • Schema information
  • Partition locations
  • Data source connections

Key Points:

  • Works as a "Hive Metastore" replacement for AWS.
  • AWS Athena, Redshift Spectrum, EMR, and QuickSight rely on it to understand data structure.
  • Populated using Glue Crawlers:
    • Crawlers scan S3 or databases.
    • Automatically infer schema and store metadata.

Example Use Case:

  • Data stored in S3 in different formats.
  • Glue crawler detects tables and schema → Creates catalog entries.
  • Athena queries data using standard SQL without manual schema setup.
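A hedged sketch of creating and running the crawler from the use case above with boto3; the role, database, and bucket path are placeholders:

import boto3

glue = boto3.client("glue")

# Placeholder role, database, and S3 path.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/sales/"}]},
)
glue.start_crawler(Name="sales-raw-crawler")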

4️⃣ AWS Glue – Things to Know (High-Level)

  • Serverless: No infrastructure to manage; scales automatically.
  • ETL Jobs: Written in PySpark or Scala, run in a managed Apache Spark environment.
  • Job Triggers: Can be scheduled or event-driven (e.g., S3 file upload).
  • Integration: Works seamlessly with S3, Redshift, Athena, Lake Formation, DynamoDB.
  • Job Bookmarks: Prevent reprocessing of previously processed data.
  • Pricing: Based on Data Processing Units (DPUs) used per job.
  • Glue Studio: Provides a visual interface to build ETL pipelines without coding.
  • AWS Glue DataBrew: For low-code/no-code data preparation tasks.

✅ Quick Cheat Sheet

Feature | Description
Parquet Conversion | ETL job converts data to Parquet for better query performance
Data Catalog | Central repository of dataset schemas for AWS analytics services
Serverless ETL | Scalable data transformation with PySpark or visual workflows
Integration | Works with S3, Athena, Redshift, EMR, QuickSight
Cost | Pay per DPU, serverless

1️⃣ AWS Lake Formation – Overview

AWS Lake Formation is a managed service for building secure data lakes on AWS.
It helps you:

  • Collect, catalog, clean, secure, and share data from multiple sources.
  • Enforce fine-grained permissions (table, column, row-level) for analytics.
  • Manage centralized access control across services like Athena, Redshift Spectrum, EMR, Glue, and QuickSight.

2️⃣ Key Components of Lake Formation

  • Data Lake: Data stored in Amazon S3 in raw or processed format.
  • Glue Data Catalog: Metadata repository (integrates with Lake Formation).
  • Permissions & Policies: Grant granular access to databases, tables, or even columns.
  • Data Sharing: Securely share datasets across AWS accounts or organizations.

3️⃣ Centralized Permissions in AWS Lake Formation

Lake Formation allows centralized access control for all data lake users.
Permissions are defined once in Lake Formation and apply to all integrated services.

Types of Permissions:

  • Data Location Permissions: Access to underlying S3 buckets.
  • Database & Table Permissions:
    • SELECT, ALTER, DROP, CREATE, DESCRIBE
    • GRANT/REVOKE to other principals.
  • Column-level Permissions: Limit access to specific fields (e.g., hide salary column).
  • Row-level Permissions: Provide conditional access to subsets of data.

4️⃣ Example – Centralized Permissions

Scenario:

  • A retail company stores sales data in Amazon S3 (sales_db database).
  • They have three teams: Finance, Marketing, and Data Science.
  • Data is queried using Athena and visualized in QuickSight.

Step 1: Set up the Data Lake

  • Data ingested from multiple sources into S3 bucket.
  • AWS Glue Crawler populates Data Catalog with sales_db and transactions table.

Step 2: Define Permissions in Lake Formation

  1. Finance Team:

    • Full access to all columns.

    • SQL query example in Athena:

      SELECT * FROM sales_db.transactions;
      
  2. Marketing Team:

    • Access only to region, product, and revenue columns.

    • Deny sensitive fields (e.g., customer_email).

    • Permission granted at column level:

      Permissions: SELECT (region, product, revenue)
      
  3. Data Science Team:

    • Row-level filtering: Access only to transactions where region = 'US'.

    • Example policy:

      Filter: region = 'US'
      Permissions: SELECT *
      

Step 3: Users Query Data via Integrated Services

  • Athena: Users run queries but only see data allowed by permissions.
  • QuickSight: Dashboards show restricted data based on team access.
  • Redshift Spectrum or EMR: Same permissions enforced automatically.
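The Marketing team's column-level grant from Step 2 could be issued through the Lake Formation API. A hedged sketch with boto3; the principal ARN and catalog names are placeholders:

import boto3

lf = boto3.client("lakeformation")

# Placeholder principal; grants SELECT on three columns of sales_db.transactions only.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/MarketingAnalysts"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "transactions",
            "ColumnNames": ["region", "product", "revenue"],
        }
    },
    Permissions=["SELECT"],
)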

✅ Benefits of Centralized Permissions

  • Single source of truth: Permissions managed centrally in Lake Formation.
  • Consistent security: Applied across Athena, Redshift Spectrum, EMR, and QuickSight.
  • Granular access: Table, column, and row-level controls.
  • Cross-account sharing: Securely share data with other AWS accounts.

✅ Cheat Sheet

Feature | Description
Centralized Access | One permission model for all analytics services
Column & Row-level | Restrict sensitive fields or specific dataset portions
S3 Integration | Works with Glue Data Catalog + S3 as the data lake
Cross-account | Share data securely with other AWS accounts
Supported Services | Athena, Redshift Spectrum, EMR, Glue, QuickSight

1️⃣ Kinesis Data Analytics (KDA)

Amazon Kinesis Data Analytics (KDA) is a serverless service that lets you analyze streaming data in real time using:

  • SQL-based applications or
  • Apache Flink applications

It processes data from sources like Kinesis Data Streams or Amazon MSK (Kafka) and outputs results to destinations like S3, DynamoDB, Lambda, or Redshift.


2️⃣ Kinesis Data Analytics for SQL Applications

  • Enables real-time analytics on streaming data using standard SQL queries.
  • Ideal for developers familiar with SQL but not programming frameworks.
  • Can filter, aggregate, join, and transform streaming data.

Example Use Case:

  • An e-commerce website streams order events into Kinesis Data Streams.
  • A KDA SQL application calculates total sales per product every minute.
  • Output is sent to S3 or Redshift for BI dashboards.

3️⃣ Kinesis Data Analytics (SQL Application)

Key Features:

  • Continuous SQL queries: Runs non-stop on incoming data.
  • Windowed operations: Allows time-based aggregations.
  • Real-time dashboards: Monitor trends and anomalies instantly.

Example Query:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (product_id VARCHAR(16), total_sales DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM product_id, SUM(price) AS total_sales
FROM "SOURCE_SQL_STREAM_001"
GROUP BY product_id, STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);

Benefit: Low-code solution for streaming ETL and analytics.


4️⃣ Kinesis Data Analytics for Apache Flink

  • Enables advanced stream processing using Apache Flink, an open-source, distributed framework.
  • Supports Java, Scala, and Python applications for complex use cases:
    • Event-time processing
    • Stateful stream applications
    • Machine learning on streaming data

Example Use Case:

  • Fraud detection on payment transactions using Flink to:
    • Correlate events across time windows
    • Detect anomalies in real time

Benefit: High flexibility & advanced capabilities compared to SQL-based apps.


5️⃣ Amazon Managed Streaming for Apache Kafka (MSK)

Amazon MSK is a fully managed service for Apache Kafka, a popular open-source event streaming platform.

  • Handles high-throughput, real-time data ingestion.
  • Provides durable storage of streams with strong ordering guarantees.
  • Used when organizations already use Kafka or need more control than Kinesis.

Key Features:

  • AWS manages Kafka cluster provisioning, patching, scaling, and monitoring.
  • Integrates with EC2, Lambda, ECS, EKS, and on-premise apps.
  • MSK Serverless option for automatic scaling.

6️⃣ Kinesis Data Streams vs. Amazon MSK

Feature | Kinesis Data Streams | Amazon MSK (Kafka)
Type | AWS-native managed streaming service | Managed Apache Kafka (open source)
Setup | Fully serverless | Requires creating Kafka clusters (or MSK Serverless)
Throughput Scaling | Add shards (automatic in on-demand mode) | Manual broker scaling or auto scaling
Ecosystem | Works tightly with AWS services | Open-source Kafka ecosystem supported
Latency | Milliseconds | Low milliseconds
Use Case | AWS-first, quick setup, predictable workloads | Existing Kafka clients, hybrid environments

✅ Choose Kinesis for AWS-native, simpler workloads.
✅ Choose MSK if you need Kafka-specific features or existing Kafka applications.


7️⃣ Amazon MSK Consumers

  • Consumers are applications that read data from Kafka topics.
  • Examples:
    • AWS Lambda function triggered by Kafka events (see the sketch after this list).
    • EC2/ECS/EKS services consuming topic messages.
    • Kinesis Data Analytics for Apache Flink reading from MSK for processing.
    • Third-party tools (e.g., Logstash, Spark Streaming) integrated via Kafka APIs.
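For the Lambda consumer example above, the wiring is an event source mapping between the MSK cluster and the function. A hedged boto3 sketch; the cluster ARN, function name, and topic are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Placeholder MSK cluster ARN, function name, and topic.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/my-cluster/abc-123",
    FunctionName="process-orders",
    Topics=["orders"],
    StartingPosition="LATEST",
    BatchSize=100,
)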

8️⃣ Big Data Ingestion Pipeline (Example Architecture)

Goal: Process millions of incoming data events (e.g., IoT sensor data, clickstreams) in near real time.

Step-by-Step Flow:

  1. Data Producers:

    • Mobile apps, IoT devices, websites generate events.

  2. Streaming Ingestion Layer:

    • Kinesis Data Streams or MSK ingests data.

  3. Processing Layer:

    • Kinesis Data Analytics (SQL) for quick aggregations.

    • Kinesis Data Analytics for Flink for advanced processing or ML models.

  4. Storage Layer:

    • Amazon S3 for data lake storage.

    • DynamoDB or Redshift for structured analytics.

  5. Consumption Layer:

    • Athena, QuickSight, or custom dashboards for insights.

    • Lambda or EventBridge triggers for downstream workflows.


Cheat Sheet

Component | Purpose
KDA SQL | Simple, SQL-based streaming analytics
KDA Flink | Advanced, code-based streaming applications
MSK | Kafka-based ingestion, high customization
Kinesis vs MSK | AWS-native simplicity vs open-source control
Consumers | Apps, Lambda, analytics engines
Big Data Pipeline | Ingestion → Processing → Storage → Insights

