AWS Data & Analytics
Amazon Athena – Overview
- Serverless, interactive query service to analyze data directly in Amazon S3 using standard SQL (built on the Presto/Trino engine).
- No ETL needed – Athena reads data in-place.
- Pay-per-query: Charges based on data scanned (per TB).
- Works well with:
- AWS Glue Data Catalog
- S3 data lakes
- Structured and semi-structured data (CSV, JSON, ORC, Parquet, Avro)
- Amazon CloudTrail logs, VPC Flow Logs, Application Logs
Amazon Athena – Performance Improvement Techniques
Athena performance is highly dependent on data format, partitioning, and query design.
✅ Best Practices
Technique | Benefit |
---|---|
1. Use Columnar Formats (Parquet, ORC) | Reduces data scanned; faster queries |
2. Partition Your Data | Queries read only relevant partitions (e.g., partition by year, month, region) |
3. Compress Data (Snappy, Gzip) | Reduces I/O cost and scanning |
4. Optimize File Size | Use files between 128 MB – 1 GB for parallelism |
5. Use Glue Data Catalog | Predefine schemas, avoid schema inference overhead |
6. Avoid SELECT * | Scan only required columns to minimize data read |
7. Optimize Joins | Use broadcast joins for small datasets |
8. Use CTAS (Create Table As Select) | Store query results in optimized format for future queries |
9. Use Workgroups | Set query limits, track performance and cost |
10. Tune query filters | Push-down predicates to avoid full-table scans |
💡 Example:
Instead of:
SELECT * FROM sales_data;
Use:
SELECT region, total_sales FROM sales_data WHERE year=2025 AND region='APAC';
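💡 CTAS example (technique 8): the sketch below is a rough illustration, not an official recipe — it uses boto3 to submit a CTAS statement that rewrites the raw table into partitioned, Snappy-compressed Parquet. The bucket, database, and column names are assumptions.

import boto3

athena = boto3.client("athena")

# CTAS: store query results as partitioned, compressed Parquet so future
# queries scan far less data (partition columns must come last in the SELECT)
ctas_sql = """
CREATE TABLE sales_data_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-analytics-bucket/sales_parquet/',
    partitioned_by = ARRAY['year', 'region']
) AS
SELECT total_sales, customer_id, year, region
FROM sales_data;
"""

response = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
print(response["QueryExecutionId"])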
Amazon Athena – Federated Query
Athena can query data across multiple sources, not just S3, using Athena Federated Query connectors.
✅ Supported Data Sources:
- Amazon RDS (MySQL, PostgreSQL)
- Amazon Redshift
- Amazon DynamoDB
- On-premises databases (via JDBC)
- Cloud applications (e.g., Salesforce)
🔹 How It Works:
- Deploy a Data Source Connector (a Lambda function) from the Athena Query Federation SDK.
- Connector translates SQL into source-specific query.
- Athena combines results with S3-based datasets.
🔹 Example Use Case:
- You have:
- Historical data in S3
- Real-time transactions in Amazon RDS
- Athena runs a single federated query:
SELECT s.customer_id, s.amount, r.last_login
FROM s3_sales s
JOIN rds_customers r
ON s.customer_id = r.customer_id
WHERE s.date > '2025-01-01';
✅ Result: A single query joins multiple data sources, serverless, no ETL.
📊 Athena Key Advantages
- Serverless: No infrastructure to manage
- Scalable: Auto-scales based on query size
- Integrated with AWS ecosystem (Glue, QuickSight, Redshift Spectrum, Lake Formation)
- Cost-effective for ad-hoc analytics
🔥 TL;DR:
Feature | Benefit |
---|---|
Athena | Query S3 data serverlessly using SQL |
Performance Optimization | Use partitioning, Parquet/ORC, compression, selective queries |
Federated Query | Query across S3, RDS, Redshift, DynamoDB, JDBC sources |
1️⃣ Amazon Redshift – Overview
- Amazon Redshift is a fully managed, petabyte-scale data warehouse service on AWS.
- Designed for Online Analytical Processing (OLAP) workloads, i.e., complex queries on large datasets for analytics and reporting.
- Based on PostgreSQL, optimized for columnar storage and parallel query execution.
2️⃣ OLTP vs OLAP
Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing – Redshift) |
---|---|---|
Purpose | Handles real-time transactions | Handles complex analytical queries |
Operations | Insert, update, delete, read (CRUD) | Aggregate, filter, join, analyze large datasets |
Data Volume | GBs | TBs to PBs |
Normalization | Highly normalized | Denormalized for faster reads |
Example Services | Amazon RDS (MySQL, PostgreSQL, Aurora) | Amazon Redshift, Athena, BigQuery |
Latency | Millisecond transactions | Seconds to minutes for analytics queries |
✅ Redshift is OLAP → Ideal for BI, reporting, dashboards, and analytics.
3️⃣ Redshift Cluster
- A cluster = One leader node + multiple compute nodes.
- Leader Node: Receives queries, optimizes execution plan, distributes work.
- Compute Nodes: Store data, process queries in parallel using Massively Parallel Processing (MPP).
Cluster Sizing:
- Choose:
- Node type: Dense Storage (HDD) or Dense Compute (SSD).
- Number of nodes: More nodes = more compute & storage.
- Scaling is manual: resize the cluster, or use RA3 nodes with elastic resize for faster scaling.
4️⃣ Snapshots & Disaster Recovery (DR)
- Snapshots:
  - Point-in-time backups stored in Amazon S3.
  - Automatic snapshots taken periodically.
  - Manual snapshots can be retained indefinitely.
- Disaster Recovery (DR):
  - Restore a snapshot to a new Redshift cluster in any AWS Region.
  - Cross-region snapshot copy supports DR and data migration (a scripted example follows below).
  - Snapshots are incremental, saving only changes since the last backup.
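A hedged boto3 sketch of this snapshot/DR workflow (cluster identifier, snapshot name, destination Region, and retention period are assumptions):

import boto3

redshift = boto3.client("redshift")

# Take a manual snapshot (retained until you delete it)
redshift.create_cluster_snapshot(
    SnapshotIdentifier="sales-dw-manual-2025-01-01",
    ClusterIdentifier="sales-dw",
)

# Copy automated snapshots to another Region for disaster recovery
redshift.enable_snapshot_copy(
    ClusterIdentifier="sales-dw",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days copied snapshots are kept in the destination Region
)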
5️⃣ Loading Data into Redshift
- Best Practice: Use bulk inserts (COPY command) from:
  - Amazon S3
  - Amazon EMR
  - Amazon DynamoDB
  - Remote hosts via SSH
✅ Avoid many small inserts (INSERT INTO) because:
- Redshift is optimized for batch loading.
- Small inserts → performance degradation due to:
  - High transaction overhead
  - Skewed distribution
  - Need for frequent VACUUM and ANALYZE operations
💡 Preferred approach:
COPY sales
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV IGNOREHEADER 1;
- Load in large files (1 MB to 1 GB each) for parallel ingestion.
- Compress data (gzip, bzip2) to speed up load time and reduce costs.
6️⃣ Redshift Spectrum
- Redshift Spectrum allows querying exabytes of data in S3 without loading it into Redshift.
- Uses the same SQL syntax as Redshift.
- Data is stored in open formats (Parquet, ORC, CSV, JSON).
- How it works:
  - Redshift queries data in cluster tables and S3 external tables (via the Glue Data Catalog).
  - Data is processed in parallel across thousands of Redshift Spectrum nodes.
  - Combine hot data (in Redshift) with cold data (in S3) in a single query.
Example:
SELECT region, SUM(sales)
FROM spectrum.sales_s3
WHERE year = 2025
GROUP BY region;
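For the query above to work, the spectrum external schema must first be mapped to a Glue Data Catalog database. A minimal sketch using the Redshift Data API (cluster identifier, database, user, and IAM role ARN are assumptions):

import boto3

redshift_data = boto3.client("redshift-data")

# Map a Glue Data Catalog database to an external schema named "spectrum"
ddl = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

redshift_data.execute_statement(
    ClusterIdentifier="sales-dw",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)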
✅ Benefits:
- Cost-effective: Pay only for data scanned in S3.
- Seamless integration: Avoids ETL for historical or infrequently accessed data.
- Scalable: Query PB-scale datasets.
✅ Summary Table
Feature | Details |
---|---|
Workload type | OLAP – analytical queries on large datasets |
Cluster | Leader + compute nodes, MPP architecture |
Snapshots & DR | Automated backups, cross-region restore |
Data Loading | Use COPY with large files; avoid small inserts |
Redshift Spectrum | Query S3 data directly without loading into cluster |
1️⃣ Amazon OpenSearch Service – Overview
- Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed search and analytics engine.
- Supports:
- Full-text search
- Log analytics
- Real-time application monitoring
- Interactive data visualization (via OpenSearch Dashboards)
✅ Key Capabilities:
- Index and search structured & unstructured data.
- Near real-time data ingestion.
- Scalable and highly available (multi-AZ support).
- Integrates easily with AWS data sources (DynamoDB, CloudWatch, S3, Kinesis, Lambda).
2️⃣ OpenSearch Pattern – DynamoDB → OpenSearch
Use Case:
Enable fast search on non-key attributes of DynamoDB data.
Architecture:
- DynamoDB Streams → Captures changes (INSERT, MODIFY, REMOVE).
- AWS Lambda function → Triggers on DynamoDB stream events.
- Lambda transforms and pushes data to Amazon OpenSearch Service index.
Flow:
DynamoDB → Streams → Lambda → OpenSearch Index → Search Query API
✅ Benefits:
- DynamoDB is great for key-value lookups, but lacks complex text search.
- OpenSearch provides:
- Full-text search
- Fuzzy matching
- Aggregations & filtering
Example Lambda Pseudocode:
from opensearchpy import OpenSearch

# Connect to the OpenSearch domain endpoint (add SigV4 or basic auth in real code)
client = OpenSearch(
    hosts=[{'host': 'my-domain.region.es.amazonaws.com', 'port': 443}],
    use_ssl=True
)

def lambda_handler(event, context):
    for record in event['Records']:
        doc_id = record['dynamodb']['Keys']['id']['S']
        if record['eventName'] == 'REMOVE':
            client.delete(index="products", id=doc_id)
        else:
            # NewImage is DynamoDB-typed JSON; deserialize it before indexing in real code
            document = record['dynamodb']['NewImage']
            client.index(index="products", id=doc_id, body=document)
3️⃣ OpenSearch Pattern – CloudWatch Logs → OpenSearch
Use Case:
Enable log analytics and visualization (similar to ELK stack) for operational monitoring.
Architecture:
- CloudWatch Logs capture application/system logs.
- CloudWatch Logs Subscriptions send data to:
  - AWS Lambda → transforms data → OpenSearch
  - OR Kinesis Data Firehose → delivers data to OpenSearch in bulk.
- OpenSearch Dashboards visualize logs in near real-time.
Flow:
CloudWatch Logs → (Lambda/Kinesis) → OpenSearch Index → Dashboards
✅ Benefits:
- Centralized log search for multi-service apps.
- Pattern matching, anomaly detection on logs.
- Low-latency insights for troubleshooting.
Example:
- A subscription filter sends a log group to a Firehose delivery stream:
aws logs put-subscription-filter \
  --log-group-name my-app-logs \
  --filter-name myfilter \
  --filter-pattern "" \
  --destination-arn arn:aws:firehose:region:account-id:deliverystream/myStream \
  --role-arn arn:aws:iam::account-id:role/CWLtoFirehoseRole
- Kinesis Data Firehose → OpenSearch index → real-time log search.
4️⃣ General OpenSearch Patterns
Pattern | Description | Example Services |
---|---|---|
Searchable Data Lake | Index S3 data for quick search | S3 → Lambda → OpenSearch |
Centralized Logging | Collect logs from multiple services for analysis | CloudWatch, Lambda, ECS logs → OpenSearch |
E-commerce Search | Product catalog search with typo tolerance, autocomplete | DynamoDB → OpenSearch → API Gateway |
Security Analytics (SIEM) | Analyze VPC Flow Logs, GuardDuty, WAF logs | CloudWatch Logs → OpenSearch → OpenSearch Dashboards |
Application Monitoring (APM) | Ingest app metrics, errors, traces | X-Ray or CloudWatch Metrics → OpenSearch |
Real-time Recommendation Engine | Query large datasets fast | DynamoDB + OpenSearch for non-key searches |
✅ Summary Table
Integration Pattern | Key Benefit |
---|---|
DynamoDB → OpenSearch | Advanced search on NoSQL data |
CloudWatch Logs → OpenSearch | Real-time log analytics & monitoring |
Data Lake → OpenSearch | Search over structured/unstructured S3 data |
Security Event Logs → OpenSearch | Threat detection, compliance checks |
1️⃣ Amazon Kinesis Overview
Amazon Kinesis is a real-time, fully managed data streaming platform that enables you to collect, process, and analyze large streams of data.
It has multiple services:
- Kinesis Data Streams (KDS) → Real-time custom data processing
- Kinesis Data Firehose (KDF) → Load streaming data into AWS destinations (no code needed)
- Kinesis Data Analytics → Run SQL queries on streaming data
- Kinesis Video Streams → Stream video data
2️⃣ Kinesis Data Streams (KDS)
Definition:
- A real-time streaming data ingestion and processing service.
- Lets you build custom consumers (apps) to process data within milliseconds of arrival.
Architecture:
- Producers: Send data records to Kinesis Stream (IoT devices, apps, services).
- Kinesis Stream: Organizes data into shards (units of capacity).
- Consumers: Applications (AWS Lambda, EC2, KCL apps) process and store data.
Key Features:
- Real-time latency: Data available for processing in <1 second.
- Custom processing: Flexible consumer apps.
- Data retention: 24 hours to 365 days.
- Ordering: Guaranteed order per shard.
- Scaling: Add/remove shards for throughput scaling.
Example Use Case:
- Stock price tracking: Capture millions of market events per second.
- Fraud detection: Real-time pattern analysis of transactions.
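A minimal producer sketch with boto3 (stream name and payload are assumptions); records that share a PartitionKey land on the same shard, which is what gives the per-shard ordering mentioned above:

import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: send one market event; same PartitionKey → same shard → ordered
event = {"symbol": "AMZN", "price": 178.25, "ts": "2025-01-01T09:30:00Z"}

kinesis.put_record(
    StreamName="stock-ticks",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["symbol"],
)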
3️⃣ Kinesis Data Firehose (KDF)
Definition:
- A fully managed service for loading streaming data to AWS destinations without building consumers.
- No code required, handles scaling, buffering, and retries automatically.
Architecture:
- Producers: Send data (apps, IoT, CloudWatch Logs, etc.).
- Firehose delivery stream: Buffers and optionally transforms data.
- Destinations: Amazon S3, Amazon Redshift, Amazon OpenSearch Service, or custom HTTP endpoints.
Key Features:
- Near real-time: Latency is ~1 min (buffer interval).
- Fully managed: No infrastructure to manage.
- Automatic scaling: Handles any throughput.
- Optional transformation: Via AWS Lambda before delivery.
- Compression & encryption: To reduce costs and improve security.
Example Use Case:
- Centralized log storage: Automatically collect app logs and store them in S3 for later analysis.
- ETL pipelines: Stream events directly into Redshift for analytics.
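A minimal producer sketch with boto3 (delivery stream name and payload are assumptions); Firehose buffers the records and delivers them to the configured destination, such as an S3 prefix:

import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers records (by size/time) and delivers them in batches
log_line = {"level": "ERROR", "service": "checkout", "msg": "payment timeout"}

firehose.put_record(
    DeliveryStreamName="app-logs-to-s3",
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)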
4️⃣ Kinesis Data Streams vs. Firehose
Feature | Kinesis Data Streams (KDS) | Kinesis Data Firehose (KDF) |
---|---|---|
Processing | Build custom apps or Lambda for real-time | Automatic delivery, minimal processing |
Latency | < 1 sec | ~1 min (buffered) |
Data Retention | 24 hours – 365 days | No retention (data immediately delivered) |
Scaling | Manual (add shards) | Automatic |
Destinations | Anywhere (custom consumers) | S3, Redshift, OpenSearch, HTTP endpoint |
Use Case | Real-time analytics, event-driven apps | Data lake, log aggregation, archiving |
Cost | Pay per shard + PUT payload | Pay per data volume ingested |
5️⃣ When to Use Each?
✅ Use KDS when:
- You need real-time, low-latency processing.
- You want multiple consumers processing the same data stream.
- You need custom logic before storing or forwarding data.
✅ Use KDF when:
- You need easy ingestion → storage with minimal coding.
- You want automatic, reliable delivery to S3, Redshift, or OpenSearch.
- You are okay with slight delay (buffering) instead of millisecond latency.
6️⃣ Integration Example
- IoT sensors → KDS → AWS Lambda → Real-time dashboard.
- Same data → Firehose → S3 for historical data lake → Athena/Redshift for BI.
1️⃣ Amazon EMR – Overview
Amazon EMR (Elastic MapReduce) is a managed big data platform that allows you to process vast amounts of data using open-source frameworks like:
- Apache Hadoop
- Apache Spark
- Apache Hive
- Apache HBase
- Presto
- Flink
✅ Key Features:
- Easily run big data frameworks without manual cluster setup.
- Scalable clusters with automatic resizing.
- Cost-effective via Spot instances and auto-termination.
- Integrates with S3, DynamoDB, Redshift, RDS, and Kinesis for data storage and analytics.
Amazon EMR – Node Types
An EMR cluster consists of 3 types of nodes:
(a) Master Node
- Role: Manages the cluster, coordinates tasks, and monitors health.
- Runs:
- Cluster resource manager (YARN)
- Hadoop NameNode
- High availability: Optional multi-master for fault tolerance.
(b) Core Node
- Role: Runs data processing tasks and stores data in HDFS.
- Mandatory: At least one core node is required.
- Persistent storage: Data in core nodes is replicated across nodes.
(c) Task Node (Optional)
- Role: Used only for processing tasks, does not store data.
- Elastic scaling: You can add/remove task nodes based on demand (e.g., Spot instances).
Node Summary Table
Node Type | Function | Data Storage | Mandatory |
---|---|---|---|
Master | Cluster coordination, task scheduling | No | Yes |
Core | Task execution + HDFS storage | Yes | Yes |
Task | Task execution only (scaling compute) | No | Optional |
Amazon EMR – Node Purchasing Options
EMR runs on Amazon EC2 instances. You can choose from multiple pricing models:
(a) On-Demand Instances
- Pay for EC2 capacity by the hour or second.
- No long-term commitment.
- Best for: Short-lived, unpredictable workloads.
(b) Reserved Instances
- Commit to 1-year or 3-year term for a lower hourly rate.
- Best for: Steady-state big data workloads with predictable usage.
(c) Spot Instances
- Purchase unused EC2 capacity at up to 90% discount.
- Best for: Task nodes or fault-tolerant workloads (non-critical jobs).
- Risk: Instance may be terminated when AWS reclaims capacity.
(d) Instance Fleets (Mixed purchasing)
- Combine On-Demand + Reserved + Spot instances for cost optimization.
- Flexible across multiple EC2 instance types for maximum cluster availability.
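Tying node types and purchasing options together, here is a hedged boto3 sketch that launches a cluster with On-Demand master/core nodes and Spot task nodes (release label, instance types, counts, and role names are assumptions):

import boto3

emr = boto3.client("emr")

# Master and core nodes on On-Demand; task nodes on Spot for cheap, elastic compute
response = emr.run_job_flow(
    Name="spark-analytics",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])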
✅ Summary Cheat Sheet
Node Type | Role | Storage | Mandatory |
---|---|---|---|
Master | Manages cluster & tasks | No | Yes |
Core | Runs tasks + stores data in HDFS | Yes | Yes |
Task | Runs tasks only | No | Optional |
Pricing Option | Use Case |
---|---|
On-Demand | Short, unpredictable jobs |
Reserved | Long-term, predictable usage |
Spot | Flexible, fault-tolerant jobs |
Instance Fleets | Mix of all for cost savings |
1️⃣ Amazon QuickSight – Overview
Amazon QuickSight is a serverless, cloud-based business intelligence (BI) service that allows users to:
- Connect to multiple data sources.
- Perform data analysis and visualization.
- Build interactive dashboards accessible via browsers and mobile devices.
- Use machine learning insights for anomaly detection and forecasting.
✅ Key Benefits:
- Fully managed, no infrastructure to provision.
- Pay-per-session pricing (only pay when users access dashboards).
- Scales to thousands of users instantly.
2️⃣ QuickSight – Integrations
QuickSight can connect to a wide range of AWS services, on-premises, and third-party data sources, including:
a) AWS Native Integrations
- Amazon S3 → CSV, JSON, Parquet files.
- Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server).
- Amazon Aurora.
- Amazon Redshift.
- Amazon Athena → Query data in S3 using SQL.
- Amazon OpenSearch Service.
- AWS IoT Analytics.
- AWS Glue Data Catalog.
b) Other Cloud and External Integrations
- Salesforce.
- ServiceNow.
- Twitter.
- Jira.
- Adobe Analytics.
c) On-Premises Databases
- MySQL, PostgreSQL, Microsoft SQL Server, Oracle, MariaDB via JDBC/ODBC connection or AWS Direct Connect/VPN.
d) File-based Sources
- CSV, XLSX files uploaded directly to QuickSight.
3️⃣ QuickSight – Dashboard & Analysis
a) Analysis in QuickSight
An Analysis is where you:
- Prepare and explore datasets.
- Build charts, graphs, and tables.
- Apply filters, calculated fields, and aggregations.
- Use ML-powered insights (forecasting, anomaly detection).
✅ You can create:
- Bar charts, line charts, pie charts.
- Heat maps, pivot tables.
- KPI widgets.
- Geospatial maps.
Once the analysis is ready, you can publish it as a dashboard.
b) Dashboards in QuickSight
A Dashboard is:
- A read-only, interactive view shared with other users.
- Based on an analysis you created.
- Auto-updated when data changes.
Key Features:
- Drill-down and filter options.
- Embedded dashboards in applications using the QuickSight embedding SDK/APIs (see the sketch after this list).
- Accessible via web browser or mobile app.
- Supports row-level security to restrict data per user.
- Can be scheduled for email reports.
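A hedged sketch of the embedding flow referenced above (account ID, user ARN, and dashboard ID are placeholders):

import boto3

quicksight = boto3.client("quicksight")

# Generate a short-lived URL an application can embed in an iframe
response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "sales-performance-dashboard"}
    },
    SessionLifetimeInMinutes=60,
)
print(response["EmbedUrl"])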
Example Use Case:
- Data in S3 → Athena → QuickSight analysis.
- Build a sales performance dashboard.
- Share with the sales team with role-based access.
- Enable ML-powered forecasting for upcoming sales trends.
✅ QuickSight Cheat Sheet
Feature | Description |
---|---|
Integrations | S3, Athena, RDS, Redshift, OpenSearch, Salesforce, On-prem DBs |
Analysis | Data exploration, visualization building, ML insights |
Dashboard | Published interactive, read-only reports for users |
Access Control | Row-level security and IAM integration |
Pricing | Pay-per-user or pay-per-session |
1️⃣ AWS Glue – Overview
AWS Glue is a serverless ETL (Extract, Transform, Load) and data integration service that helps you:
- Discover, prepare, and combine data from multiple sources.
- Transform raw data into usable formats for analytics and machine learning.
- Build data lakes, data warehouses, and analytics pipelines.
2️⃣ AWS Glue – Convert Data into Parquet Format
Parquet is a columnar storage format that provides:
- Better query performance than row-based formats (like CSV or JSON).
- Efficient storage, with built-in compression.
How AWS Glue Converts Data to Parquet:
- Glue reads source data (CSV, JSON, relational DB, etc.).
- Uses a Glue ETL Job (PySpark or Scala) to process and transform the data.
- Writes the output to Amazon S3 in Parquet format.
Example ETL Flow
Source Data → Glue Crawler → Glue Job → Convert to Parquet → Store in S3 → Query via Athena/Redshift Spectrum
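A minimal Glue ETL job script for the conversion step might look like the sketch below (database, table, output path, and partition column are assumptions):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table registered by a crawler (hypothetical names)
src = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="raw_csv")

# Write to S3 as Parquet, partitioned by an assumed "year" column
glueContext.write_dynamic_frame.from_options(
    frame=src,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/parquet/", "partitionKeys": ["year"]},
    format="parquet",
)
job.commit()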
✅ Benefits of Parquet conversion in Glue:
- Optimized for big data analytics.
- Works well with Athena, Redshift Spectrum, EMR, and QuickSight.
- Reduces storage cost and query execution time.
3️⃣ AWS Glue Data Catalog
The Glue Data Catalog is a central metadata repository that stores:
- Table definitions
- Schema information
- Partition locations
- Data source connections
Key Points:
- Works as a "Hive Metastore" replacement for AWS.
- AWS Athena, Redshift Spectrum, EMR, and QuickSight rely on it to understand data structure.
- Populated using Glue Crawlers:
- Crawlers scan S3 or databases.
- Automatically infer schema and store metadata.
Example Use Case:
- Data stored in S3 in different formats.
- Glue crawler detects tables and schema → Creates catalog entries.
- Athena queries data using standard SQL without manual schema setup.
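A hedged boto3 sketch of creating and starting such a crawler (crawler name, role, database, S3 path, and schedule are assumptions):

import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix and publish the inferred tables into the sales_db database
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/raw/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run nightly
)

glue.start_crawler(Name="sales-raw-crawler")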
4️⃣ AWS Glue – Things to Know (High-Level)
- Serverless: No infrastructure to manage; scales automatically.
- ETL Jobs: Written in PySpark or Scala, run in a managed Apache Spark environment.
- Job Triggers: Can be scheduled or event-driven (e.g., S3 file upload).
- Integration: Works seamlessly with S3, Redshift, Athena, Lake Formation, DynamoDB.
- Job Bookmarks: Prevent reprocessing of previously processed data.
- Pricing: Based on Data Processing Units (DPUs) used per job.
- Glue Studio: Provides a visual interface to build ETL pipelines without coding.
- AWS Glue DataBrew: For low-code/no-code data preparation tasks.
✅ Quick Cheat Sheet
Feature | Description |
---|---|
Parquet Conversion | ETL job converts data to Parquet for better query performance. |
Data Catalog | Central repository of dataset schemas for AWS analytics services |
Serverless ETL | Scalable data transformation with PySpark or visual workflows |
Integration | Works with S3, Athena, Redshift, EMR, QuickSight |
Cost | Pay-per-DPU, serverless |
1️⃣ AWS Lake Formation – Overview
AWS Lake Formation is a managed service for building secure data lakes on AWS.
It helps you:
- Collect, catalog, clean, secure, and share data from multiple sources.
- Enforce fine-grained permissions (table, column, row-level) for analytics.
- Manage centralized access control across services like Athena, Redshift Spectrum, EMR, Glue, and QuickSight.
2️⃣ Key Components of Lake Formation
- Data Lake: Data stored in Amazon S3 in raw or processed format.
- Glue Data Catalog: Metadata repository (integrates with Lake Formation).
- Permissions & Policies: Grant granular access to databases, tables, or even columns.
- Data Sharing: Securely share datasets across AWS accounts or organizations.
3️⃣ Centralized Permissions in AWS Lake Formation
Lake Formation allows centralized access control for all data lake users.
Permissions are defined once in Lake Formation and apply to all integrated services.
Types of Permissions:
- Data Location Permissions: Access to underlying S3 buckets.
- Database & Table Permissions:
  - SELECT, ALTER, DROP, CREATE, DESCRIBE
  - GRANT/REVOKE to other principals.
- Column-level Permissions: Limit access to specific fields (e.g., hide a salary column).
- Row-level Permissions: Provide conditional access to subsets of data.
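A hedged sketch of how such a grant could be scripted with boto3 (role ARN, database, table, and column names are assumptions; it mirrors the Marketing-team grant in the example that follows):

import boto3

lf = boto3.client("lakeformation")

# Column-level grant: this role may SELECT only region, product, revenue
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/MarketingAnalyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "transactions",
            "ColumnNames": ["region", "product", "revenue"],
        }
    },
    Permissions=["SELECT"],
)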
4️⃣ Example – Centralized Permissions
Scenario:
- A retail company stores sales data in Amazon S3 (sales_db database).
- They have three teams: Finance, Marketing, and Data Science.
- Data is queried using Athena and visualized in QuickSight.
Step 1: Set up the Data Lake
- Data is ingested from multiple sources into an S3 bucket.
- An AWS Glue Crawler populates the Data Catalog with the sales_db database and transactions table.
Step 2: Define Permissions in Lake Formation
- Finance Team:
  - Full access to all columns.
  - SQL query example in Athena:
    SELECT * FROM sales_db.transactions;
- Marketing Team:
  - Access only to the region, product, and revenue columns.
  - Sensitive fields (e.g., customer_email) are denied.
  - Permission granted at the column level:
    Permissions: SELECT (region, product, revenue)
- Data Science Team:
  - Row-level filtering: access only to transactions where region = 'US'.
  - Example policy:
    Filter: region = 'US'
    Permissions: SELECT *
Step 3: Users Query Data via Integrated Services
- Athena: Users run queries but only see data allowed by permissions.
- QuickSight: Dashboards show restricted data based on team access.
- Redshift Spectrum or EMR: Same permissions enforced automatically.
✅ Benefits of Centralized Permissions
- Single source of truth: Permissions managed centrally in Lake Formation.
- Consistent security: Applied across Athena, Redshift Spectrum, EMR, and QuickSight.
- Granular access: Table, column, and row-level controls.
- Cross-account sharing: Securely share data with other AWS accounts.
✅ Cheat Sheet
Feature | Description |
---|---|
Centralized Access | One permission model for all analytics services |
Column & Row-level | Restrict sensitive fields or specific dataset portions |
S3 Integration | Works with Glue Data Catalog + S3 as data lake |
Cross-account | Share data securely with other AWS accounts |
Supported Services | Athena, Redshift Spectrum, EMR, Glue, QuickSight |
1️⃣ Kinesis Data Analytics (KDA)
AWS Kinesis Data Analytics (KDA) is a serverless service that allows you to analyze streaming data in real-time using:
- SQL-based applications or
- Apache Flink applications
It processes data from sources like Kinesis Data Streams or Amazon MSK (Kafka) and outputs results to destinations like S3, DynamoDB, Lambda, or Redshift.
2️⃣ Kinesis Data Analytics for SQL Applications
- Enables real-time analytics on streaming data using standard SQL queries.
- Ideal for developers familiar with SQL but not programming frameworks.
- Can filter, aggregate, join, and transform streaming data.
Example Use Case:
- An e-commerce website streams order events into Kinesis Data Streams.
- A KDA SQL application calculates total sales per product every minute.
- Output is sent to S3 or Redshift for BI dashboards.
3️⃣ Kinesis Data Analytics (SQL Application)
Key Features:
- Continuous SQL queries: Runs non-stop on incoming data.
- Windowed operations: Allows time-based aggregations.
- Real-time dashboards: Monitor trends and anomalies instantly.
Example Query:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (product_id VARCHAR(16), total_sales DOUBLE);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM product_id, SUM(price) AS total_sales
FROM "SOURCE_SQL_STREAM_001"
GROUP BY product_id, STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
✅ Benefit: Low-code solution for streaming ETL and analytics.
4️⃣ Kinesis Data Analytics for Apache Flink
- Enables advanced stream processing using Apache Flink, an open-source, distributed framework.
- Supports Java, Scala, and Python applications for complex use cases:
- Event-time processing
- Stateful stream applications
- Machine learning on streaming data
Example Use Case:
- Fraud detection on payment transactions using Flink to:
- Correlate events across time windows
- Detect anomalies in real time
✅ Benefit: High flexibility & advanced capabilities compared to SQL-based apps.
5️⃣ Amazon Managed Streaming for Apache Kafka (MSK)
Amazon MSK is a fully managed service for Apache Kafka, a popular open-source event streaming platform.
- Handles high-throughput, real-time data ingestion.
- Provides durable storage of streams with strong ordering guarantees.
- Used when organizations already use Kafka or need more control than Kinesis.
Key Features:
- AWS manages Kafka cluster provisioning, patching, scaling, and monitoring.
- Integrates with EC2, Lambda, ECS, EKS, and on-premise apps.
- MSK Serverless option for automatic scaling.
6️⃣ Kinesis Data Streams vs. Amazon MSK
Feature | Kinesis Data Streams | Amazon MSK (Kafka) |
---|---|---|
Type | AWS-native managed streaming service | Managed Apache Kafka (open source) |
Setup | Fully serverless | Requires creating Kafka clusters |
Throughput Scaling | Shard-based: manual in provisioned mode, automatic in on-demand mode | Add brokers/storage manually or via auto scaling |
Ecosystem | Works tightly with AWS services | Open-source Kafka ecosystem supported |
Latency | Milliseconds | Low milliseconds |
Use Case | AWS-first, quick setup, predictable workloads | Existing Kafka clients, hybrid environments |
✅ Choose Kinesis for AWS-native, simpler workloads.
✅ Choose MSK if you need Kafka-specific features or existing Kafka applications.
7️⃣ Amazon MSK Consumers
- Consumers are applications that read data from Kafka topics.
- Examples:
- AWS Lambda function triggered by Kafka events (see the sketch after this list).
- EC2/ECS/EKS services consuming topic messages.
- Kinesis Data Analytics for Apache Flink reading from MSK for processing.
- Third-party tools (e.g., Logstash, Spark Streaming) integrated via Kafka APIs.
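A minimal sketch of the Lambda consumer case (topic name and payload format are assumptions; MSK event source mappings deliver records grouped by topic-partition, with base64-encoded values):

import base64
import json

# Lambda handler for an MSK event source mapping
def lambda_handler(event, context):
    for topic_partition, records in event["records"].items():
        for record in records:
            # Record values arrive base64-encoded; assume JSON payloads here
            payload = json.loads(base64.b64decode(record["value"]))
            print(topic_partition, record["offset"], payload)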
8️⃣ Big Data Ingestion Pipeline (Example Architecture)
Goal: Process millions of incoming data events (e.g., IoT sensor data, clickstreams) in near real time.
Step-by-Step Flow:
1. Data Producers:
   - Mobile apps, IoT devices, and websites generate events.
2. Streaming Ingestion Layer:
   - Kinesis Data Streams or MSK ingests the data.
3. Processing Layer:
   - Kinesis Data Analytics (SQL) for quick aggregations.
   - Kinesis Data Analytics for Flink for advanced processing or ML models.
4. Storage Layer:
   - Amazon S3 for data lake storage.
   - DynamoDB or Redshift for structured analytics.
5. Consumption Layer:
   - Athena, QuickSight, or custom dashboards for insights.
   - Lambda or EventBridge triggers for downstream workflows.
✅ Cheat Sheet
Component | Purpose |
---|---|
KDA SQL | Simple, SQL-based streaming analytics |
KDA Flink | Advanced, code-based streaming applications |
MSK | Kafka-based ingestion, high customization |
Kinesis vs MSK | AWS-native simplicity vs open-source control |
Consumers | Apps, Lambda, analytics engines |
Big Data Pipeline | Ingestion → Processing → Storage → Insights |