AWS Data & Analytics
Amazon Athena – Overview
- Serverless, interactive query service to analyze data directly in Amazon S3 using standard SQL (built on the Presto/Trino engine).
- No ETL needed – Athena reads data in-place.
- Pay-per-query: Charges based on data scanned (per TB).
- Works well with:
- AWS Glue Data Catalog
- S3 data lakes
- Structured and semi-structured data (CSV, JSON, ORC, Parquet, Avro)
- Amazon CloudTrail logs, VPC Flow Logs, Application Logs
Amazon Athena – Performance Improvement Techniques
Athena performance is highly dependent on data format, partitioning, and query design.
✅ Best Practices
Technique | Benefit |
---|---|
1. Use Columnar Formats (Parquet, ORC) | Reduces data scanned; faster queries |
2. Partition Your Data | Queries read only relevant partitions (e.g., partition by year, month, region) |
3. Compress Data (Snappy, Gzip) | Reduces I/O cost and scanning |
4. Optimize File Size | Use files between 128 MB – 1 GB for parallelism |
5. Use Glue Data Catalog | Predefine schemas, avoid schema inference overhead |
6. Avoid SELECT * | Scan only required columns to minimize data read |
7. Optimize Joins | Use broadcast joins for small datasets |
8. Use CTAS (Create Table As Select) | Store query results in optimized format for future queries |
9. Use Workgroups | Set query limits, track performance and cost |
10. Tune query filters | Push-down predicates to avoid full-table scans |
💡 Example:
Instead of:
SELECT * FROM sales_data;
Use:
SELECT region, total_sales FROM sales_data WHERE year=2025 AND region='APAC';
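💡 CTAS example (technique 8): the sketch below is a rough illustration, not an official recipe — it uses boto3 to submit a CTAS statement that rewrites the raw table into partitioned, Snappy-compressed Parquet. The bucket, database, and column names are assumptions.

import boto3

athena = boto3.client("athena")

# CTAS: store query results as partitioned, compressed Parquet so future
# queries scan far less data (partition columns must come last in the SELECT)
ctas_sql = """
CREATE TABLE sales_data_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-analytics-bucket/sales_parquet/',
    partitioned_by = ARRAY['year', 'region']
) AS
SELECT total_sales, customer_id, year, region
FROM sales_data;
"""

response = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
print(response["QueryExecutionId"])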
Amazon Athena – Federated Query
Athena can query data across multiple sources, not just S3, using Athena Federated Query connectors.
✅ Supported Data Sources:
- Amazon RDS (MySQL, PostgreSQL)
- Amazon Redshift
- Amazon DynamoDB
- On-premises databases (via JDBC)
- Cloud applications (e.g., Salesforce)
🔹 How It Works:
- Deploy a Data Source Connector (a Lambda function) from the Athena Query Federation SDK.
- Connector translates SQL into source-specific query.
- Athena combines results with S3-based datasets.
🔹 Example Use Case:
- You have:
- Historical data in S3
- Real-time transactions in Amazon RDS
- Athena runs a single federated query:
SELECT s.customer_id, s.amount, r.last_login
FROM s3_sales s
JOIN rds_customers r
ON s.customer_id = r.customer_id
WHERE s.date > '2025-01-01';
✅ Result: A single query joins multiple data sources, serverless, no ETL.
📊 Athena Key Advantages
- Serverless: No infrastructure to manage
- Scalable: Auto-scales based on query size
- Integrated with AWS ecosystem (Glue, QuickSight, Redshift Spectrum, Lake Formation)
- Cost-effective for ad-hoc analytics
🔥 TL;DR:
Feature | Benefit |
---|---|
Athena | Query S3 data serverlessly using SQL |
Performance Optimization | Use partitioning, Parquet/ORC, compression, selective queries |
Federated Query | Query across S3, RDS, Redshift, DynamoDB, JDBC sources |
1️⃣ Amazon Redshift – Overview
- Amazon Redshift is a fully managed, petabyte-scale data warehouse service on AWS.
- Designed for Online Analytical Processing (OLAP) workloads, i.e., complex queries on large datasets for analytics and reporting.
- Based on PostgreSQL, optimized for columnar storage and parallel query execution.
2️⃣ OLTP vs OLAP
Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing – Redshift) |
---|---|---|
Purpose | Handles real-time transactions | Handles complex analytical queries |
Operations | Insert, update, delete, read (CRUD) | Aggregate, filter, join, analyze large datasets |
Data Volume | GBs | TBs to PBs |
Normalization | Highly normalized | Denormalized for faster reads |
Example Services | Amazon RDS (MySQL, PostgreSQL, Aurora) | Amazon Redshift, Athena, BigQuery |
Latency | Millisecond transactions | Seconds to minutes for analytics queries |
✅ Redshift is OLAP → Ideal for BI, reporting, dashboards, and analytics.
3️⃣ Redshift Cluster
- A cluster = One leader node + multiple compute nodes.
- Leader Node: Receives queries, optimizes execution plan, distributes work.
- Compute Nodes: Store data, process queries in parallel using Massively Parallel Processing (MPP).
Cluster Sizing:
- Choose:
- Node type: Dense Storage (HDD) or Dense Compute (SSD).
- Number of nodes: More nodes = more compute & storage.
- Scaling is manual: resize the cluster, or use RA3 nodes with elastic resize for faster scaling.
4️⃣ Snapshots & Disaster Recovery (DR)
- Snapshots:
  - Point-in-time backups stored in Amazon S3.
  - Automatic snapshots taken periodically.
  - Manual snapshots can be retained indefinitely.
- Disaster Recovery (DR):
  - Restore a snapshot to a new Redshift cluster in any AWS Region.
  - Cross-region snapshot copy supports DR and data migration (a scripted example follows below).
  - Snapshots are incremental, saving only changes since the last backup.
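A hedged boto3 sketch of this snapshot/DR workflow (cluster identifier, snapshot name, destination Region, and retention period are assumptions):

import boto3

redshift = boto3.client("redshift")

# Take a manual snapshot (retained until you delete it)
redshift.create_cluster_snapshot(
    SnapshotIdentifier="sales-dw-manual-2025-01-01",
    ClusterIdentifier="sales-dw",
)

# Copy automated snapshots to another Region for disaster recovery
redshift.enable_snapshot_copy(
    ClusterIdentifier="sales-dw",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days copied snapshots are kept in the destination Region
)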
5️⃣ Loading Data into Redshift
- Best Practice: Use bulk inserts (COPY command) from:
  - Amazon S3
  - Amazon EMR
  - Amazon DynamoDB
  - Remote hosts via SSH
✅ Avoid many small inserts (INSERT INTO) because:
- Redshift is optimized for batch loading.
- Small inserts → performance degradation due to:
  - High transaction overhead
  - Skewed distribution
  - Need for frequent VACUUM and ANALYZE operations
💡 Preferred approach:
COPY sales
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV IGNOREHEADER 1;
- Load in large files (1 MB to 1 GB each) for parallel ingestion.
- Compress data (gzip, bzip2) to speed up load time and reduce costs.
6️⃣ Redshift Spectrum
- Redshift Spectrum allows querying exabytes of data in S3 without loading it into Redshift.
- Uses the same SQL syntax as Redshift.
- Data is stored in open formats (Parquet, ORC, CSV, JSON).
- How it works:
  - Redshift queries data in cluster tables and S3 external tables (via the Glue Data Catalog).
  - Data is processed in parallel across thousands of Redshift Spectrum nodes.
  - Combine hot data (in Redshift) with cold data (in S3) in a single query.
Example:
SELECT region, SUM(sales)
FROM spectrum.sales_s3
WHERE year = 2025
GROUP BY region;
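For the query above to work, the spectrum external schema must first be mapped to a Glue Data Catalog database. A minimal sketch using the Redshift Data API (cluster identifier, database, user, and IAM role ARN are assumptions):

import boto3

redshift_data = boto3.client("redshift-data")

# Map a Glue Data Catalog database to an external schema named "spectrum"
ddl = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

redshift_data.execute_statement(
    ClusterIdentifier="sales-dw",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)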
✅ Benefits:
- Cost-effective: Pay only for data scanned in S3.
- Seamless integration: Avoids ETL for historical or infrequently accessed data.
- Scalable: Query PB-scale datasets.
✅ Summary Table
Feature | Details |
---|---|
Workload type | OLAP – analytical queries on large datasets |
Cluster | Leader + compute nodes, MPP architecture |
Snapshots & DR | Automated backups, cross-region restore |
Data Loading | Use COPY with large files; avoid small inserts |
Redshift Spectrum | Query S3 data directly without loading into cluster |
1️⃣ Amazon OpenSearch Service – Overview
- Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed search and analytics engine.
- Supports:
- Full-text search
- Log analytics
- Real-time application monitoring
- Interactive data visualization (via OpenSearch Dashboards)
✅ Key Capabilities:
- Index and search structured & unstructured data.
- Near real-time data ingestion.
- Scalable and highly available (multi-AZ support).
- Integrates easily with AWS data sources (DynamoDB, CloudWatch, S3, Kinesis, Lambda).
2️⃣ OpenSearch Pattern – DynamoDB → OpenSearch
Use Case:
Enable fast search on non-key attributes of DynamoDB data.
Architecture:
- DynamoDB Streams → Captures changes (INSERT, MODIFY, REMOVE).
- AWS Lambda function → Triggers on DynamoDB stream events.
- Lambda transforms and pushes data to Amazon OpenSearch Service index.
Flow:
DynamoDB → Streams → Lambda → OpenSearch Index → Search Query API
✅ Benefits:
- DynamoDB is great for key-value lookups, but lacks complex text search.
- OpenSearch provides:
- Full-text search
- Fuzzy matching
- Aggregations & filtering
Example Lambda Pseudocode:
from opensearchpy import OpenSearch

# Connect to the OpenSearch domain endpoint (add SigV4 or basic auth in real code)
client = OpenSearch(
    hosts=[{'host': 'my-domain.region.es.amazonaws.com', 'port': 443}],
    use_ssl=True
)

def lambda_handler(event, context):
    for record in event['Records']:
        doc_id = record['dynamodb']['Keys']['id']['S']
        if record['eventName'] == 'REMOVE':
            client.delete(index="products", id=doc_id)
        else:
            # NewImage is DynamoDB-typed JSON; deserialize it before indexing in real code
            document = record['dynamodb']['NewImage']
            client.index(index="products", id=doc_id, body=document)
3️⃣ OpenSearch Pattern – CloudWatch Logs → OpenSearch
Use Case:
Enable log analytics and visualization (similar to ELK stack) for operational monitoring.
Architecture:
- CloudWatch Logs capture application/system logs.
- CloudWatch Logs Subscriptions send data to:
  - AWS Lambda → transforms data → OpenSearch
  - OR Kinesis Data Firehose → delivers data to OpenSearch in bulk.
- OpenSearch Dashboards visualize logs in near real-time.
Flow:
CloudWatch Logs → (Lambda/Kinesis) → OpenSearch Index → Dashboards
✅ Benefits:
- Centralized log search for multi-service apps.
- Pattern matching, anomaly detection on logs.
- Low-latency insights for troubleshooting.
Example:
- A subscription filter sends a log group to a Firehose delivery stream:
aws logs put-subscription-filter \
  --log-group-name my-app-logs \
  --filter-name myfilter \
  --filter-pattern "" \
  --destination-arn arn:aws:firehose:region:account-id:deliverystream/myStream \
  --role-arn arn:aws:iam::account-id:role/CWLtoFirehoseRole
- Kinesis Data Firehose → OpenSearch index → real-time log search.
4️⃣ General OpenSearch Patterns
Pattern | Description | Example Services |
---|---|---|
Searchable Data Lake | Index S3 data for quick search | S3 → Lambda → OpenSearch |
Centralized Logging | Collect logs from multiple services for analysis | CloudWatch, Lambda, ECS logs → OpenSearch |
E-commerce Search | Product catalog search with typo tolerance, autocomplete | DynamoDB → OpenSearch → API Gateway |
Security Analytics (SIEM) | Analyze VPC Flow Logs, GuardDuty, WAF logs | CloudWatch Logs → OpenSearch → OpenSearch Dashboards |
Application Monitoring (APM) | Ingest app metrics, errors, traces | X-Ray or CloudWatch Metrics → OpenSearch |
Real-time Recommendation Engine | Query large datasets fast | DynamoDB + OpenSearch for non-key searches |
✅ Summary Table
Integration Pattern | Key Benefit |
---|---|
DynamoDB → OpenSearch | Advanced search on NoSQL data |
CloudWatch Logs → OpenSearch | Real-time log analytics & monitoring |
Data Lake → OpenSearch | Search over structured/unstructured S3 data |
Security Event Logs → OpenSearch | Threat detection, compliance checks |
1️⃣ Amazon Kinesis Overview
Amazon Kinesis is a real-time, fully managed data streaming platform that enables you to collect, process, and analyze large streams of data.
It has multiple services:
- Kinesis Data Streams (KDS) → Real-time custom data processing
- Kinesis Data Firehose (KDF) → Load streaming data into AWS destinations (no code needed)
- Kinesis Data Analytics → Run SQL queries on streaming data
- Kinesis Video Streams → Stream video data
2️⃣ Kinesis Data Streams (KDS)
Definition:
- A real-time streaming data ingestion and processing service.
- Lets you build custom consumers (apps) to process data within milliseconds of arrival.
Architecture:
- Producers: Send data records to Kinesis Stream (IoT devices, apps, services).
- Kinesis Stream: Organizes data into shards (units of capacity).
- Consumers: Applications (AWS Lambda, EC2, KCL apps) process and store data.
Key Features:
- Real-time latency: Data available for processing in <1 second.
- Custom processing: Flexible consumer apps.
- Data retention: 24 hours to 365 days.
- Ordering: Guaranteed order per shard.
- Scaling: Add/remove shards for throughput scaling.
Example Use Case:
- Stock price tracking: Capture millions of market events per second.
- Fraud detection: Real-time pattern analysis of transactions.
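A minimal producer sketch with boto3 (stream name and payload are assumptions); records that share a PartitionKey land on the same shard, which is what gives the per-shard ordering mentioned above:

import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: send one market event; same PartitionKey → same shard → ordered
event = {"symbol": "AMZN", "price": 178.25, "ts": "2025-01-01T09:30:00Z"}

kinesis.put_record(
    StreamName="stock-ticks",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["symbol"],
)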
3️⃣ Kinesis Data Firehose (KDF)
Definition:
- A fully managed service for loading streaming data to AWS destinations without building consumers.
- No code required, handles scaling, buffering, and retries automatically.
Architecture:
- Producers: Send data (apps, IoT, CloudWatch Logs, etc.).
- Firehose delivery stream: Buffers and optionally transforms data.
- Destinations: Amazon S3, Amazon Redshift, Amazon OpenSearch Service, or custom HTTP endpoints.
Key Features:
- Near real-time: Latency is ~1 min (buffer interval).
- Fully managed: No infrastructure to manage.
- Automatic scaling: Handles any throughput.
- Optional transformation: Via AWS Lambda before delivery.
- Compression & encryption: To reduce costs and improve security.
Example Use Case:
- Centralized log storage: Automatically collect app logs and store them in S3 for later analysis.
- ETL pipelines: Stream events directly into Redshift for analytics.
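A minimal producer sketch with boto3 (delivery stream name and payload are assumptions); Firehose buffers the records and delivers them to the configured destination, such as an S3 prefix:

import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers records (by size/time) and delivers them in batches
log_line = {"level": "ERROR", "service": "checkout", "msg": "payment timeout"}

firehose.put_record(
    DeliveryStreamName="app-logs-to-s3",
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)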
4️⃣ Kinesis Data Streams vs. Firehose
Feature | Kinesis Data Streams (KDS) | Kinesis Data Firehose (KDF) |
---|---|---|
Processing | Build custom apps or Lambda for real-time | Automatic delivery, minimal processing |
Latency | < 1 sec | ~1 min (buffered) |
Data Retention | 24 hours – 365 days | No retention (data immediately delivered) |
Scaling | Manual (add shards) | Automatic |
Destinations | Anywhere (custom consumers) | S3, Redshift, OpenSearch, HTTP endpoint |
Use Case | Real-time analytics, event-driven apps | Data lake, log aggregation, archiving |
Cost | Pay per shard + PUT payload | Pay per data volume ingested |
5️⃣ When to Use Each?
✅ Use KDS when:
- You need real-time, low-latency processing.
- You want multiple consumers processing the same data stream.
- You need custom logic before storing or forwarding data.
✅ Use KDF when:
- You need easy ingestion → storage with minimal coding.
- You want automatic, reliable delivery to S3, Redshift, or OpenSearch.
- You are okay with slight delay (buffering) instead of millisecond latency.
6️⃣ Integration Example
- IoT sensors → KDS → AWS Lambda → Real-time dashboard.
- Same data → Firehose → S3 for historical data lake → Athena/Redshift for BI.
1️⃣ Amazon EMR – Overview
Amazon EMR (Elastic MapReduce) is a managed big data platform that allows you to process vast amounts of data using open-source frameworks like:
- Apache Hadoop
- Apache Spark
- Apache Hive
- Apache HBase
- Presto
- Flink
✅ Key Features:
- Easily run big data frameworks without manual cluster setup.
- Scalable clusters with automatic resizing.
- Cost-effective via Spot instances and auto-termination.
- Integrates with S3, DynamoDB, Redshift, RDS, and Kinesis for data storage and analytics.
Amazon EMR – Node Types
An EMR cluster consists of 3 types of nodes:
(a) Master Node
- Role: Manages the cluster, coordinates tasks, and monitors health.
- Runs:
- Cluster resource manager (YARN)
- Hadoop NameNode
- High availability: Optional multi-master for fault tolerance.
(b) Core Node
- Role: Runs data processing tasks and stores data in HDFS.
- Mandatory: At least one core node is required.
- Persistent storage: Data in core nodes is replicated across nodes.
(c) Task Node (Optional)
- Role: Used only for processing tasks, does not store data.
- Elastic scaling: You can add/remove task nodes based on demand (e.g., Spot instances).
Node Summary Table
Node Type | Function | Data Storage | Mandatory |
---|---|---|---|
Master | Cluster coordination, task scheduling | No | Yes |
Core | Task execution + HDFS storage | Yes | Yes |
Task | Task execution only (scaling compute) | No | Optional |
Amazon EMR – Node Purchasing Options
EMR runs on Amazon EC2 instances. You can choose from multiple pricing models:
(a) On-Demand Instances
- Pay for EC2 capacity by the hour or second.
- No long-term commitment.
- Best for: Short-lived, unpredictable workloads.
(b) Reserved Instances
- Commit to 1-year or 3-year term for a lower hourly rate.
- Best for: Steady-state big data workloads with predictable usage.
(c) Spot Instances
- Purchase unused EC2 capacity at up to 90% discount.
- Best for: Task nodes or fault-tolerant workloads (non-critical jobs).
- Risk: Instance may be terminated when AWS reclaims capacity.
(d) Instance Fleets (Mixed purchasing)
- Combine On-Demand + Reserved + Spot instances for cost optimization.
- Flexible across multiple EC2 instance types for maximum cluster availability.
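Tying node types and purchasing options together, here is a hedged boto3 sketch that launches a cluster with On-Demand master/core nodes and Spot task nodes (release label, instance types, counts, and role names are assumptions):

import boto3

emr = boto3.client("emr")

# Master and core nodes on On-Demand; task nodes on Spot for cheap, elastic compute
response = emr.run_job_flow(
    Name="spark-analytics",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])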
✅ Summary Cheat Sheet
Node Type | Role | Storage | Mandatory |
---|---|---|---|
Master | Manages cluster & tasks | No | Yes |
Core | Runs tasks + stores data in HDFS | Yes | Yes |
Task | Runs tasks only | No | Optional |
Pricing Option | Use Case |
---|---|
On-Demand | Short, unpredictable jobs |
Reserved | Long-term, predictable usage |
Spot | Flexible, fault-tolerant jobs |
Instance Fleets | Mix of all for cost savings |
1️⃣ Amazon QuickSight – Overview
Amazon QuickSight is a serverless, cloud-based business intelligence (BI) service that allows users to:
- Connect to multiple data sources.
- Perform data analysis and visualization.
- Build interactive dashboards accessible via browsers and mobile devices.
- Use machine learning insights for anomaly detection and forecasting.
✅ Key Benefits:
- Fully managed, no infrastructure to provision.
- Pay-per-session pricing (only pay when users access dashboards).
- Scales to thousands of users instantly.
2️⃣ QuickSight – Integrations
QuickSight can connect to a wide range of AWS services, on-premises, and third-party data sources, including:
a) AWS Native Integrations
- Amazon S3 → CSV, JSON, Parquet files.
- Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server).
- Amazon Aurora.
- Amazon Redshift.
- Amazon Athena → Query data in S3 using SQL.
- Amazon OpenSearch Service.
- AWS IoT Analytics.
- AWS Glue Data Catalog.
b) Other Cloud and External Integrations
- Salesforce.
- ServiceNow.
- Twitter.
- Jira.
- Adobe Analytics.
c) On-Premises Databases
- MySQL, PostgreSQL, Microsoft SQL Server, Oracle, MariaDB via JDBC/ODBC connection or AWS Direct Connect/VPN.
d) File-based Sources
- CSV, XLSX files uploaded directly to QuickSight.
3️⃣ QuickSight – Dashboard & Analysis
a) Analysis in QuickSight
An Analysis is where you:
- Prepare and explore datasets.
- Build charts, graphs, and tables.
- Apply filters, calculated fields, and aggregations.
- Use ML-powered insights (forecasting, anomaly detection).
✅ You can create:
- Bar charts, line charts, pie charts.
- Heat maps, pivot tables.
- KPI widgets.
- Geospatial maps.
Once the analysis is ready, you can publish it as a dashboard.
b) Dashboards in QuickSight
A Dashboard is:
- A read-only, interactive view shared with other users.
- Based on an analysis you created.
- Auto-updated when data changes.
Key Features:
- Drill-down and filter options.
- Embedded dashboards in applications using the QuickSight embedding SDK/APIs (see the sketch after this list).
- Accessible via web browser or mobile app.
- Supports row-level security to restrict data per user.
- Can be scheduled for email reports.
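A hedged sketch of the embedding flow referenced above (account ID, user ARN, and dashboard ID are placeholders):

import boto3

quicksight = boto3.client("quicksight")

# Generate a short-lived URL an application can embed in an iframe
response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "sales-performance-dashboard"}
    },
    SessionLifetimeInMinutes=60,
)
print(response["EmbedUrl"])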
Example Use Case:
- Data in S3 → Athena → QuickSight analysis.
- Build a sales performance dashboard.
- Share with the sales team with role-based access.
- Enable ML-powered forecasting for upcoming sales trends.
✅ QuickSight Cheat Sheet
Feature | Description |
---|---|
Integrations | S3, Athena, RDS, Redshift, OpenSearch, Salesforce, On-prem DBs |
Analysis | Data exploration, visualization building, ML insights |
Dashboard | Published interactive, read-only reports for users |
Access Control | Row-level security and IAM integration |
Pricing | Pay-per-user or pay-per-session |
1️⃣ AWS Glue – Overview
AWS Glue is a serverless ETL (Extract, Transform, Load) and data integration service that helps you:
- Discover, prepare, and combine data from multiple sources.
- Transform raw data into usable formats for analytics and machine learning.
- Build data lakes, data warehouses, and analytics pipelines.
2️⃣ AWS Glue – Convert Data into Parquet Format
Parquet is a columnar storage format that provides:
- Better query performance than row-based formats (like CSV or JSON).
- Efficient storage, with built-in compression.
How AWS Glue Converts Data to Parquet:
- Glue reads source data (CSV, JSON, relational DB, etc.).
- Uses a Glue ETL Job (PySpark or Scala) to process and transform the data.
- Writes the output to Amazon S3 in Parquet format.
Example ETL Flow
Source Data → Glue Crawler → Glue Job → Convert to Parquet → Store in S3 → Query via Athena/Redshift Spectrum
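A minimal Glue ETL job script for the conversion step might look like the sketch below (database, table, output path, and partition column are assumptions):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table registered by a crawler (hypothetical names)
src = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="raw_csv")

# Write to S3 as Parquet, partitioned by an assumed "year" column
glueContext.write_dynamic_frame.from_options(
    frame=src,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/parquet/", "partitionKeys": ["year"]},
    format="parquet",
)
job.commit()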
✅ Benefits of Parquet conversion in Glue:
- Optimized for big data analytics.
- Works well with Athena, Redshift Spectrum, EMR, and QuickSight.
- Reduces storage cost and query execution time.
3️⃣ AWS Glue Data Catalog
The Glue Data Catalog is a central metadata repository that stores:
- Table definitions
- Schema information
- Partition locations
- Data source connections
Key Points:
- Works as a "Hive Metastore" replacement for AWS.
- AWS Athena, Redshift Spectrum, EMR, and QuickSight rely on it to understand data structure.
- Populated using Glue Crawlers:
- Crawlers scan S3 or databases.
- Automatically infer schema and store metadata.
Example Use Case:
- Data stored in S3 in different formats.
- Glue crawler detects tables and schema → Creates catalog entries.
- Athena queries data using standard SQL without manual schema setup.
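A hedged boto3 sketch of creating and starting such a crawler (crawler name, role, database, S3 path, and schedule are assumptions):

import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix and publish the inferred tables into the sales_db database
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/raw/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run nightly
)

glue.start_crawler(Name="sales-raw-crawler")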
4️⃣ AWS Glue – Things to Know (High-Level)
- Serverless: No infrastructure to manage; scales automatically.
- ETL Jobs: Written in PySpark or Scala, run in a managed Apache Spark environment.
- Job Triggers: Can be scheduled or event-driven (e.g., S3 file upload).
- Integration: Works seamlessly with S3, Redshift, Athena, Lake Formation, DynamoDB.
- Job Bookmarks: Prevent reprocessing of previously processed data.
- Pricing: Based on Data Processing Units (DPUs) used per job.
- Glue Studio: Provides a visual interface to build ETL pipelines without coding.
- AWS Glue DataBrew: For low-code/no-code data preparation tasks.
✅ Quick Cheat Sheet
Feature | Description |
---|---|
Parquet Conversion | ETL job converts data to Parquet for better query performance. |
Data Catalog | Central repository of dataset schemas for AWS analytics services |
Serverless ETL | Scalable data transformation with PySpark or visual workflows |
Integration | Works with S3, Athena, Redshift, EMR, QuickSight |
Cost | Pay-per-DPU, serverless |
1️⃣ AWS Lake Formation – Overview
AWS Lake Formation is a managed service for building secure data lakes on AWS.
It helps you:
- Collect, catalog, clean, secure, and share data from multiple sources.
- Enforce fine-grained permissions (table, column, row-level) for analytics.
- Manage centralized access control across services like Athena, Redshift Spectrum, EMR, Glue, and QuickSight.
2️⃣ Key Components of Lake Formation
- Data Lake: Data stored in Amazon S3 in raw or processed format.
- Glue Data Catalog: Metadata repository (integrates with Lake Formation).
- Permissions & Policies: Grant granular access to databases, tables, or even columns.
- Data Sharing: Securely share datasets across AWS accounts or organizations.
3️⃣ Centralized Permissions in AWS Lake Formation
Lake Formation allows centralized access control for all data lake users.
Permissions are defined once in Lake Formation and apply to all integrated services.
Types of Permissions:
- Data Location Permissions: Access to underlying S3 buckets.
- Database & Table Permissions:
  - SELECT, ALTER, DROP, CREATE, DESCRIBE
  - GRANT/REVOKE to other principals.
- Column-level Permissions: Limit access to specific fields (e.g., hide a salary column).
- Row-level Permissions: Provide conditional access to subsets of data.
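A hedged sketch of how such a grant could be scripted with boto3 (role ARN, database, table, and column names are assumptions; it mirrors the Marketing-team grant in the example that follows):

import boto3

lf = boto3.client("lakeformation")

# Column-level grant: this role may SELECT only region, product, revenue
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/MarketingAnalyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "transactions",
            "ColumnNames": ["region", "product", "revenue"],
        }
    },
    Permissions=["SELECT"],
)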
4️⃣ Example – Centralized Permissions
Scenario:
- A retail company stores sales data in Amazon S3 (sales_db database).
- They have three teams: Finance, Marketing, and Data Science.
- Data is queried using Athena and visualized in QuickSight.
Step 1: Set up the Data Lake
- Data is ingested from multiple sources into an S3 bucket.
- An AWS Glue Crawler populates the Data Catalog with the sales_db database and transactions table.
Step 2: Define Permissions in Lake Formation
- Finance Team:
  - Full access to all columns.
  - SQL query example in Athena:
    SELECT * FROM sales_db.transactions;
- Marketing Team:
  - Access only to the region, product, and revenue columns.
  - Sensitive fields (e.g., customer_email) are denied.
  - Permission granted at the column level:
    Permissions: SELECT (region, product, revenue)
- Data Science Team:
  - Row-level filtering: access only to transactions where region = 'US'.
  - Example policy:
    Filter: region = 'US'
    Permissions: SELECT *
Step 3: Users Query Data via Integrated Services
- Athena: Users run queries but only see data allowed by permissions.
- QuickSight: Dashboards show restricted data based on team access.
- Redshift Spectrum or EMR: Same permissions enforced automatically.
✅ Benefits of Centralized Permissions
- Single source of truth: Permissions managed centrally in Lake Formation.
- Consistent security: Applied across Athena, Redshift Spectrum, EMR, and QuickSight.
- Granular access: Table, column, and row-level controls.
- Cross-account sharing: Securely share data with other AWS accounts.
✅ Cheat Sheet
Feature | Description |
---|---|
Centralized Access | One permission model for all analytics services |
Column & Row-level | Restrict sensitive fields or specific dataset portions |
S3 Integration | Works with Glue Data Catalog + S3 as data lake |
Cross-account | Share data securely with other AWS accounts |
Supported Services | Athena, Redshift Spectrum, EMR, Glue, QuickSight |
1️⃣ Kinesis Data Analytics (KDA)
AWS Kinesis Data Analytics (KDA) is a serverless service that allows you to analyze streaming data in real-time using:
- SQL-based applications or
- Apache Flink applications
It processes data from sources like Kinesis Data Streams or Amazon MSK (Kafka) and outputs results to destinations like S3, DynamoDB, Lambda, or Redshift.
2️⃣ Kinesis Data Analytics for SQL Applications
- Enables real-time analytics on streaming data using standard SQL queries.
- Ideal for developers familiar with SQL but not programming frameworks.
- Can filter, aggregate, join, and transform streaming data.
Example Use Case:
- An e-commerce website streams order events into Kinesis Data Streams.
- A KDA SQL application calculates total sales per product every minute.
- Output is sent to S3 or Redshift for BI dashboards.
3️⃣ Kinesis Data Analytics (SQL Application)
Key Features:
- Continuous SQL queries: Runs non-stop on incoming data.
- Windowed operations: Allows time-based aggregations.
- Real-time dashboards: Monitor trends and anomalies instantly.
Example Query:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (product_id VARCHAR(16), total_sales DOUBLE);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM product_id, SUM(price) AS total_sales
FROM "SOURCE_SQL_STREAM_001"
GROUP BY product_id, STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
✅ Benefit: Low-code solution for streaming ETL and analytics.
4️⃣ Kinesis Data Analytics for Apache Flink
- Enables advanced stream processing using Apache Flink, an open-source, distributed framework.
- Supports Java, Scala, and Python applications for complex use cases:
- Event-time processing
- Stateful stream applications
- Machine learning on streaming data
Example Use Case:
- Fraud detection on payment transactions using Flink to:
- Correlate events across time windows
- Detect anomalies in real time
✅ Benefit: High flexibility & advanced capabilities compared to SQL-based apps.
5️⃣ Amazon Managed Streaming for Apache Kafka (MSK)
Amazon MSK is a fully managed service for Apache Kafka, a popular open-source event streaming platform.
- Handles high-throughput, real-time data ingestion.
- Provides durable storage of streams with strong ordering guarantees.
- Used when organizations already use Kafka or need more control than Kinesis.
Key Features:
- AWS manages Kafka cluster provisioning, patching, scaling, and monitoring.
- Integrates with EC2, Lambda, ECS, EKS, and on-premise apps.
- MSK Serverless option for automatic scaling.
6️⃣ Kinesis Data Streams vs. Amazon MSK
Feature | Kinesis Data Streams | Amazon MSK (Kafka) |
---|---|---|
Type | AWS-native managed streaming service | Managed Apache Kafka (open source) |
Setup | Fully serverless | Requires creating Kafka clusters |
Throughput Scaling | Shard-based: manual in provisioned mode, automatic in on-demand mode | Add brokers/storage manually or via auto scaling |
Ecosystem | Works tightly with AWS services | Open-source Kafka ecosystem supported |
Latency | Milliseconds | Low milliseconds |
Use Case | AWS-first, quick setup, predictable workloads | Existing Kafka clients, hybrid environments |
✅ Choose Kinesis for AWS-native, simpler workloads.
✅ Choose MSK if you need Kafka-specific features or existing Kafka applications.
7️⃣ Amazon MSK Consumers
- Consumers are applications that read data from Kafka topics.
- Examples:
- AWS Lambda function triggered by Kafka events (see the sketch after this list).
- EC2/ECS/EKS services consuming topic messages.
- Kinesis Data Analytics for Apache Flink reading from MSK for processing.
- Third-party tools (e.g., Logstash, Spark Streaming) integrated via Kafka APIs.
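A minimal sketch of the Lambda consumer case (topic name and payload format are assumptions; MSK event source mappings deliver records grouped by topic-partition, with base64-encoded values):

import base64
import json

# Lambda handler for an MSK event source mapping
def lambda_handler(event, context):
    for topic_partition, records in event["records"].items():
        for record in records:
            # Record values arrive base64-encoded; assume JSON payloads here
            payload = json.loads(base64.b64decode(record["value"]))
            print(topic_partition, record["offset"], payload)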
8️⃣ Big Data Ingestion Pipeline (Example Architecture)
Goal: Process millions of incoming data events (e.g., IoT sensor data, clickstreams) in near real time.
Step-by-Step Flow:
1. Data Producers:
   - Mobile apps, IoT devices, and websites generate events.
2. Streaming Ingestion Layer:
   - Kinesis Data Streams or MSK ingests the data.
3. Processing Layer:
   - Kinesis Data Analytics (SQL) for quick aggregations.
   - Kinesis Data Analytics for Flink for advanced processing or ML models.
4. Storage Layer:
   - Amazon S3 for data lake storage.
   - DynamoDB or Redshift for structured analytics.
5. Consumption Layer:
   - Athena, QuickSight, or custom dashboards for insights.
   - Lambda or EventBridge triggers for downstream workflows.
✅ Cheat Sheet
Component | Purpose |
---|---|
KDA SQL | Simple, SQL-based streaming analytics |
KDA Flink | Advanced, code-based streaming applications |
MSK | Kafka-based ingestion, high customization |
Kinesis vs MSK | AWS-native simplicity vs open-source control |
Consumers | Apps, Lambda, analytics engines |
Big Data Pipeline | Ingestion → Processing → Storage → Insights |