Alphazed runs its entire backend — serving 95,000+ students across 50+ countries — on AWS Lambda with Serverless Framework. The architecture uses Flask on Lambda behind API Gateway, MySQL 8 on RDS, S3 for content delivery, and a custom analytics lake (SQS → Kinesis Firehose → S3 → Glue → Athena). Thin Lambda handlers optimize cold-start latency, and the system serves 7+ apps from a single codebase with runtime configuration switching.
Why Serverless for EdTech?
Educational apps have unpredictable usage patterns:
- Weekday mornings: Parent downloads app before sending child to school (traffic spike)
- Weekday afternoons: After-school practice sessions (sustained load)
- Weekends: Intensive marathon sessions (2-3x normal load)
- During Ramadan: Evening usage explodes (family Quran sessions)
- School holidays: Completely different pattern
Serverless advantages:
- Pay-per-request pricing: You only pay for actual usage. If 10 users hit the API, you pay for 10 invocations. If 100,000 hit during a viral moment, you scale instantly.
- Warm starts for high-frequency endpoints: frequently-called Lambdas are kept warm (scheduled keep-alive invocations and provisioned concurrency), so most requests avoid cold-start latency
- Auto-scaling: Handle 10 concurrent users or 10,000 with zero infrastructure changes
- Zero server maintenance: The team focuses on curriculum and AI, not Kubernetes clusters or load balancers
Architecture Deep-Dive
API Gateway → Lambda → RDS
[Client App] (iOS, Android, Web)
↓
[API Gateway] (HTTP routing, rate limiting)
↓
[Lambda Handlers] (Flask app, 512MB memory, 28s timeout)
├── App routes: /app/* (mobile endpoints)
├── User routes: /user/* (authenticated endpoints)
└── Admin routes: /boss/* (admin dashboard)
↓
[MySQL 8 on RDS] (Persistent data)
↓
[Response] (JSON back to client)
Thin Lambdas for Speed
Most Lambdas are intentionally minimal:
# Thin handler (~100KB)
import json
import pymysql

def get_user_progress(event, context):
    user_id = event['pathParameters']['user_id']

    # Direct DB connection (no ORM overhead)
    conn = pymysql.connect(host='rds.aws.com', user='app',
                           password='...', database='amal')
    cursor = conn.cursor()
    cursor.execute(
        'SELECT concept_id, accuracy FROM user_memory WHERE user_id = %s',
        (user_id,)
    )
    rows = cursor.fetchall()
    conn.close()
    return {
        'statusCode': 200,
        'body': json.dumps([{'concept': r[0], 'accuracy': r[1]} for r in rows])
    }
No Flask import, no SQLAlchemy ORM, no middleware. Result: ~500ms cold start vs. 5-10s for full Flask app.
Heavy endpoints (content generation, analytics processing) use full Flask:
# Heavy handler (~30MB with Flask, SQLAlchemy, numpy)
from flask import Flask, jsonify, request
from models import UserMemory
import numpy as np

app = Flask(__name__)

@app.route('/content_duo/generate', methods=['POST'])
def generate_content_duo():
    # Complex logic requiring ORM
    user = UserMemory.query.filter_by(user_id=request.json['user_id']).first()
    # ... generate personalized session ...
    return jsonify(session_data)
Trade-off: cold starts are slower, but these are called less frequently.
Per-App Table Prefixing
One RDS instance serves 7+ apps with database-level isolation:
-- Amal app
CREATE TABLE amal_users (...)
CREATE TABLE amal_content_bytes (...)
CREATE TABLE amal_user_memory (...)
-- Thurayya app
CREATE TABLE thurayya_users (...)
CREATE TABLE thurayya_content_bytes (...)
CREATE TABLE thurayya_user_memory (...)
-- Other apps: qais_*, kidelite_*, etc.
At deploy time, APP_NAME environment variable selects prefix:
import os

app_name = os.getenv('APP_NAME', 'amal')  # 'amal', 'thurayya', 'qais', etc.

# Queries dynamically use the prefix
table_name = f'{app_name}_users'
cursor.execute(f'SELECT * FROM {table_name} WHERE id = %s', (user_id,))
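Because table names cannot be bound as query parameters, the prefix ends up interpolated into SQL directly. A minimal sketch of a guard for that, assuming a hypothetical `table_for` helper and an allowlist matching the apps named in this post:

```python
import os

# Hypothetical helper: validate APP_NAME against an allowlist before it
# is interpolated into a table name. The allowlist entries are the app
# prefixes mentioned in this article.
KNOWN_APPS = {'amal', 'thurayya', 'qais', 'kidelite', 'school', 'montessori'}

def table_for(base_table, app_name=None):
    app = (app_name or os.getenv('APP_NAME', 'amal')).lower()
    if app not in KNOWN_APPS:
        raise ValueError(f'unknown app prefix: {app}')
    return f'{app}_{base_table}'
```

Anything outside the allowlist raises instead of reaching the query string, so a misconfigured environment variable fails loudly at startup rather than producing a malformed table name.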
The Analytics Lake
Problem: Direct database queries for analytics slow down production. Running reports locks tables.
Solution: Asynchronous analytics pipeline
[Mobile App]
↓ (sends event)
[API Endpoint] → [SQS Queue] (async)
↓ (immediately responds to app)
↓ (doesn't wait for analytics)
[Kinesis Firehose] (batches events every 5 min or when 100MB reached)
↓
[S3] (partitioned: s3://analytics-lake/amal/2026/03/28/events.parquet)
↓
[AWS Glue] (crawls S3, infers schema)
↓
[Athena] (SQL queries via Presto engine)
↓
[Dashboard] (near-real-time insights, within the Firehose batching window)
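The partitioned S3 layout shown above (`app/year/month/day/`) can be reproduced with a small helper, useful when registering Athena partitions or backfilling. The function name is illustrative, not part of the production codebase:

```python
from datetime import datetime, timezone

def partition_prefix(app_name, ts=None):
    """Build an S3 key prefix matching the lake layout above,
    e.g. 'amal/2026/03/28/'. Hypothetical helper for illustration."""
    ts = ts or datetime.now(timezone.utc)
    return f'{app_name}/{ts:%Y/%m/%d}/'
```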
Dead Letter Queue (DLQ) Pattern
If analytics fails:
SQS → [Firehose fails]
↓
[DLQ receives failed messages]
↓
[Alert sent to ops]
↓
[Production API is unaffected]
Analytics never blocks user requests. Children can learn even if the analytics pipeline is down.
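The DLQ wiring above is the standard SQS redrive policy. A sketch of the queue attributes involved, with an assumed `maxReceiveCount` of 3 and an example ARN:

```python
import json

# Sketch of the SQS attributes that attach a DLQ to the main analytics
# queue. After maxReceiveCount failed receives (3 is an assumed value),
# SQS moves the message to the DLQ instead of redelivering it.
dlq_arn = 'arn:aws:sqs:us-east-1:123456789012:analytics-dlq'  # example ARN

main_queue_attributes = {
    'RedrivePolicy': json.dumps({
        'deadLetterTargetArn': dlq_arn,
        'maxReceiveCount': '3',
    })
}
# Passed as Attributes to sqs.create_queue / sqs.set_queue_attributes.
```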
Cost Optimization Strategies
Strategy 1: Thin Lambdas for high-frequency endpoints
- Typical mobile app makes 10-20 API calls per session
- 95,000 active users × 3 sessions/day × 15 calls/session = 4.275M calls/day
- If each call costs $0.0000002 (Lambda pricing), that's $0.86/day
- Reducing cold start time by 10s saves ~$500/month
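The request math above, spelled out:

```python
# Back-of-envelope Lambda request cost from the figures above.
users = 95_000
sessions_per_day = 3
calls_per_session = 15

calls_per_day = users * sessions_per_day * calls_per_session  # 4,275,000
cost_per_request = 0.20 / 1_000_000  # Lambda request pricing: $0.20 per 1M

daily_request_cost = calls_per_day * cost_per_request  # ~$0.86/day
```

Note this covers the per-request charge only; duration (GB-seconds) billing is what the cold-start reduction in the last bullet targets.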
Strategy 2: RDS Reserved Instances
- Committed 3-year reservation: ~60% discount vs. on-demand
- We use db.r6i.xlarge (4 vCPU, 32GB RAM): $2,800/month reserved vs. $6,500/month on-demand
- Annual savings: ~$44,000
Strategy 3: Caching
- Frequently-accessed data (curriculum, content bytes) cached in ElastiCache (Redis)
- Reduces RDS queries by 70%
- Cost: $800/month for cache, saves $2,000/month in RDS
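The caching layer follows the usual cache-aside pattern. A self-contained sketch, with a plain dict standing in for the Redis client and a hypothetical `load_curriculum_from_db` in place of the real RDS query:

```python
import json

cache = {}  # stands in for a redis.Redis client in this sketch

def load_curriculum_from_db(concept_id):
    # Placeholder for the real RDS query.
    return {'concept_id': concept_id, 'title': 'counting'}

def get_curriculum(concept_id):
    key = f'curriculum:{concept_id}'
    hit = cache.get(key)             # redis: cache.get(key)
    if hit is not None:
        return json.loads(hit)       # cache hit: no RDS query
    row = load_curriculum_from_db(concept_id)
    cache[key] = json.dumps(row)     # redis: cache.set(key, ..., ex=3600)
    return row
```

Values are stored as JSON strings (as Redis would hold them); with a real client the `set` call would also carry a TTL so stale curriculum eventually expires.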
Serving 7+ Apps from One Codebase
| App | Prefix | DB Tables | Lambda Stack | Status |
|---|---|---|---|---|
| Amal | amal_ | 40+ tables | Shared | Production |
| Thurayya | thurayya_ | 40+ tables | Shared | Production |
| Qais | qais_ | 35+ tables | Shared | Beta |
| KidElite | kidelite_ | 40+ tables | Shared | Production |
| Alphazed School | school_ | 50+ tables | Shared | Beta |
| Alphazed Montessori | montessori_ | 45+ tables | Shared | Internal |
One backend, one deployment pipeline, 6 simultaneous apps. New app launch: weeks instead of months.
FAQ
Q: Doesn't Lambda have a 15-minute timeout limit? A: Lambda has a 15-minute max timeout, but we rarely need long-running requests. Heavy workloads (content generation, large exports) use async jobs with SQS + Step Functions.
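The async-job pattern mentioned in that answer looks roughly like this: the API handler enqueues the heavy work and returns 202 immediately. `send_message` is injected here so the sketch runs without AWS; in production it would be `boto3.client('sqs').send_message`, and the names (`start_export_job`, the status-polling endpoint) are illustrative:

```python
import json
import uuid

def start_export_job(event, send_message, queue_url):
    # Enqueue the heavy work; a worker or Step Functions state
    # machine consumes the job from SQS.
    job_id = str(uuid.uuid4())
    send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({'job_id': job_id,
                                'user_id': event['user_id'],
                                'task': 'large_export'}),
    )
    # 202 Accepted: the client polls a status endpoint with job_id.
    return {'statusCode': 202, 'body': json.dumps({'job_id': job_id})}
```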
Q: What if the database goes down? A: RDS has Multi-AZ failover (primary + standby replica). Failover is automatic and takes ~60 seconds. Clients see brief timeouts but recovery is fast.
Q: How do you handle database connection pooling with stateless Lambda? A: Each Lambda instance maintains a connection pool (reused across warm invocations). Cold starts get fresh connections. RDS Proxy sits between Lambda and RDS to manage connection limits.
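The reuse described in that answer hinges on module scope: anything defined outside the handler survives between warm invocations of the same Lambda instance. A minimal sketch, with `connect` injected so it runs without a database (in production it would be `pymysql.connect(...)`, ideally pointed at RDS Proxy):

```python
# Module-level state is created once per Lambda instance (cold start)
# and reused while the instance stays warm.
_conn = None

def get_connection(connect):
    global _conn
    if _conn is None:        # cold start: open a fresh connection
        _conn = connect()
    return _conn             # warm invocation: reuse the open connection
```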