Serverless at Scale: Running Arabic EdTech on AWS Lambda
5 min read · Mohammad Shaker


Alphazed runs its entire backend on AWS Lambda, serving 95,000+ students with Flask, MySQL 8 on RDS, and a custom analytics lake for learning outcomes.


Quick Answer


Alphazed runs its entire backend — serving 95,000+ students across 50+ countries — on AWS Lambda with Serverless Framework. The architecture uses Flask on Lambda behind API Gateway, MySQL 8 on RDS, S3 for content delivery, and a custom analytics lake (SQS → Kinesis Firehose → S3 → Glue → Athena). Thin Lambda handlers optimize cold-start latency, and the system serves 7+ apps from a single codebase with runtime configuration switching.

Why Serverless for EdTech?

Educational apps have unpredictable usage patterns:

  • Weekday mornings: Parent downloads app before sending child to school (traffic spike)
  • Weekday afternoons: After-school practice sessions (sustained load)
  • Weekends: Intensive marathon sessions (2-3x normal load)
  • During Ramadan: Evening usage explodes (family Quran sessions)
  • School holidays: Completely different pattern

Serverless advantages:

  • Pay-per-request pricing: You only pay for actual usage. If 10 users hit the API, you pay for 10 invocations. If 100,000 hit during a viral moment, you scale instantly.
  • Minimal cold starts for high-frequency endpoints: Frequently-called endpoints are kept "always warm," so users rarely hit a cold start
  • Auto-scaling: Handle 10 concurrent users or 10,000 with zero infrastructure changes
  • Zero server maintenance: The team focuses on curriculum and AI, not Kubernetes clusters or load balancers

Architecture Deep-Dive

API Gateway → Lambda → RDS

[Client App] (iOS, Android, Web)
    ↓
[API Gateway] (HTTP routing, rate limiting)
    ↓
[Lambda Handlers] (Flask app, 512MB memory, 28s timeout)
    ├── App routes: /app/* (mobile endpoints)
    ├── User routes: /user/* (authenticated endpoints)
    └── Admin routes: /boss/* (admin dashboard)
    ↓
[MySQL 8 on RDS] (Persistent data)
    ↓
[Response] (JSON back to client)
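The /app, /user, and /boss route groups above can be sketched as a minimal prefix dispatcher. This is a hypothetical illustration (the real backend uses Flask's routing); the handler names and ROUTES table are invented for the sketch:

```python
import json

def app_routes(event):
    """Public mobile endpoints under /app/*."""
    return {"statusCode": 200, "body": json.dumps({"scope": "app"})}

def user_routes(event):
    """Authenticated endpoints under /user/*."""
    return {"statusCode": 200, "body": json.dumps({"scope": "user"})}

def boss_routes(event):
    """Admin-dashboard endpoints under /boss/*."""
    return {"statusCode": 200, "body": json.dumps({"scope": "boss"})}

# Prefix table mirroring the API Gateway layout
ROUTES = {"/app": app_routes, "/user": user_routes, "/boss": boss_routes}

def handler(event, context=None):
    """Dispatch an API Gateway proxy event by path prefix."""
    path = event.get("path", "")
    for prefix, fn in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return fn(event)
    return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
```

A single dispatcher like this keeps all three route groups in one deployable unit, which is what lets one Lambda stack serve every app.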

Thin Lambdas for Speed

Most Lambdas are intentionally minimal:

# Thin handler (~100KB package)
import json
import os

import pymysql

# Module-level connection: reused across warm invocations,
# recreated only after a cold start.
_conn = None

def _get_conn():
    global _conn
    if _conn is None or not _conn.open:
        _conn = pymysql.connect(
            host=os.environ['DB_HOST'],
            user=os.environ['DB_USER'],
            password=os.environ['DB_PASSWORD'],
            database=os.environ.get('DB_NAME', 'amal'),
        )
    return _conn

def get_user_progress(event, context):
    user_id = event['pathParameters']['user_id']

    # Direct DB access (no ORM overhead)
    with _get_conn().cursor() as cursor:
        cursor.execute(
            'SELECT concept_id, accuracy FROM user_memory WHERE user_id = %s',
            (user_id,),
        )
        rows = cursor.fetchall()

    return {
        'statusCode': 200,
        'body': json.dumps([{'concept': r[0], 'accuracy': r[1]} for r in rows]),
    }

No Flask import, no SQLAlchemy ORM, no middleware. Result: ~500ms cold start vs. 5-10s for full Flask app.

Heavy endpoints (content generation, analytics processing) use full Flask:

# Heavy handler (~30MB with Flask, SQLAlchemy, numpy)
from flask import Flask, jsonify, request
from models import UserMemory
import numpy as np

app = Flask(__name__)

@app.route('/content_duo/generate', methods=['POST'])
def generate_content_duo():
    # Complex logic requiring the ORM
    user = UserMemory.query.filter_by(user_id=request.json['user_id']).first()
    # ... generate personalized session ...
    return jsonify(session_data)

Trade-off: cold starts are slower, but these are called less frequently.

Per-App Table Prefixing

One RDS instance serves 7+ apps with database-level isolation:

-- Amal app
CREATE TABLE amal_users (...)
CREATE TABLE amal_content_bytes (...)
CREATE TABLE amal_user_memory (...)

-- Thurayya app
CREATE TABLE thurayya_users (...)
CREATE TABLE thurayya_content_bytes (...)
CREATE TABLE thurayya_user_memory (...)

-- Other apps: qais_*, kidelite_*, etc.

At deploy time, APP_NAME environment variable selects prefix:

import os

app_name = os.getenv('APP_NAME', 'amal')  # 'amal', 'thurayya', 'qais', etc.

# Table names cannot be bound as SQL parameters, so the prefix is
# interpolated directly; it comes from a trusted deploy-time env var.
table_name = f'{app_name}_users'
cursor.execute(f'SELECT * FROM {table_name} WHERE id = %s', (user_id,))
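Since the prefix is interpolated into SQL text, it is worth validating against a fixed allowlist before building any table name. A minimal sketch, assuming a hypothetical KNOWN_APPS set and resolve_table helper (neither is in the real codebase):

```python
import os

# Hypothetical allowlist of deployed app prefixes
KNOWN_APPS = {"amal", "thurayya", "qais", "kidelite", "school", "montessori"}

def resolve_table(table, app_name=None):
    """Return a prefixed table name, refusing unknown prefixes."""
    app_name = app_name or os.getenv("APP_NAME", "amal")
    if app_name not in KNOWN_APPS:
        raise ValueError(f"unknown app prefix: {app_name!r}")
    return f"{app_name}_{table}"
```

Failing fast on an unknown prefix turns a misconfigured deployment into an immediate error instead of queries against nonexistent tables.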

The Analytics Lake

Problem: Direct database queries for analytics slow down production. Running reports locks tables.

Solution: Asynchronous analytics pipeline

[Mobile App]
    ↓ (sends event)
[API Endpoint] → [SQS Queue] (async)
    ↓ (immediately responds to app)
    ↓ (doesn't wait for analytics)
[Kinesis Firehose] (batches events every 5 min or when 100MB reached)
    ↓
[S3] (partitioned: s3://analytics-lake/amal/2026/03/28/events.parquet)
    ↓
[AWS Glue] (crawls S3, infers schema)
    ↓
[Athena] (SQL queries via Presto engine)
    ↓
[Dashboard] (near-real-time insights, minutes behind live events)

Dead Letter Queue (DLQ) Pattern

If analytics fails:

SQS → [Firehose fails]
  ↓
  [DLQ receives failed messages]
  ↓
  [Alert sent to ops]
  ↓
  [Production API is unaffected]

Analytics never blocks user requests. Children can learn even if the analytics pipeline is down.
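The failure path can be sketched in miniature: delivery is attempted per event, failures land in the dead-letter queue instead of raising, and the producer never blocks. Plain-Python stand-ins for SQS/Firehose and the DLQ; the function name is invented:

```python
def deliver_with_dlq(events, send, dlq):
    """Try to deliver each event; route failures to the DLQ
    instead of raising, so the producer is never blocked."""
    delivered = 0
    for event in events:
        try:
            send(event)
            delivered += 1
        except Exception:
            dlq.append(event)  # alert ops out of band
    return delivered
```

The key property is that a failing `send` changes where the event goes, never whether the caller returns.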

Cost Optimization Strategies

Strategy 1: Thin Lambdas for high-frequency endpoints

  • Typical mobile app makes 10-20 API calls per session
  • 95,000 active users × 3 sessions/day × 15 calls/session = 4.275M calls/day
  • At Lambda's per-request fee of $0.0000002 (excluding duration charges), that's ~$0.86/day in request costs
  • Reducing cold start time by 10s saves ~$500/month
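The request-fee arithmetic above is easy to fold into a small estimator (a sketch; the function name and defaults are mine, with the per-request fee taken from the figures above):

```python
def daily_request_cost(users, sessions_per_day, calls_per_session,
                       price_per_request=2e-7):
    """Lambda request-fee estimate; excludes duration-based compute charges."""
    calls = users * sessions_per_day * calls_per_session
    return calls, calls * price_per_request

calls, cost = daily_request_cost(95_000, 3, 15)
# calls == 4_275_000; cost is roughly $0.86/day
```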

Strategy 2: RDS Reserved Instances

  • Committed 3-year reservation: ~60% discount vs. on-demand
  • We use db.r6i.xlarge (4 vCPU, 32GB RAM): $2,800/month reserved vs. $6,500/month on-demand
  • Annual savings: ~$50,000

Strategy 3: Caching

  • Frequently-accessed data (curriculum, content bytes) cached in ElastiCache (Redis)
  • Reduces RDS queries by 70%
  • Cost: $800/month for cache, saves $2,000/month in RDS
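The caching strategy is a classic cache-aside read path: check the cache, fall back to the database on a miss, then populate the cache with a TTL. A minimal sketch with a dict standing in for Redis (the class and its fields are illustrative, not the production code):

```python
import time

class CacheAside:
    """Cache-aside reads: try the cache, fall back to the DB,
    then store the result with a TTL (a dict stands in for Redis)."""

    def __init__(self, db_fetch, ttl=300):
        self.db_fetch = db_fetch
        self.ttl = ttl
        self.store = {}
        self.hits = self.misses = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self.db_fetch(key)
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value
```

Curriculum and content bytes change rarely, which is why a TTL-based cache can absorb such a large share of reads.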

Serving 7+ Apps from One Codebase

App                  Prefix        DB Tables    Lambda Stack  Status
Amal                 amal_         40+ tables   Shared        Production
Thurayya             thurayya_     40+ tables   Shared        Production
Qais                 qais_         35+ tables   Shared        Beta
KidElite             kidelite_     40+ tables   Shared        Production
Alphazed School      school_       50+ tables   Shared        Beta
Alphazed Montessori  montessori_   45+ tables   Shared        Internal

One backend, one deployment pipeline, six-plus simultaneous apps. Launching a new app now takes weeks instead of months.

FAQ

Q: Doesn't Lambda have a 15-minute timeout limit? A: Lambda has a 15-minute max timeout, but we rarely need long-running requests. Heavy workloads (content generation, large exports) use async jobs with SQS + Step Functions.

Q: What if the database goes down? A: RDS has Multi-AZ failover (primary + standby replica). Failover is automatic and takes ~60 seconds. Clients see brief timeouts but recovery is fast.

Q: How do you handle database connection pooling with stateless Lambda? A: Each Lambda instance maintains a connection pool (reused across warm invocations). Cold starts get fresh connections. RDS Proxy sits between Lambda and RDS to manage connection limits.
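The per-container reuse described in that answer boils down to lazy initialization of a module-level global: the first invocation in a fresh container opens the connection, and every warm invocation after it reuses the same one. A stdlib sketch with a caller-supplied `connect` callable standing in for PyMySQL/RDS Proxy:

```python
_connection = None  # survives across warm invocations of this container

def get_connection(connect):
    """Open a connection once per container; warm invocations reuse it,
    cold starts (fresh container, _connection is None) open a new one."""
    global _connection
    if _connection is None:
        _connection = connect()
    return _connection
```

RDS Proxy then caps how many of these per-container connections the database actually sees.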
