Why We Built Lip-Sync Animation for Arabic Sounds
4 min read · Mohammad Shaker

AI & Speech

Quick Answer

Amal uses lip-sync animation to show children how each Arabic sound is formed, making pronunciation easier to see and copy.

Why We Built a Lip-Sync Animation System for Every Arabic Sound

Amal uses Rive-powered lip-sync animations that show children exactly how to form each Arabic sound — the character's mouth moves in sync with audio pronunciation. This visual-phonetic approach helps children learn pronunciation intuitively, especially for sounds that don't exist in English (like ع, خ, غ, ح).

The Problem: Arabic Has Sounds English Doesn't

Arabic phonetics include:

  • Pharyngeal consonants (ع, ح): produced deep in the throat, no English equivalent
  • Uvular consonants (ق, خ, غ): produced at the back of the mouth
  • Emphatic consonants (ص, ض, ط, ظ): pronounced with tongue retraction

Children can't learn these sounds from text alone — they need to see mouth position. Traditional approach: a teacher demonstrates in person. Our approach: an AI character demonstrates on screen, infinitely patient and always available.

How the Lip-Sync System Works

The Rive Animation Engine

Rive (formerly Flare) is a 2D animation system with state machine support. We use it because:

  • State machines enable smooth transitions between idle → speaking → error → celebration
  • Runtime manipulation: we change mouth position programmatically rather than playing pre-rendered sequences
  • Single .riv file contains all animation states (vs. hundreds of sprite frames)
  • GPU-accelerated, 60fps on mid-range devices
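The idle → speaking → error → celebration flow can be modeled as a small state machine. This is a minimal sketch of the transition logic only: the transition table is an assumption, and in the real app these states live inside the .riv file and are driven through Rive's runtime inputs.

```python
from enum import Enum


class CharState(Enum):
    IDLE = "idle"
    SPEAKING = "speaking"
    ERROR = "error"
    CELEBRATION = "celebration"


# Allowed transitions (an illustrative assumption, not Rive's API):
# which target states each state may move to.
TRANSITIONS = {
    CharState.IDLE: {CharState.SPEAKING},
    CharState.SPEAKING: {CharState.IDLE, CharState.ERROR, CharState.CELEBRATION},
    CharState.ERROR: {CharState.IDLE, CharState.SPEAKING},
    CharState.CELEBRATION: {CharState.IDLE},
}


class CharacterStateMachine:
    def __init__(self) -> None:
        self.state = CharState.IDLE

    def fire(self, target: CharState) -> bool:
        """Transition to `target` if allowed; return whether it happened."""
        if target in TRANSITIONS[self.state]:
            self.state = target
            return True
        return False
```

Encoding the legal transitions explicitly keeps impossible jumps (say, celebration straight into error) out of the animation layer entirely.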

Speech Marks Pipeline

  1. Text-to-speech generates audio for "أَنَا" (I)
  2. TTS returns "speech marks" — precise timestamps for each phoneme
  3. Our lip_sync_avatar.json maps phonemes → Rive mouth states
  4. LipSyncController drives state machine transitions in sync with playback
  5. Child sees the character's mouth forming the correct position as they hear the sound

TTS Audio + Speech Marks
    ↓
[Extract Phoneme Timing]
    ↓
[Map to Rive States]
    ↓
[Animate Character Mouth]
    ↓
[Child Sees Mouth Position]
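Steps 2-4 of the pipeline above can be sketched as a pure mapping from timed phonemes to mouth states. The speech-marks format, the phoneme symbols, and the state names are illustrative assumptions; the mapping table stands in for lip_sync_avatar.json, not its actual contents.

```python
# Example speech marks for "أَنَا", as (time_ms, phoneme) pairs -- the kind
# of timing data a TTS provider returns alongside the audio (assumed format).
speech_marks = [(0, "ʔ"), (80, "a"), (160, "n"), (240, "aː")]

# Phoneme -> Rive mouth-state mapping; a stand-in for lip_sync_avatar.json.
PHONEME_TO_MOUTH = {
    "ʔ": "mouth_closed",
    "a": "mouth_open_wide",
    "aː": "mouth_open_wide",
    "n": "mouth_tongue_ridge",
}


def to_mouth_keyframes(marks):
    """Map each timed phoneme to the mouth state the animator should show."""
    return [(t, PHONEME_TO_MOUTH.get(p, "mouth_neutral")) for t, p in marks]


for t, state in to_mouth_keyframes(speech_marks):
    print(f"{t:4d} ms -> {state}")
```

A controller then fires each keyframe against the audio clock, so the mouth pose changes exactly when the corresponding phoneme is heard.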

Multiple Character Variants

  • Main Amal character with full-body and face-only variants
  • Friendly auxiliary characters for variety and engagement
  • Customizable avatars: children choose head shape, clothing, colors, accessories
  • Emotional states: idle, speaking, error (encouraging), celebration (praise)

When children customize their character, that personalized avatar teaches them throughout the app — creating emotional investment.
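A component-based avatar like this reduces to a small config object per child. The field names and defaults below are assumptions for illustration, not Amal's data model.

```python
from dataclasses import dataclass, field


@dataclass
class AvatarConfig:
    """Illustrative per-child avatar: one value per customizable component."""
    head_shape: str = "round"
    clothing: str = "tshirt"
    color: str = "#F2A65A"
    accessories: list[str] = field(default_factory=list)


# The personalized avatar is stored once and reused across every exercise.
avatar = AvatarConfig(head_shape="oval", accessories=["glasses"])
```

Because each component is an independent field, every combination renders from the same .riv rig rather than requiring per-combination artwork.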

Why Rive (Not Lottie or Sprite Sheets)

Approach   State Machines   Runtime Control   File Size   Performance   Cost
Rive       Full             Yes               1.2 MB      60fps         Engineering time
Lottie     Partial          Limited           2-3 MB      30fps         Animation time
Sprites    Manual           Manual            50+ MB      60fps         Asset storage
Video      N/A              None              100+ MB     Variable      Hosting cost

Rive wins because we need programmatic control, state transitions, and compact file sizes for a mobile app serving 95,000+ children.

Educational Impact

Research shows visual-phonetic learning (seeing mouth position while hearing sound) accelerates pronunciation acquisition. Our internal data:

  • Children who see lip-sync learn pronunciation 40% faster
  • Pronunciation accuracy improves 3x faster with visual feedback
  • Particularly effective for diaspora children without Arabic speakers at home

Why Competitors Can't Match This

Reproducing this requires:

  1. Phonetics expertise (knowing which mouth positions match which sounds)
  2. Rive animation skills (not trivial — state machine design is complex)
  3. TTS speech marks integration (not all TTS providers offer this)
  4. Mobile optimization (Rive rendering at 60fps across devices)
  5. Character customization system (component-based avatar architecture)

FAQ

Q: Can my child adjust the animation speed? A: Yes. Slower speeds help with difficult sounds; faster speeds suit advanced learners. The app adapts based on performance.
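One way the speed control above can keep mouth and audio in sync: when audio plays at rate r, divide each speech-mark timestamp by r. This is a sketch under that assumption, not Amal's actual code.

```python
def scale_marks(marks, rate):
    """Rescale (time_ms, phoneme) speech marks for playback at `rate`x speed."""
    return [(round(t / rate), p) for t, p in marks]


marks = [(0, "ʔ"), (300, "a"), (600, "n")]
print(scale_marks(marks, 0.5))  # half speed: timestamps double
```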

Q: Do all exercises have lip-sync animation? A: Speak-out-loud and pronunciation exercises feature full lip-sync. Other exercise types (games, puzzles) use the character for encouragement and reward animations.

Q: Why does the character sometimes show an error animation? A: When speech recognition detects mispronunciation, the character gently shows a "let's try again" expression. This is encouraging, not punishing — children learn through iterative attempts.
