Amal and Thurayya use a dual speech recognition architecture: on-device STT for near-instant feedback (roughly 100 ms) while the child speaks, and Google Cloud Speech-to-Text for higher-accuracy pronunciation scoring after the child finishes. This hybrid approach gives children the immediate responsiveness they need to stay engaged while keeping scoring accurate enough for meaningful learning.
The Fundamental Tradeoff
| Metric | Device STT | Cloud STT | Why we need both |
|---|---|---|---|
| Latency | ~100ms | ~500ms | Instant feedback + accuracy |
| Accuracy | 70% | 92% | Confidence scoring |
| Offline | ✓ | ✗ | Resilience |
| Diacritic awareness | Limited | High (with context) | Full Arabic support |
| Pronunciation detail | Coarse | Word-level timestamps | Speech marks for animation |
The child needs both simultaneously:
- Instant feedback keeps them engaged (device STT)
- Accurate feedback ensures real learning (cloud STT)
Implementation Deep-Dive
Device STT Layer (DeviceSTTMechanism)
Uses the speech_to_text Flutter package:
Child speaks "كتب" (kataba — wrote)
↓
[Device streams partial results]
↓
UI shows green highlights: "كتب" (70% confidence)
↓
[~100 ms latency: child sees feedback while speaking]
Device STT is well suited to a "work in progress" display. Children see what the app is hearing in real time, which keeps them engaged and gives immediate visual confirmation that they are being heard.
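As a rough illustration of how streamed partial results could drive the highlights, the sketch below marks each expected word as heard or not yet heard. It is written in Python for readability rather than Dart and does not call the speech_to_text package; `highlight_partial` and its simple word-by-position matching rule are assumptions, not the app's actual logic.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks; Arabic harakat are Unicode category Mn."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def highlight_partial(expected: str, partial: str) -> list[tuple[str, bool]]:
    """Pair each expected word with True if the partial transcript
    already contains a matching word at the same position."""
    heard = [strip_diacritics(w) for w in partial.split()]
    marks = []
    for i, word in enumerate(expected.split()):
        matched = i < len(heard) and heard[i] == strip_diacritics(word)
        marks.append((word, matched))
    return marks


# Called again on every partial result the device emits.
for word, matched in highlight_partial("بِسْمِ اللَّهِ", "بسم"):
    print(word, "green" if matched else "pending")
```

Comparing with diacritics removed reflects the table's point that device STT has limited diacritic awareness, so a partial transcript usually arrives unvowelled.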
Cloud STT Layer (BackendGoogleSTTMechanism)
- Audio is sent to backend → Google Cloud Speech-to-Text
- We send the expected text as a "speech context" hint
- Google returns word-level timestamps and confidence scores
- Backend compares the returned transcript against the expected text (similarity threshold: 0.7)
- Result is returned to app for final scoring
Cloud STT is slower but far more accurate, especially with diacritical context.
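The similarity step can be sketched as follows. This is a minimal example under stated assumptions: the real backend's metric is not described here, so `difflib.SequenceMatcher`, the diacritic-stripping `normalize` helper, and the rule that a score of 0.7 or higher passes are all illustrative choices.

```python
import unicodedata
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.7  # the threshold quoted in the list above


def normalize(text: str) -> str:
    """Remove combining marks (harakat) and collapse whitespace.
    Assumption: scoring ignores diacritics the recognizer may omit."""
    stripped = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return " ".join(stripped.split())


def similarity(expected: str, transcript: str) -> float:
    """Character-level similarity in [0, 1] between the two texts."""
    return SequenceMatcher(None, normalize(expected), normalize(transcript)).ratio()


def is_accepted(expected: str, transcript: str) -> bool:
    """Hypothetical pass rule: similarity at or above the threshold."""
    return similarity(expected, transcript) >= SIMILARITY_THRESHOLD


print(is_accepted("كَتَبَ", "كتب"))  # vowelled vs unvowelled form of the same word
print(is_accepted("كتب", "درس"))    # a different word with no shared letters
```

A stricter variant for tajweed practice could keep the diacritics in place and compare them separately, at the cost of penalizing transcripts that come back unvowelled.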
Speech Context Biasing: The Game-Changer
Google Speech-to-Text allows "speech adaptation" — we send the expected text as a recognition hint. This is transformative for Arabic:
Without context biasing:
- Child recites: "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ" (Basmala, the opening prayer phrase)
- Google hears: generic Arabic words, 50-60% accuracy
With context biasing:
- Child recites: "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
- We tell Google: "Listen for this exact Quranic phrase"
- Google returns: 92%+ accuracy with word-level timestamps
Internal benchmarks: Context biasing improves recognition accuracy by 35-50% for expected text.
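To make the hint concrete, here is a sketch of a request body for the Speech-to-Text v1 REST `speech:recognize` method, with the expected text passed through `speechContexts`. The field names follow the public v1 API; the encoding, sample rate, `ar-SA` language code, and boost value are assumptions chosen for illustration, and `build_recognize_request` is a hypothetical helper rather than the production code.

```python
import json


def build_recognize_request(audio_b64: str, expected_text: str,
                            boost: float = 15.0) -> dict:
    """Build a speech:recognize body that biases toward expected_text."""
    return {
        "config": {
            "encoding": "LINEAR16",         # assumed audio format
            "sampleRateHertz": 16000,       # assumed sample rate
            "languageCode": "ar-SA",        # assumed Arabic locale
            "enableWordTimeOffsets": True,  # ask for per-word start/end times
            "enableWordConfidence": True,   # ask for per-word confidence
            "speechContexts": [
                {"phrases": [expected_text], "boost": boost}
            ],
        },
        "audio": {"content": audio_b64},    # base64-encoded audio bytes
    }


body = build_recognize_request("AAAA", "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Sending the whole phrase as one entry biases toward the full recitation; adding individual words as extra phrases is another option worth testing against real recordings.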
Word-Level Timestamps for Speech Marks
Cloud STT returns:
    {
      "results": [
        {
          "word": "كتب",
          "start_time": 0.2,
          "end_time": 0.8,
          "confidence": 0.94
        }
      ]
    }
These timestamps drive:
- Lip-sync animations (blog #3): mouth position changes at precise moments
- Per-word highlighting: child sees which exact word they're on
- Error pinpointing: if they mispronounce one word in a phrase, we know which one
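The three uses above come down to two lookups over the word list: which word is active at a given playback time, and which words scored poorly. The sketch below assumes the response shape shown earlier; the function names and the 0.8 confidence cut-off are hypothetical.

```python
from bisect import bisect_right


def active_word_index(words: list[dict], t: float):
    """Return the index of the word being spoken at time t (seconds),
    or None if t falls before the first word or in a gap between words.
    Assumes words are sorted by start_time."""
    starts = [w["start_time"] for w in words]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t <= words[i]["end_time"]:
        return i
    return None


def weak_words(words: list[dict], min_confidence: float = 0.8) -> list[str]:
    """Words whose confidence falls below the (assumed) cut-off."""
    return [w["word"] for w in words if w["confidence"] < min_confidence]


words = [
    {"word": "بسم", "start_time": 0.2, "end_time": 0.8, "confidence": 0.94},
    {"word": "الله", "start_time": 0.9, "end_time": 1.5, "confidence": 0.61},
]
print(active_word_index(words, 0.5))  # index of the word to highlight
print(weak_words(words))              # words to flag for another attempt
```

An animation loop would call `active_word_index` on each frame with the current audio position, while `weak_words` runs once when the cloud result arrives.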
Graceful Degradation
If cloud STT is unavailable (no internet, API timeout), the system falls back to device STT alone. Children never see an error; they get less accurate feedback until the connection returns.
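One way to express this fallback is a wrapper that waits a bounded time for the cloud result and otherwise returns what the device already produced. This is a sketch, not the app's implementation: `score_with_fallback`, the result dictionaries, and the 3-second timeout are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def score_with_fallback(cloud_score: Callable[[], dict],
                        device_result: dict,
                        timeout_s: float = 3.0) -> dict:
    """Prefer the cloud result; fall back to the device result on
    timeout or network failure so the caller never sees an error."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud_score)
    try:
        return {"source": "cloud", **future.result(timeout=timeout_s)}
    except (FutureTimeout, OSError):
        # OSError covers ConnectionError and socket-level failures.
        return {"source": "device", **device_result}
    finally:
        pool.shutdown(wait=False)  # do not block on a stalled request


def offline_cloud() -> dict:
    raise ConnectionError("no network")


device = {"transcript": "كتب", "confidence": 0.70}
print(score_with_fallback(offline_cloud, device)["source"])
```

Tagging the result with its source lets the UI decide whether to present the score as final or provisional.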
Why Competitors Can't Match This
Replicating this requires:
- Mobile STT architecture expertise (managing dual streams)
- Google Cloud integration with speech adaptation
- Backend infrastructure for audio processing
- Similarity scoring tuned for Arabic diacritics
- Graceful degradation patterns
- Data from 95,000+ learners to validate accuracy
FAQ
Q: Which speech recognition is used for my child's score?
A: Cloud STT with context biasing. Device STT is used only for in-progress feedback while the child speaks, and as the fallback when the cloud is unreachable.
Q: Why does my child see green text while speaking but different results after?
A: Device STT shows partial, less accurate results in real time. Cloud STT's more accurate results arrive after speaking finishes. Both feedback loops are valuable.
Q: Does using two STT systems cost more?
A: Yes, but the accuracy and engagement improvement justifies the cost. We optimize by using device STT first and only sending full audio to the cloud for scoring.
Related reading
See how Amal corrects Arabic pronunciation in real time and how Thurayya applies the same stack to tajweed.



