Mohammad Shaker · 5 min read
Arabic Diacritics Done Right: How Amal Handles Tashkeel, Shadda, and Hamza
Amal handles the full complexity of Arabic diacritics: 8 tashkeel marks, 5 alef variants, 3 hamza variants, and Lam-Alef ligatures. The app's speech recognition, text rendering, and similarity scoring all treat diacritized Arabic differently from undiacritized Arabic.
## Arabic Diacritics Done Right: How Amal Handles Tashkeel, Shadda, and Hamza
Amal handles the full complexity of Arabic diacritics: 8 tashkeel marks (fatha, damma, kasra, shadda, sukun, fathatan, dammatan, kasratan), 5 alef variants (standard, madda, hamza above, hamza below, wasla), 3 hamza variants (isolated, on waw, on ya), and Lam-Alef ligatures. The app's speech recognition, text rendering, and similarity scoring all treat diacritized Arabic ("كَتَبَ") differently from undiacritized Arabic ("كتب"), a distinction that many Arabic learning apps ignore.
### Why Diacritics Matter for Learning
**The Ambiguity Problem**
Arabic without diacritics is ambiguous:
- "كتب" can mean:
  - "kataba" (he wrote), past tense
  - "kutub" (books), plural noun
  - "kutiba" (it was written), passive voice
All are spelled identically without diacritics. Diacritics remove ambiguity.
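To make the ambiguity concrete, the sketch below strips the eight tashkeel marks from the three vocalized words and checks that they collapse to one string. The helper name `strip_tashkeel` is illustrative and is not taken from Amal's codebase.

```python
# Illustrative helper (an assumption, not Amal's code): remove the 8 tashkeel marks.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}  # U+064B..U+0652

def strip_tashkeel(text: str) -> str:
    """Return text without fatha, damma, kasra, sukun, shadda, or tanween."""
    return "".join(ch for ch in text if ch not in TASHKEEL)

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"

# Collect the distinct bare spellings of the three vocalized words.
bare_forms = {strip_tashkeel(word) for word in (kataba, kutub, kutiba)}
print(len(bare_forms))  # one bare form: the three words become indistinguishable
```

Without the marks, a reader has to recover the intended word from context alone.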
**The Learning Progression**
1. **Beginner**: Learn to read WITH diacritics (easy — vowels are marked)
2. **Intermediate**: Practice WITH diacritics until automatic
3. **Advanced**: Gradually remove diacritics, reading becomes harder
4. **Fluent**: Read without diacritics fluently (native-level reading)
Many Arabic learning apps skip step 1: they either never teach diacritics or strip them from the text, so learners have to guess the vowels from the start. Amal follows the progression in order, beginning with fully diacritized text.
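One way to picture the progression is a function that decides how much vocalization to show at each stage. This is a minimal sketch under an assumed policy (advanced learners keep only shadda and tanween); the `Stage` enum, `text_for_stage`, and the policy itself are hypothetical and may differ from what Amal does.

```python
from enum import Enum

class Stage(Enum):
    BEGINNER = 1
    INTERMEDIATE = 2
    ADVANCED = 3
    FLUENT = 4

ALL_MARKS = {chr(cp) for cp in range(0x064B, 0x0653)}  # the 8 tashkeel marks
# Assumed policy: advanced learners lose short vowels and sukun first.
SHORT_VOWELS_AND_SUKUN = {"\u064E", "\u064F", "\u0650", "\u0652"}

def text_for_stage(text: str, stage: Stage) -> str:
    """Return the text with the amount of tashkeel appropriate to the stage."""
    if stage in (Stage.BEGINNER, Stage.INTERMEDIATE):
        return text  # fully diacritized
    drop = SHORT_VOWELS_AND_SUKUN if stage is Stage.ADVANCED else ALL_MARKS
    return "".join(ch for ch in text if ch not in drop)

# kaf + fatha, teh + shadda + fatha, beh + fatha ("kattaba")
kattaba = "\u0643\u064E\u062A\u0651\u064E\u0628\u064E"
advanced_view = text_for_stage(kattaba, Stage.ADVANCED)  # shadda survives
fluent_view = text_for_stage(kattaba, Stage.FLUENT)      # bare letters only
```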
### Our Unicode-Level Implementation
**The Diacritical Marks** (8 total)
```dart
// lib/src/utils/arabic_extension.dart
class ArabicExtension {
  static const Map<String, String> tashkeelMarks = {
    'FATHA': '\u064E', // َ (vowel 'a')
    'DAMMA': '\u064F', // ُ (vowel 'u')
    'KASRA': '\u0650', // ِ (vowel 'i')
    'SUKUN': '\u0652', // ْ (no vowel)
    'SHADDA': '\u0651', // ّ (doubled letter)
    'FATHATAN': '\u064B', // ً (tanween 'an')
    'DAMMATAN': '\u064C', // ٌ (tanween 'un')
    'KASRATAN': '\u064D', // ٍ (tanween 'in')
  };
  static const Map<String, String> alefVariants = {
    'ALEF_STANDARD': 'ا', // U+0627
    'ALEF_WITH_MADDA': 'آ', // U+0622 (elongated)
    'ALEF_WITH_HAMZA_ABOVE': 'أ', // U+0623
    'ALEF_WITH_HAMZA_BELOW': 'إ', // U+0625
    // U+0671 (connecting alef). U+0670 is the superscript (dagger) alef,
    // a combining mark, and must not be used here.
    'ALEF_WASLA': '\u0671', // ٱ
  };
  static const Map<String, String> hamzaVariants = {
    'HAMZA_ISOLATED': 'ء', // U+0621, standalone hamza
    'HAMZA_ON_WAW': 'ؤ', // U+0624, hamza on waw (و + hamza)
    'HAMZA_ON_YEH': 'ئ', // U+0626, hamza on yeh (ي + hamza)
  };
}
```
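Working at the code-point level means each base letter can be paired with the marks attached to it. The sketch below uses Python's standard `unicodedata` module to do that and to look up character names, which is a cheap guard against confusing look-alike code points such as U+0670 and U+0671. The function name `split_clusters` is an assumption for illustration.

```python
import unicodedata

def split_clusters(word: str) -> list[tuple[str, str]]:
    """Pair each base letter with the combining marks that follow it."""
    clusters: list[tuple[str, str]] = []
    for ch in word:
        if unicodedata.combining(ch) and clusters:
            base, marks = clusters[-1]
            clusters[-1] = (base, marks + ch)  # attach mark to previous letter
        else:
            clusters.append((ch, ""))          # start a new base letter
    return clusters

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
clusters = split_clusters(kataba)  # three letters, each carrying a fatha

# Character names distinguish the two easily confused alef code points.
wasla_name = unicodedata.name("\u0671")
dagger_name = unicodedata.name("\u0670")
```

A per-letter view like this is what makes it possible to say which vowel on which letter was wrong, rather than only that two strings differ.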
**Quranic Diacritics and Uthmani Stops**
For Thurayya, we support Quranic-specific marks:
```dart
static const Map<String, String> quranicMarks = {
  'STOP_FULL': 'ۖ', // Full stop (‖)
  'STOP_HALF': 'ۗ', // Half stop
  'STOP_QUA': 'ۙ', // Qua stop
  'STOP_NECESSARY': 'ۚ', // Necessary stop
  'TAJWEED_ELONGATION': '', // Elongation indicator
};
```
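Pause marks are annotations rather than letters, so text that contains them usually needs a mark-free form before it is compared with a transcript. A minimal sketch, assuming the marks to ignore are the small high annotation signs in the range U+06D6 to U+06DC; the exact set Amal treats this way is not stated here, and `strip_pause_marks` is a hypothetical name.

```python
# Assumption: the pause/annotation signs of interest are U+06D6..U+06DC.
QURANIC_PAUSE_MARKS = {chr(cp) for cp in range(0x06D6, 0x06DD)}

def strip_pause_marks(text: str) -> str:
    """Remove Quranic pause annotations while keeping letters and tashkeel."""
    return "".join(ch for ch in text if ch not in QURANIC_PAUSE_MARKS)

annotated = "\u0628\u064E\u06DA \u062A"  # beh + fatha + small high jeem, space, teh
plain = strip_pause_marks(annotated)     # the fatha stays, the pause mark goes
```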
### Diacritic-Aware Speech Recognition
**Context Biasing with Diacritics**
When a child is learning "كَتَبَ" (he wrote, past tense), we bias speech recognition toward that exact vocalization:
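```python
# src/services/stt_client.py
from google.cloud import speech

google_stt_client = speech.SpeechClient()

def recognize_with_diacritical_context(audio_bytes, expected_text):
    """Transcribe audio, biasing recognition toward the expected word."""
    # expected_text = "كَتَبَ" (with diacritics)
    # Create speech context hint with a high boost for the expected text
    speech_context = speech.SpeechContext(phrases=[expected_text], boost=20.0)
    # Encoding and sample rate are omitted, which works for WAV/FLAC input
    # (read from the file header); set them explicitly for raw audio
    config = speech.RecognitionConfig(
        language_code='ar-SA',
        speech_contexts=[speech_context],
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    # Send to Google Cloud STT; the hint biases results toward "kataba"
    response = google_stt_client.recognize(config=config, audio=audio)
    return response
```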
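The `phrases` list can carry more than one hint. The sketch below builds a hint list containing both the vocalized word and its bare form, on the assumption that a recognizer may return transcripts without diacritics; whether Amal does this is not stated, and `build_phrase_hints` is a hypothetical helper.

```python
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}  # U+064B..U+0652

def build_phrase_hints(expected_text: str) -> list[str]:
    """Return recognition hints: the diacritized word plus its bare form."""
    bare = "".join(ch for ch in expected_text if ch not in TASHKEEL)
    hints = [expected_text]
    if bare != expected_text:
        hints.append(bare)  # avoid a duplicate when the input has no marks
    return hints

hints = build_phrase_hints("\u0643\u064E\u062A\u064E\u0628\u064E")
print(len(hints))  # 2: the vocalized form and the bare form
```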
**Diacritic-Aware Similarity Scoring**
Similarity scoring distinguishes diacritized from undiacritized input:
```python
def compare_pronunciations(expected, actual):
    """
    expected: "كَتَبَ" (with diacritics)
    actual: "كتب" (child's attempt, possibly undiacritized)
    """
    # Strip diacritics for coarse comparison
    expected_base = strip_diacritics(expected)  # "كتب"
    actual_base = strip_diacritics(actual)  # "كتب"
    # Base similarity (ignoring diacritics)
    base_similarity = string_similarity(expected_base, actual_base)  # 1.0 (perfect)
    # Diacritical bonus (if child's attempt includes diacritics)
    diacritic_bonus = 0.0
    if has_diacritics(actual):
        diacritic_accuracy = diacritics_match_ratio(expected, actual)
        diacritic_bonus = diacritic_accuracy * 0.15  # Up to +15% for correct diacritics
    # Final score: the base counts for 85%, so a correct but undiacritized
    # attempt scores 0.85 and matching diacritics lift it to 1.0
    final_score = min(base_similarity * 0.85 + diacritic_bonus, 1.0)
    return {
        'base_score': base_similarity,
        'diacritic_bonus': diacritic_bonus,
        'final_score': final_score,
        # Example feedback for a correct but undiacritized attempt
        'feedback': 'Great! Pronunciation is perfect. Next, practice the diacritical marks.'
    }
```
This means:
- Child says "كتب" (undiacritized) → 85-90% score (correct base, missing diacritics)
- Child says "كَتَبَ" (fully diacritized) → 98%+ score (perfect)
- Progression is clear: first master base pronunciation, then add diacritical subtlety
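The helpers that `compare_pronunciations` calls are not shown above. The sketch below is one possible implementation, offered as an assumption rather than Amal's actual code: `string_similarity` uses the standard library's `difflib.SequenceMatcher`, and `diacritics_match_ratio` compares the set of marks on each letter position by position.

```python
import difflib

TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}  # the 8 tashkeel marks

def strip_diacritics(text: str) -> str:
    return "".join(ch for ch in text if ch not in TASHKEEL)

def has_diacritics(text: str) -> bool:
    return any(ch in TASHKEEL for ch in text)

def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def _marks_per_letter(text: str) -> list[set[str]]:
    """One set of marks per base letter, in reading order."""
    marks: list[set[str]] = []
    for ch in text:
        if ch in TASHKEEL:
            if marks:
                marks[-1].add(ch)
        else:
            marks.append(set())
    return marks

def diacritics_match_ratio(expected: str, actual: str) -> float:
    """Fraction of expected letters whose marks match at the same position."""
    exp, act = _marks_per_letter(expected), _marks_per_letter(actual)
    if not exp:
        return 1.0
    return sum(1 for e, a in zip(exp, act) if e == a) / len(exp)

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"
ratio = diacritics_match_ratio(kataba, kutiba)  # only the last letter matches
```

Comparing marks as sets makes the check insensitive to the order in which shadda and a vowel were typed on the same letter. Positional alignment is a simplification: it assumes both strings have the same letters, which holds once the base similarity is perfect.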
### RTL Rendering Challenges
**Text Direction Management**
```dart
// lib/src/screens/lesson_screen.dart
Column(
  children: [
    Directionality(
      textDirection: TextDirection.rtl, // For Arabic text
      child: Text(
        'كَتَبَ',
        textAlign: TextAlign.right, // Right-aligned for RTL
        style: TextStyle(
          fontFamily: 'IBMPlexSansArabic',
          fontSize: 36,
          height: 1.8, // Extra line height for diacritics
        ),
      ),
    ),
    // English instructions below
    Directionality(
      textDirection: TextDirection.ltr, // For English
      child: Text(
        'Pronounce: "he wrote"',
        textAlign: TextAlign.left, // Left-aligned for LTR
      ),
    ),
  ],
)
```
**Connected Letter Shaping**
Arabic letters change form depending on position:
- Isolated: "ك" (Kaf)
- Initial: "كَـــ" (Kaf at start of word)
- Medial: "ـــكَـــ" (Kaf in middle)
- Final: "ـــكَ" (Kaf at end)
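Flutter's text engine and the IBMPlexSansArabic font handle shaping automatically, so words should be stored as plain letter sequences. The tatweel (kashida, U+0640) is not a joining control; it only stretches the connection between letters for display:
```dart
// Correct: plain letters; shaping picks the initial, medial, and final forms
String word = 'ك' + 'ت' + 'ب';
// Display only: tatweel (U+0640) elongates the joins and changes the string,
// so strip it before comparing or searching
String stretched = 'ك' + '\u0640' + 'ت' + '\u0640' + 'ب';
```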
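Positional forms are a rendering concern, but text copied from PDFs or legacy systems sometimes arrives with the shaped presentation-form code points or with tatweel baked in. A minimal sketch of folding such input back to base letters before comparison, using the standard `unicodedata` module; the helper name `canonical_letters` is an assumption.

```python
import unicodedata

TATWEEL = "\u0640"

def canonical_letters(text: str) -> str:
    """Fold presentation forms to base letters (NFKC) and drop tatweel."""
    return unicodedata.normalize("NFKC", text).replace(TATWEEL, "")

# Kaf in its four Presentation Forms-B shapes: isolated, final, initial, medial
kaf_forms = ["\uFED9", "\uFEDA", "\uFEDB", "\uFEDC"]
folded = [canonical_letters(form) for form in kaf_forms]  # all become U+0643

# The isolated Lam-Alef ligature decomposes into lam followed by alef.
lam_alef = canonical_letters("\uFEFB")

# A stretched word compares equal to the plain one once tatweel is removed.
stretched = "\u0643\u0640\u062A\u0640\u0628"
plain = canonical_letters(stretched)
```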
### Bidirectional Text Mixing
When English and Arabic appear together:
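```dart
RichText(
  textDirection: TextDirection.ltr, // Base direction follows the English sentence
  text: TextSpan(
    children: [
      TextSpan(text: 'means ', style: englishStyle), // LTR
      TextSpan(text: 'كتاب', style: arabicStyle), // RTL run, reordered by the bidi algorithm
      TextSpan(text: ' (book)', style: englishStyle), // LTR
    ],
  ),
)
```
Result: "means كتاب (book)", with the Arabic word rendered right-to-left inside the left-to-right sentence. With `TextDirection.rtl` as the base direction, the three runs would be laid out starting from the right edge, reversing their visual order.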
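The same concern applies to plain strings built outside the widget tree, for example in notifications or logs. The sketch below inspects the bidi class of a few characters with `unicodedata.bidirectional` and wraps the embedded Arabic word in Unicode directional isolates (FSI and PDI) so that neighbouring punctuation is not pulled into the right-to-left run. The `isolate` helper is illustrative, not part of Amal.

```python
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(fragment: str) -> str:
    """Wrap a fragment so its direction does not affect surrounding text."""
    return f"{FSI}{fragment}{PDI}"

kitab = "\u0643\u062A\u0627\u0628"  # the Arabic word for "book"
sentence = f"means {isolate(kitab)} (book)"

# Bidi classes: Arabic letter, Latin letter, parenthesis, combining fatha
classes = [unicodedata.bidirectional(ch) for ch in ("\u0643", "m", "(", "\u064E")]
```

The isolates are invisible format characters, so they should be removed again before the string is compared or scored.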
### FAQ
**Q: Why force diacritics on beginner learners? Doesn't that make it harder?**
A: Initially, yes. But learning with diacritics builds stronger letter-sound associations, because every vowel the child hears is visible on the page, and that supports fluency later. Once reading with diacritics is automatic, reading without them is a natural progression.
**Q: What if my child's keyboard doesn't support typing diacritics?**
A: The app never asks children to type diacritics. Recognition and pronunciation are speech-based. Only adults (teachers, content creators) need to input diacritics, and they use specialized Arabic keyboards.
**Q: Does Amal support non-standard diacritical combinations?**
A: We support all Unicode-standardized combinations. Rare or custom combinations may not render correctly, but standard Quranic and modern Arabic are fully supported.
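A related detail is mark order: a shadda and a vowel on the same letter can be typed in either order and look identical on screen, yet the two strings compare as different. A minimal sketch of guarding against that with Unicode NFC normalization, which puts combining marks into canonical order; the name `normalize_marks` is an assumption.

```python
import unicodedata

def normalize_marks(text: str) -> str:
    """Put combining marks into canonical order (NFC)."""
    return unicodedata.normalize("NFC", text)

shadda_first = "\u0628\u0651\u064E"  # beh + shadda + fatha
fatha_first = "\u0628\u064E\u0651"   # beh + fatha + shadda

raw_equal = shadda_first == fatha_first  # False: code points differ in order
normalized_equal = normalize_marks(shadda_first) == normalize_marks(fatha_first)
```

Normalizing both the expected text and any typed or stored text before comparison keeps visually identical words from being scored as mismatches.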


