How Amal Handles Arabic Diacritics Correctly
5 min read · Mohammad Shaker

Amal handles Arabic diacritics with support for tashkeel, alef and hamza variants, and Lam-Alef ligatures in scoring and feedback.

Quick Answer

Amal handles the full complexity of Arabic diacritics: 8 tashkeel marks (fatha, damma, kasra, shadda, sukun, fathatan, dammatan, kasratan), 5 alef variants (standard, madda, hamza above, hamza below, wasla), 3 hamza variants (isolated, on waw, on ya), and Lam-Alef ligatures. The app's speech recognition, text rendering, and similarity scoring all treat diacritized Arabic ("كَتَبَ") differently from undiacritized Arabic ("كتب"), a critical distinction most Arabic learning apps ignore.

Why Diacritics Matter for Learning

The Ambiguity Problem

Arabic without diacritics is ambiguous:

  • "كتب" can mean:
    • "kataba" (he wrote) — past tense
    • "kutub" (books) — plural noun
    • "kutiba" (it was written) — passive voice

All are spelled identically without diacritics. Diacritics remove ambiguity.
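
The collapse is easy to see at the Unicode level. The following is a minimal sketch, not Amal's code: the helper name `strip_tashkeel` is illustrative. It removes the eight tashkeel marks from the three vocalized readings and shows that they reduce to the same three letters.

```python
import re

# The eight tashkeel marks occupy the contiguous range U+064B..U+0652.
TASHKEEL = re.compile("[\u064B-\u0652]")

def strip_tashkeel(text: str) -> str:
    """Remove tashkeel marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

# kaf = U+0643, teh = U+062A, beh = U+0628
readings = {
    "kataba": "\u0643\u064E\u062A\u064E\u0628\u064E",  # he wrote
    "kutub":  "\u0643\u064F\u062A\u064F\u0628",        # books
    "kutiba": "\u0643\u064F\u062A\u0650\u0628\u064E",  # it was written
}

stripped = {name: strip_tashkeel(word) for name, word in readings.items()}
print(len(set(stripped.values())))  # 1: all three share one undiacritized spelling
```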

The Learning Progression

  1. Beginner: Learn to read WITH diacritics (easy — vowels are marked)
  2. Intermediate: Practice WITH diacritics until automatic
  3. Advanced: Gradually remove diacritics, reading becomes harder
  4. Fluent: Read without diacritics fluently (native-level reading)

Most Arabic learning apps skip step 1: they either don't teach diacritics at all or strip them away, which teaches bad habits. Amal follows the full progression, starting with fully diacritized text.

Our Unicode-Level Implementation

The Diacritical Marks (8 total)

// lib/src/utils/arabic_extension.dart
class ArabicExtension {
  static const Map<String, String> tashkeelMarks = {
    'FATHA': '\u064E',      // َ (vowel 'a')
    'DAMMA': '\u064F',      // ُ (vowel 'u')
    'KASRA': '\u0650',      // ِ (vowel 'i')
    'SUKUN': '\u0652',      // ْ (no vowel)
    'SHADDA': '\u0651',     // ّ (doubled letter)
    'FATHATAN': '\u064B',   // ً (tanween 'an')
    'DAMMATAN': '\u064C',   // ٌ (tanween 'un')
    'KASRATAN': '\u064D',   // ٍ (tanween 'in')
  };
  
  static const Map<String, String> alefVariants = {
    'ALEF_STANDARD': 'ا',      // ا
    'ALEF_WITH_MADDA': 'آ',    // آ (elongated)
    'ALEF_WITH_HAMZA_ABOVE': 'أ', // أ
    'ALEF_WITH_HAMZA_BELOW': 'إ', // إ
    'ALEF_WASLA': 'ٱ',         // ٱ (U+0671, connecting alef)
  };
  
  static const Map<String, String> hamzaVariants = {
    'HAMZA_ISOLATED': 'ء',  // Standalone hamza
    'HAMZA_ON_WAW': 'ؤ',    // Hamza on waw (و + hamza)
    'HAMZA_ON_YEH': 'ئ',    // Hamza on yeh (ي + hamza)
  };
}
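
Lam-Alef ligatures and alef variants matter mostly when two strings are compared. As a rough sketch of how such input could be normalized first (the function names and the folding table are assumptions for illustration, not Amal's implementation), Unicode NFKC normalization expands the Lam-Alef presentation forms into plain lam + alef, and a translation table folds the alef variants onto bare alef for a coarse comparison:

```python
import unicodedata

# Alef with madda, hamza above, hamza below, and alef wasla -> bare alef
ALEF_FOLD = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0671": "\u0627",
})

def expand_ligatures(text: str) -> str:
    """Expand presentation-form ligatures such as Lam-Alef (U+FEFB) into plain letters."""
    return unicodedata.normalize("NFKC", text)

def fold_alef(text: str) -> str:
    """Map alef variants to bare alef for a coarse, hamza-insensitive comparison."""
    return expand_ligatures(text).translate(ALEF_FOLD)

print(expand_ligatures("\uFEFB") == "\u0644\u0627")  # True: ligature becomes lam + alef
```

Normalization can also reorder stacked combining marks (for example shadda plus a vowel), so in a sketch like this it should be applied to both strings before they are compared.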

Quranic Diacritics and Uthmani Stops

For Thurayya, we support Quranic-specific marks:

static const Map<String, String> quranicMarks = {
  'STOP_FULL': 'ۖ',         // Full stop (U+06D6)
  'STOP_HALF': 'ۗ',         // Half stop (U+06D7)
  'STOP_QUA': 'ۙ',          // Qua stop (U+06D9)
  'STOP_NECESSARY': 'ۚ',     // Necessary stop (U+06DA)
  'TAJWEED_ELONGATION': '۝', // U+06DD (Unicode name: ARABIC END OF AYAH)
};
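
Because these marks are hard to tell apart visually, it can help to audit a table like this against the Unicode character database. The sketch below is a hypothetical Python mirror of the Dart map. The code points are an assumption read from the glyphs in this article, and `describe_marks` is an illustrative helper. It prints the code point and official Unicode name for each entry.

```python
import unicodedata

def describe_marks(marks: dict) -> dict:
    """Map each label to (code point, official Unicode name) for auditing."""
    return {
        label: (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for label, ch in marks.items()
    }

# Hypothetical Python mirror of the Dart quranicMarks map (code points assumed)
quranic_marks = {
    "STOP_FULL": "\u06D6",
    "STOP_HALF": "\u06D7",
    "STOP_QUA": "\u06D9",
    "STOP_NECESSARY": "\u06DA",
    "TAJWEED_ELONGATION": "\u06DD",
}

for label, (code_point, name) in describe_marks(quranic_marks).items():
    print(label, code_point, name)
```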

Diacritic-Aware Speech Recognition

Context Biasing with Diacritics

When a child is learning "كَتَبَ" (he wrote, past tense), we bias speech recognition toward that exact vocalization:

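# src/services/stt_client.py
from google.cloud import speech

google_stt_client = speech.SpeechClient()

def recognize_with_diacritical_context(audio_bytes, expected_text):
    # expected_text = "كَتَبَ" (with diacritics)

    # Create speech context hint
    speech_context = speech.SpeechContext(
        phrases=[expected_text],
        boost=20.0,  # High boost for expected text
    )

    # encoding and sample_rate_hertz can be omitted for WAV/FLAC audio,
    # where they are read from the file header
    config = speech.RecognitionConfig(
        language_code='ar-SA',
        speech_contexts=[speech_context],
    )
    audio = speech.RecognitionAudio(content=audio_bytes)

    # Send to Google Cloud STT
    response = google_stt_client.recognize(config=config, audio=audio)

    # Result: Google STT is biased toward "kataba" pronunciation
    return response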
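
The response holds a list of results, each with ranked alternatives. A minimal sketch of pulling out the top transcript follows. `best_transcript` and the stand-in `fake_response` are illustrative names rather than part of Amal's code, and the attribute layout (`results`, `alternatives`, `transcript`) is assumed to match the google-cloud-speech response type.

```python
from types import SimpleNamespace

def best_transcript(response) -> str:
    """Join the top-ranked alternative of each result into one transcript."""
    parts = []
    for result in response.results:
        if result.alternatives:
            parts.append(result.alternatives[0].transcript.strip())
    return " ".join(parts)

# Stand-in object with the same attribute layout as a recognize() response,
# used here so the sketch runs without network access or credentials.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="\u0643\u062A\u0628", confidence=0.93),
    ]),
])
print(best_transcript(fake_response))
```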

Diacritic-Aware Similarity Scoring

Similarity scoring distinguishes diacritized from undiacritized:

def compare_pronunciations(expected, actual):
    """
    expected: "كَتَبَ" (with diacritics)
    actual: "كتب" (child's attempt, possibly undiacritized)
    """

    # Strip diacritics for coarse comparison
    expected_base = strip_diacritics(expected)  # "كتب"
    actual_base = strip_diacritics(actual)      # "كتب"

    # Base similarity (ignoring diacritics)
    base_similarity = string_similarity(expected_base, actual_base)  # 1.0 (perfect)

    # Diacritical bonus (if child's attempt includes diacritics)
    diacritic_bonus = 0.0
    if has_diacritics(actual):
        diacritic_accuracy = diacritics_match_ratio(expected, actual)
        diacritic_bonus = diacritic_accuracy * 0.15  # Up to +15% for correct diacritics

    # Final score: the base match contributes at most 85%, so the
    # diacritic bonus is what lifts a correct attempt toward 100%
    final_score = min(base_similarity * 0.85 + diacritic_bonus, 1.0)

    # Feedback depends on what is still missing
    if base_similarity < 1.0:
        feedback = 'Good try! Listen again and repeat the word.'
    elif diacritic_bonus < 0.15:
        feedback = 'Great! Pronunciation is perfect. Next, practice the diacritical marks.'
    else:
        feedback = 'Excellent! Letters and diacritical marks are both correct.'

    return {
        'base_score': base_similarity,
        'diacritic_bonus': diacritic_bonus,
        'final_score': final_score,
        'feedback': feedback
    }

This means:

  • Child says "كتب" (undiacritized) → 85-90% score (correct base, missing diacritics)
  • Child says "كَتَبَ" (fully diacritized) → 98%+ score (perfect)
  • Progression is clear: first master base pronunciation, then add diacritical subtlety
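
The helper functions called by `compare_pronunciations` are not shown in this article. The following is one possible sketch of them under stated assumptions: `difflib.SequenceMatcher` stands in for whatever similarity metric Amal actually uses, and `diacritics_match_ratio` compares marks letter by letter at the same position, which is a simplification.

```python
import re
from difflib import SequenceMatcher

TASHKEEL = re.compile("[\u064B-\u0652]")  # the eight tashkeel marks

def strip_diacritics(text):
    return TASHKEEL.sub("", text)

def has_diacritics(text):
    return TASHKEEL.search(text) is not None

def string_similarity(a, b):
    """Similarity in [0, 1]; difflib is a stand-in for the real metric."""
    return SequenceMatcher(None, a, b).ratio()

def _letters_with_marks(text):
    """Pair each base letter with the set of tashkeel marks that follow it."""
    pairs = []
    for ch in text:
        if TASHKEEL.match(ch):
            if pairs:  # a mark with no preceding letter is ignored
                pairs[-1][1].add(ch)
        else:
            pairs.append((ch, set()))
    return pairs

def diacritics_match_ratio(expected, actual):
    """Fraction of vocalized letters in `expected` whose marks `actual` reproduces."""
    exp, act = _letters_with_marks(expected), _letters_with_marks(actual)
    marked = [(i, pair) for i, pair in enumerate(exp) if pair[1]]
    if not marked:
        return 1.0  # nothing to match
    hits = sum(1 for i, pair in marked if i < len(act) and act[i] == pair)
    return hits / len(marked)
```

With these definitions, comparing "kataba" against "kutiba" matches only the final fatha, so one of the three vocalized letters agrees.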

RTL Rendering Challenges

Text Direction Management

// lib/src/screens/lesson_screen.dart
Column(
  children: [
    Directionality(
      textDirection: TextDirection.rtl,  // For Arabic text
      child: Text(
        'كَتَبَ',
        textAlign: TextAlign.right,      // Right-aligned for RTL
        style: TextStyle(
          fontFamily: 'IBMPlexSansArabic',
          fontSize: 36,
          height: 1.8,  // Extra line height for diacritics
        ),
      ),
    ),
    // English instructions below
    Directionality(
      textDirection: TextDirection.ltr,  // For English
      child: Text(
        'Pronounce: "he wrote"',
        textAlign: TextAlign.left,       // Left-aligned for LTR
      ),
    ),
  ],
)

Connected Letter Shaping

Arabic letters change form depending on position:

  • Isolated: "ك" (Kaf)
  • Initial: "كَـــ" (Kaf at start of word)
  • Medial: "ـــكَـــ" (Kaf in middle)
  • Final: "ـــكَ" (Kaf at end)

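The IBMPlexSansArabic font and the text shaping engine choose these contextual forms automatically from plain Unicode sequences. The kashida (tatweel, U+0640) only stretches the connection between letters visually; it is not needed for joining, and inserting it changes the underlying string:

// Correct: plain concatenation; contextual forms are selected at render time
String word = 'ك' + 'ت' + 'ب';  // Renders as كتب

// Avoid: kashida (extension character) between letters
String stretched = 'ك' + '\u0640' + 'ت' + '\u0640' + 'ب';  // Renders elongated and no longer equals 'كتب'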
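
Because the kashida (U+0640) is a real character in the string, a word that contains it no longer compares equal to the plain spelling. A minimal sketch of stripping it before comparison (the helper name `strip_tatweel` is illustrative):

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

def strip_tatweel(text: str) -> str:
    """Remove tatweel so stretched and plain spellings compare equal."""
    return text.replace(TATWEEL, "")

plain = "\u0643\u062A\u0628"                  # kaf, teh, beh
stretched = "\u0643\u0640\u062A\u0640\u0628"  # same letters with tatweel between them

print(plain == stretched)                 # False
print(plain == strip_tatweel(stretched))  # True
```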

Bidirectional Text Mixing

When English and Arabic appear together:

RichText(
  textDirection: TextDirection.rtl,  // Overall RTL
  text: TextSpan(
    children: [
      TextSpan(text: 'means ', style: englishStyle),  // LTR
      TextSpan(text: 'كتاب', style: arabicStyle),    // RTL
      TextSpan(text: ' (book)', style: englishStyle), // LTR
    ],
  ),
)

Result: "means كتاب (book)" displayed with correct bidirectional flow.
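
Setting the overall direction explicitly matters because, by default, the Unicode Bidirectional Algorithm takes the paragraph direction from the first strongly directional character. The sketch below illustrates that rule in isolation (`first_strong_direction` is an illustrative helper, not Flutter's or Amal's API): a string that opens with Latin text would default to LTR even when it contains Arabic.

```python
import unicodedata

def first_strong_direction(text: str):
    """Return 'rtl' or 'ltr' from the first strongly directional character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return None  # only digits, punctuation, spaces, or combining marks

mixed = "means \u0643\u062A\u0627\u0628 (book)"
print(first_strong_direction(mixed))  # ltr
```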

FAQ

Q: Why force diacritics on beginner learners? Doesn't that make it harder?
A: Initially, yes. But learning with diacritics creates stronger letter-sound associations. Research shows diacritical learning produces faster fluency. After mastery with diacritics, reading without them is a natural progression.

Q: What if my child's keyboard doesn't support typing diacritics?
A: The app never asks children to type diacritics. Recognition and pronunciation are speech-based. Only adults (teachers, content creators) need to input diacritics, and they use specialized Arabic keyboards.

Q: Does Amal support non-standard diacritical combinations?
A: We support all Unicode-standardized combinations. Rare or custom combinations may not render correctly, but standard Quranic and modern Arabic are fully supported.

See our Arabic alphabet learning page and how Amal works for early readers.
