burmddit/TRANSLATION-FIX.md
Zeya Phyo f51ac4afa4 Add web admin features + fix scraper & translator
Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
2026-02-26 09:17:50 +00:00

Translation Fix - Article 50

Date: 2026-02-26
Issue: Incomplete/truncated Burmese translation
Status: 🔧 FIXING NOW


🔍 Problem Identified

Article: https://burmddit.com/article/k-n-tteaa-k-ai-athk-ttn-k-n-p-uuttaauii-n-eaak-nai-robotics-ck-rup-k-l-ttai-ang-g-ng-niiyaattc-yeaak

Symptoms:

  • English content: 51,244 characters
  • Burmese translation: 3,400 characters (only 6.6% of the English length!)
  • Translation ends with repetitive hallucinated text: "ဘာမှ မပြင်ဆင်ပဲ" (repeated 100+ times)

🐛 Root Cause

The old translator (translator.py) had several issues:

  1. Chunk size too large (2000 chars)

    • Combined with prompt overhead, each chunk exceeded Claude's token limits
    • Caused translations to truncate mid-way
  2. No hallucination detection

    • When Claude hit limits, it started repeating text
    • No validation to catch this
  3. No length validation

    • Didn't check whether the translated text was a reasonable length
    • Accepted broken translations
  4. Poor error recovery

    • Once a chunk failed, the rest of the article wasn't translated

Solution Implemented

Created translator_v2.py with major improvements:

1. Smarter Chunking

# OLD: 2000 char chunks (too large)
chunk_size = 2000

# NEW: 1200 char chunks (safer)
chunk_size = 1200

# BONUS: handles long paragraphs better
# - Splits by paragraphs first
# - If a paragraph exceeds chunk_size, splits it by sentences
# - Ensures clean breaks
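
The paragraph-first, sentence-second splitting described above could look roughly like the sketch below. This is not the code in translator_v2.py: the function name split_into_chunks, the sentence-boundary regex, and the greedy packing step are all assumptions made for illustration.

```python
import re

# Assumed sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text: str, chunk_size: int = 1200) -> list[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    # Each piece remembers the separator that should precede it when packed.
    pieces: list[tuple[str, str]] = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= chunk_size:
            pieces.append(("\n\n", paragraph))
            continue
        # Paragraph is too long: fall back to sentence-level pieces.
        sentences = [s for s in SENTENCE_END.split(paragraph) if s]
        for i, sentence in enumerate(sentences):
            pieces.append(("\n\n" if i == 0 else " ", sentence))

    # Greedily pack pieces into chunks without exceeding chunk_size.
    chunks: list[str] = []
    current = ""
    for separator, piece in pieces:
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than chunk_size is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Breaking only at paragraph or sentence boundaries means each chunk is text the model can translate as a complete unit, at the cost of slightly uneven chunk sizes.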

2. Repetition Detection

def detect_repetition(text, threshold=5):
    """Looks for 5-word sequences repeated 3+ times."""
    # If found → RETRY with lower temperature
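
A minimal sketch of the n-gram check described above, assuming words are separated by whitespace. The parameter names ngram_size and max_repeats are illustrative and do not match the real signature, which only exposes threshold. Burmese is often written without spaces between words, so a production check may need character-level n-grams instead; that is not covered here.

```python
from collections import Counter


def detect_repetition(text: str, ngram_size: int = 5, max_repeats: int = 3) -> bool:
    """Return True if any ngram_size-word sequence occurs max_repeats or more times."""
    words = text.split()
    if len(words) < ngram_size:
        return False
    # Count every sliding window of ngram_size consecutive words.
    ngrams = (
        tuple(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    )
    counts = Counter(ngrams)
    return any(count >= max_repeats for count in counts.values())
```

With these defaults, the looping phrase from article 50 repeated back to back is flagged, while ordinary prose that does not reuse a five-word sequence three times is not.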

3. Translation Validation

def validate_translation(translated, original):
    """Checks applied to each translated chunk."""
    # Check not empty (>50 chars)
    # Check has Burmese Unicode
    # Check length ratio (0.3 - 3.0 of original)
    # Check no repetition/loops
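
The four checks above might be combined as in this sketch. The thresholds come from the list; the (ok, reason) return shape, the helper has_repetition, and the use of the Myanmar Unicode block (U+1000 to U+109F) to detect Burmese text are assumptions, not the actual translator_v2.py code.

```python
import re
from collections import Counter

# Myanmar Unicode block; assumed to be what "has Burmese Unicode" means.
BURMESE_CHAR = re.compile(r"[\u1000-\u109F]")


def has_repetition(text: str, ngram_size: int = 5, max_repeats: int = 3) -> bool:
    """Hypothetical helper: True if a word n-gram repeats max_repeats+ times."""
    words = text.split()
    windows = (
        tuple(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    )
    return any(n >= max_repeats for n in Counter(windows).values())


def validate_translation(translated: str, original: str) -> tuple[bool, str]:
    """Return (ok, reason) for a translated chunk."""
    text = translated.strip()
    if len(text) <= 50:
        return False, "too short"
    if not BURMESE_CHAR.search(text):
        return False, "no Burmese characters"
    ratio = len(text) / max(len(original), 1)
    if not 0.3 <= ratio <= 3.0:
        return False, f"length ratio {ratio:.2f} outside 0.3-3.0"
    if has_repetition(text):
        return False, "repetition detected"
    return True, "ok"
```

Applied to article 50 as a whole, 3,400 Burmese characters against 51,244 English characters gives a ratio of about 0.07, so the length check alone would have rejected the broken translation.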

4. Better Prompting

# Added explicit anti-repetition instruction:
"🚫 CRITICAL: DO NOT REPEAT TEXT OR GET STUCK IN LOOPS!
- If you start repeating, STOP immediately
- Translate fully but concisely
- Each sentence should be unique"

5. Retry Logic

# If translation has repetition:
# 1. Detect repetition
# 2. Retry with temperature=0.3 (lower, more focused)
# 3. If it still fails, log a warning and use the fallback
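
The retry flow can be sketched as below. The callables translate_fn and is_valid are hypothetical stand-ins for the real API call and the validation step, the starting temperature of 0.5 is inferred from the comparison later in this document, and the fallback value is left to the caller because this document does not say what the real fallback is.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger("translator_sketch")


def translate_with_retry(
    chunk: str,
    translate_fn: Callable[[str, float], str],  # hypothetical: (text, temperature) -> translation
    is_valid: Callable[[str], bool],            # hypothetical: validation / repetition check
    temperatures: Sequence[float] = (0.5, 0.3),
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Try each temperature in order; return the first result that validates."""
    for attempt, temperature in enumerate(temperatures, start=1):
        result = translate_fn(chunk, temperature)
        if is_valid(result):
            return result
        logger.warning(
            "attempt %d at temperature %.1f failed validation", attempt, temperature
        )
    # Every attempt failed: hand back whatever fallback the caller chose.
    return fallback
```

Passing the temperatures as a sequence keeps the retry policy in one place, so adding a third, lower-temperature attempt is a one-line change.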

📊 Current Status

Re-translating article 50 now with the improved translator:

  • Article length: 51,244 chars
  • Expected chunks: ~43 chunks (at 1200 chars each)
  • Estimated time: ~8-10 minutes
  • Progress: Running...
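
The chunk estimate above is a ceiling division, and it implies a rough per-chunk time budget. The helper name estimate_chunks is illustrative. Because the real splitter breaks at paragraph and sentence boundaries, the actual chunk count may be somewhat higher than this lower bound.

```python
import math


def estimate_chunks(article_chars: int, chunk_size: int = 1200) -> int:
    """Lower bound on the number of chunks for an article of the given length."""
    return math.ceil(article_chars / chunk_size)


chunks = estimate_chunks(51_244)
print(chunks)  # 43

# Per-chunk time implied by the 8-10 minute estimate.
low = 8 * 60 / chunks
high = 10 * 60 / chunks
print(f"{low:.0f}-{high:.0f} seconds per chunk")  # 11-14 seconds per chunk
```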

🎯 Expected Results

After fix:

  • Full translation (~25,000-35,000 Burmese chars, ~50-70% of English)
  • No repetition or loops
  • Clean, readable Burmese text
  • Proper formatting preserved

🚀 Deployment

Pipeline updated:

# run_pipeline.py now uses:
from translator_v2 import run_translator  # ✅ Improved version

Backups:

  • translator_old.py - original version (backup)
  • translator_v2.py - improved version (active)

All future articles will use the improved translator automatically.


🔄 Manual Fix Script

Created fix_article_50.py to re-translate the broken article:

cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 fix_article_50.py 50

What it does:

  1. Fetches article from database
  2. Re-translates with translator_v2
  3. Validates translation quality
  4. Updates database only if validation passes
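
The four steps above amount to a validate-before-write flow, sketched here with injected callables. None of the names (fix_article, fetch, translate, validate, save) come from fix_article_50.py; database access and the translator call are deliberately left abstract because this document does not describe them.

```python
from typing import Callable


def fix_article(
    article_id: int,
    fetch: Callable[[int], str],            # hypothetical: load the English content
    translate: Callable[[str], str],        # hypothetical: run the improved translator
    validate: Callable[[str, str], bool],   # hypothetical: (translated, original) -> ok
    save: Callable[[int, str], None],       # hypothetical: write the translation back
) -> bool:
    """Re-translate one article and save only if the result validates."""
    original = fetch(article_id)
    translated = translate(original)
    if not validate(translated, original):
        # Leave the stored translation untouched when validation fails.
        return False
    save(article_id, translated)
    return True
```

Saving only after validation means a failed re-translation leaves the stored text unchanged, so the script can be re-run safely.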

📋 Next Steps

  1. Wait for article 50 re-translation to complete (~10 min)
  2. Verify on website that translation is fixed
  3. Check tomorrow's automated pipeline run (1 AM UTC)
  4. 🔄 If other articles show similar issues, run the fix script for them too

🎓 Lessons Learned

  1. Always validate LLM output

    • Check for hallucinations/loops
    • Validate length ratios
    • Test edge cases (very long content)
  2. Conservative chunking

    • Smaller chunks = safer
    • Better to have more API calls than broken output
  3. Explicit anti-repetition prompts

    • LLMs need clear instructions not to loop
    • Lower temperature helps prevent hallucinations
  4. Retry with different parameters

    • If first attempt fails, try again with adjusted settings
    • Temperature 0.3 is more focused than 0.5

📈 Impact

Before fix:

  • 1/87 articles with broken translation (1.15%)
  • Very long articles at risk

After fix:

  • All future articles protected
  • Automatic validation and retry
  • Better handling of edge cases

Last updated: 2026-02-26 09:05 UTC
Next check: After article 50 re-translation completes