Zeya Phyo f51ac4afa4 Add web admin features + fix scraper & translator

Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources

2026-02-26 09:17:50 +00:00

Translation Fix - Article 50

Date: 2026-02-26
Issue: Incomplete/truncated Burmese translation
Status: 🔧 FIXING NOW

🔍 Problem Identified

Article: https://burmddit.com/article/k-n-tteaa-k-ai-athk-ttn-k-n-p-uuttaauii-n-eaak-nai-robotics-ck-rup-k-l-ttai-ang-g-ng-niiyaattc-yeaak

Symptoms:

English content: 51,244 characters
Burmese translation: 3,400 characters (only 6.6% of the English length!)
Translation ends with repetitive hallucinated text: "ဘာမှ မပြင်ဆင်ပဲ" (repeated 100+ times)

🐛 Root Cause

The old translator (translator.py) had several issues:

1. Chunk size too large (2000 chars)
   - Combined with prompt overhead, chunks exceeded Claude's token limits
   - Translations were truncated midway
2. No hallucination detection
   - When Claude hit the limit, it started repeating text
   - Nothing validated the output to catch this
3. No length validation
   - Never checked whether the translated text was a reasonable length
   - Accepted broken translations
4. Poor error recovery
   - Once a chunk failed, the rest of the article wasn't translated

✅ Solution Implemented

Created translator_v2.py with major improvements:

1. Smarter Chunking

# OLD: 2000 char chunks (too large)
chunk_size = 2000

# NEW: 1200 char chunks (safer)
chunk_size = 1200

# BONUS: Handles long paragraphs better
# - Splits by paragraphs first
# - If paragraph > chunk_size, splits by sentences
# - Ensures clean breaks
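
A minimal sketch of the paragraph-first, sentence-second splitting described above. It assumes paragraphs are separated by blank lines and that sentences end in `.`, `!` or `?`. The name `split_into_chunks` and both regexes are illustrative and not taken from translator_v2.py.

```python
import re


def split_into_chunks(text, chunk_size=1200):
    """Pack paragraphs (or sentences of oversized paragraphs) into chunks."""
    # Build (separator, piece) units: whole paragraphs where they fit,
    # individual sentences where a paragraph is longer than chunk_size.
    units = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= chunk_size:
            units.append(("\n\n", paragraph))
        else:
            sentences = re.split(r"(?<=[.!?])\s+", paragraph)
            for i, sentence in enumerate(sentences):
                units.append(("\n\n" if i == 0 else " ", sentence))

    # Greedily pack units so each chunk stays within chunk_size.
    chunks, current = [], ""
    for separator, piece in units:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than chunk_size is kept whole here;
            # a stricter splitter would need a further fallback.
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Chunks only break at paragraph or sentence boundaries, so no sentence is cut in half before it reaches the model.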

2. Repetition Detection

def detect_repetition(text, threshold=5):
    # Looks for 5-word sequences repeated 3+ times
    # If found → RETRY with lower temperature
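
One way the check above could be written, as a sketch rather than the actual translator_v2.py code. It assumes `threshold` is the sequence length in whitespace-separated words, and the `min_repeats` parameter is a hypothetical addition for the "3+ times" rule.

```python
from collections import Counter


def detect_repetition(text, threshold=5, min_repeats=3):
    """Return True if any `threshold`-word sequence occurs `min_repeats`+ times."""
    words = text.split()
    if len(words) < threshold:
        return False
    # Count every sliding window of `threshold` consecutive words.
    ngrams = Counter(
        tuple(words[i:i + threshold])
        for i in range(len(words) - threshold + 1)
    )
    return any(count >= min_repeats for count in ngrams.values())
```

Splitting on whitespace is enough to catch the article 50 failure, because the looped phrase "ဘာမှ မပြင်ဆင်ပဲ" contains a space. Burmese text written without spaces would need a different tokenization.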

3. Translation Validation

def validate_translation(translated, original):
    ✓ Check not empty (>50 chars)
    ✓ Check has Burmese Unicode
    ✓ Check length ratio (0.3 - 3.0 of original)
    ✓ Check no repetition/loops
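
The four checks above could be sketched as follows. This is an assumption-laden illustration: the `(ok, reason)` return shape, the helper `has_repetition`, and the use of the Myanmar Unicode block (U+1000 to U+109F) to detect Burmese text are choices made for the example, not details confirmed by translator_v2.py.

```python
import re
from collections import Counter

# Myanmar Unicode block; assumed to be what "has Burmese Unicode" means.
BURMESE_CHARS = re.compile(r"[\u1000-\u109F]")


def has_repetition(text, n=5, min_repeats=3):
    """Hypothetical helper: any n-word sequence repeated min_repeats+ times."""
    words = text.split()
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(c >= min_repeats for c in counts.values())


def validate_translation(translated, original):
    """Return (ok, reason) after running the four checks in order."""
    if len(translated.strip()) <= 50:
        return False, "too short"
    if not BURMESE_CHARS.search(translated):
        return False, "no Burmese characters"
    ratio = len(translated) / max(len(original), 1)
    if not 0.3 <= ratio <= 3.0:
        return False, f"length ratio {ratio:.2f} outside 0.3-3.0"
    if has_repetition(translated):
        return False, "repetition detected"
    return True, "ok"
```

With these thresholds the original article 50 output (3,400 characters against 51,244) would be rejected on length ratio alone.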

4. Better Prompting

# Added explicit anti-repetition instruction:
"🚫 CRITICAL: DO NOT REPEAT TEXT OR GET STUCK IN LOOPS!
- If you start repeating, STOP immediately
- Translate fully but concisely
- Each sentence should be unique"

5. Retry Logic

# If translation has repetition:
1. Detect repetition
2. Retry with temperature=0.3 (lower, more focused)
3. If still fails, log warning and use fallback
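
The retry steps above can be sketched like this. The callables `translate` and `is_valid` are hypothetical stand-ins for the real Claude call and validation function, and the starting temperature of 0.5 is an assumption. The fallback itself is not specified here, so the sketch only logs a warning and hands the last attempt back to the caller.

```python
import logging

logger = logging.getLogger("translator_sketch")


def translate_with_retry(chunk, translate, is_valid, temperatures=(0.5, 0.3)):
    """Try each temperature in order; return (text, ok)."""
    last_attempt = ""
    for temperature in temperatures:
        last_attempt = translate(chunk, temperature=temperature)
        if is_valid(last_attempt, chunk):
            return last_attempt, True
        logger.warning("Validation failed at temperature=%s", temperature)
    # Every attempt failed: the caller decides which fallback to apply.
    return last_attempt, False
```

Returning a flag instead of raising lets the pipeline continue with the remaining chunks, which addresses the "poor error recovery" problem in the old translator.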

📊 Current Status

Re-translating article 50 now with the improved translator:

Article length: 51,244 chars
Expected chunks: ~43 chunks (at 1200 chars each)
Estimated time: ~8-10 minutes
Progress: Running...
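
For reference, the estimate above works out as follows. This is plain arithmetic on the numbers already given; the real chunk count may be slightly higher because chunks break at paragraph and sentence boundaries rather than exactly every 1200 characters.

```python
import math

article_chars = 51_244
chunk_size = 1_200

# Lower bound on the number of chunks if every chunk were completely full.
expected_chunks = math.ceil(article_chars / chunk_size)  # 43

# Per-chunk time implied by the 8-10 minute estimate.
seconds_low = 8 * 60 / expected_chunks
seconds_high = 10 * 60 / expected_chunks

print(f"{expected_chunks} chunks, {seconds_low:.1f}-{seconds_high:.1f} s per chunk")
```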

🎯 Expected Results

After fix:

Full translation (~25,000-35,000 Burmese chars, ~50-70% of English)
No repetition or loops
Clean, readable Burmese text
Proper formatting preserved

🚀 Deployment

Pipeline updated:

# run_pipeline.py now uses:
from translator_v2 import run_translator  # ✅ Improved version

Files:

translator_old.py - original version, kept as a backup
translator_v2.py - improved version (active)

All future articles will use the improved translator automatically.

🔄 Manual Fix Script

Created fix_article_50.py to re-translate the broken article:

cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 fix_article_50.py 50

What it does:

Fetches article from database
Re-translates with translator_v2
Validates translation quality
Updates database only if validation passes
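
The flow of the script can be sketched as below. This is not the contents of fix_article_50.py: the function `fix_article`, its four injected callables, and the `"content"` field name are all hypothetical, chosen so the example runs without a database.

```python
def fix_article(article_id, fetch, translate, validate, save):
    """Re-translate one article and save only if validation passes."""
    article = fetch(article_id)                # 1. fetch from the database
    original = article["content"]              # assumed field name
    translated = translate(original)           # 2. re-translate
    if not validate(translated, original):     # 3. validate quality
        return False                           # leave the stored row untouched
    save(article_id, translated)               # 4. update only on success
    return True
```

Keeping the save step behind the validation check means a failed re-translation can never overwrite the existing text with something worse.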

📋 Next Steps

✅ Wait for article 50 re-translation to complete (~10 min)
✅ Verify on website that translation is fixed
✅ Check tomorrow's automated pipeline run (1 AM UTC)
🔄 If other articles have similar issues, run the fix script for them too

🎓 Lessons Learned

1. Always validate LLM output
   - Check for hallucinations/loops
   - Validate length ratios
   - Test edge cases (very long content)
2. Conservative chunking
   - Smaller chunks = safer
   - Better to have more API calls than broken output
3. Explicit anti-repetition prompts
   - LLMs need clear instructions not to loop
   - Lower temperature helps prevent hallucinations
4. Retry with different parameters
   - If first attempt fails, try again with adjusted settings
   - Temperature 0.3 is more focused than 0.5

📈 Impact

Before fix:

1/87 articles with broken translation (1.15%)
Very long articles at risk