forked from minzeyaphyo/burmddit
Frontend changes: - Add /admin dashboard for article management - Add AdminButton component (Alt+Shift+A on articles) - Add /api/admin/article API endpoints Backend improvements: - scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability) - translator_v2.py: Better chunking, repetition detection, validation - admin_tools.py: CLI admin commands - test_scraper.py: Individual source testing Docs: - WEB-ADMIN-GUIDE.md: Web admin usage - ADMIN-GUIDE.md: CLI admin usage - SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details - TRANSLATION-FIX.md: Translation improvements - ADMIN-FEATURES-SUMMARY.md: Implementation summary Fixes: - Article scraping from 0 → 96+ articles working - Translation quality issues (repetition, truncation) - Added 13 new RSS sources
192 lines
4.6 KiB
Markdown
192 lines
4.6 KiB
Markdown
# Translation Fix - Article 50
|
|
|
|
**Date:** 2026-02-26
|
|
**Issue:** Incomplete/truncated Burmese translation
|
|
**Status:** 🔧 FIXING NOW
|
|
|
|
---
|
|
|
|
## 🔍 Problem Identified
|
|
|
|
**Article:** https://burmddit.com/article/k-n-tteaa-k-ai-athk-ttn-k-n-p-uuttaauii-n-eaak-nai-robotics-ck-rup-k-l-ttai-ang-g-ng-niiyaattc-yeaak
|
|
|
|
**Symptoms:**
|
|
- English content: 51,244 characters
|
|
- Burmese translation: 3,400 characters (**only 6.6%** translated!)
|
|
- Translation ends with repetitive hallucinated text: "ဘာမှ မပြင်ဆင်ပဲ" (repeated 100+ times)
|
|
|
|
---
|
|
|
|
## 🐛 Root Cause
|
|
|
|
**The old translator (`translator.py`) had several issues:**
|
|
|
|
1. **Chunk size too large** (2000 chars)
|
|
- Combined with prompt overhead, exceeded Claude token limits
|
|
- Caused translations to truncate mid-way
|
|
|
|
2. **No hallucination detection**
|
|
- When Claude hit limits, it started repeating text
|
|
- No validation to catch this
|
|
|
|
3. **No length validation**
|
|
- Didn't check if translated text was reasonable length
|
|
- Accepted broken translations
|
|
|
|
4. **Poor error recovery**
|
|
- Once a chunk failed, rest of article wasn't translated
|
|
|
|
---
|
|
|
|
## ✅ Solution Implemented
|
|
|
|
Created **`translator_v2.py`** with major improvements:
|
|
|
|
### 1. Smarter Chunking
|
|
```python
|
|
# OLD: 2000 char chunks (too large)
|
|
chunk_size = 2000
|
|
|
|
# NEW: 1200 char chunks (safer)
|
|
chunk_size = 1200
|
|
|
|
# BONUS: Handles long paragraphs better
|
|
- Splits by paragraphs first
|
|
- If paragraph > chunk_size, splits by sentences
|
|
- Ensures clean breaks
|
|
```
|
|
|
|
### 2. Repetition Detection
|
|
```python
|
|
def detect_repetition(text, threshold=5):
|
|
# Looks for 5-word sequences repeated 3+ times
|
|
# If found → RETRY with lower temperature
|
|
```
|
|
|
|
### 3. Translation Validation
|
|
```python
|
|
def validate_translation(translated, original):
|
|
✓ Check not empty (>50 chars)
|
|
✓ Check has Burmese Unicode
|
|
✓ Check length ratio (0.3 - 3.0 of original)
|
|
✓ Check no repetition/loops
|
|
```
|
|
|
|
### 4. Better Prompting
|
|
```python
|
|
# Added explicit anti-repetition instruction:
|
|
"🚫 CRITICAL: DO NOT REPEAT TEXT OR GET STUCK IN LOOPS!
|
|
- If you start repeating, STOP immediately
|
|
- Translate fully but concisely
|
|
- Each sentence should be unique"
|
|
```
|
|
|
|
### 5. Retry Logic
|
|
```python
|
|
# If translation has repetition:
|
|
1. Detect repetition
|
|
2. Retry with temperature=0.3 (lower, more focused)
|
|
3. If still fails, log warning and use fallback
|
|
```
|
|
|
|
---
|
|
|
|
## 📊 Current Status
|
|
|
|
**Re-translating article 50 now with improved translator:**
|
|
- Article length: 51,244 chars
|
|
- Expected chunks: ~43 chunks (at 1200 chars each)
|
|
- Estimated time: ~8-10 minutes
|
|
- Progress: Running...
|
|
|
|
---
|
|
|
|
## 🎯 Expected Results
|
|
|
|
**After fix:**
|
|
- Full translation (~25,000-35,000 Burmese chars, ~50-70% of English)
|
|
- No repetition or loops
|
|
- Clean, readable Burmese text
|
|
- Proper formatting preserved
|
|
|
|
---
|
|
|
|
## 🚀 Deployment
|
|
|
|
**Pipeline updated:**
|
|
```python
|
|
# run_pipeline.py now uses:
|
|
from translator_v2 import run_translator # ✅ Improved version
|
|
```
|
|
|
|
**Backups:**
|
|
- `translator_old.py` - original version (backup)
|
|
- `translator_v2.py` - improved version (active)
|
|
|
|
**All future articles will use the improved translator automatically.**
|
|
|
|
---
|
|
|
|
## 🔄 Manual Fix Script
|
|
|
|
Created `fix_article_50.py` to re-translate broken article:
|
|
|
|
```bash
|
|
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
|
|
python3 fix_article_50.py 50
|
|
```
|
|
|
|
**What it does:**
|
|
1. Fetches article from database
|
|
2. Re-translates with `translator_v2`
|
|
3. Validates translation quality
|
|
4. Updates database only if validation passes
|
|
|
|
---
|
|
|
|
## 📋 Next Steps
|
|
|
|
1. ✅ Wait for article 50 re-translation to complete (~10 min)
|
|
2. ✅ Verify on website that translation is fixed
|
|
3. ✅ Check tomorrow's automated pipeline run (1 AM UTC)
|
|
4. 🔄 If other articles have similar issues, can run fix script for them too
|
|
|
|
---
|
|
|
|
## 🎓 Lessons Learned
|
|
|
|
1. **Always validate LLM output**
|
|
- Check for hallucinations/loops
|
|
- Validate length ratios
|
|
- Test edge cases (very long content)
|
|
|
|
2. **Conservative chunking**
|
|
- Smaller chunks = safer
|
|
- Better to have more API calls than broken output
|
|
|
|
3. **Explicit anti-repetition prompts**
|
|
- LLMs need clear instructions not to loop
|
|
- Lower temperature helps prevent hallucinations
|
|
|
|
4. **Retry with different parameters**
|
|
- If first attempt fails, try again with adjusted settings
|
|
- Temperature 0.3 is more focused than 0.5
|
|
|
|
---
|
|
|
|
## 📈 Impact
|
|
|
|
**Before fix:**
|
|
- 1/87 articles with broken translation (1.15%)
|
|
- Very long articles at risk
|
|
|
|
**After fix:**
|
|
- All future articles protected
|
|
- Automatic validation and retry
|
|
- Better handling of edge cases
|
|
|
|
---
|
|
|
|
**Last updated:** 2026-02-26 09:05 UTC
|
|
**Next check:** After article 50 re-translation completes
|