Add web admin features + fix scraper & translator

Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping: from 0 working articles to 96+
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
Author: Zeya Phyo
Date: 2026-02-26 09:17:50 +00:00
parent 8bf5f342cd
commit f51ac4afa4
20 changed files with 4769 additions and 23 deletions

NEXT-STEPS.md (new file, 248 lines)

@@ -0,0 +1,248 @@
# 🚀 Burmddit: Next Steps (START HERE)
**Created:** 2026-02-26
**Priority:** 🔥 CRITICAL
**Status:** Action Required
---
## 🎯 The Problem
**burmddit.com is broken:**
- ❌ 0 articles scraped in the last 5 days
- ❌ Stuck at 87 articles (last update: Feb 21)
- ❌ All 8 news sources failing
- ❌ Pipeline runs daily but produces nothing
**Root cause:** `newspaper3k` library failures + scraping errors
---
## ✅ What I've Done (Last 30 minutes)
### 1. Research & Analysis
- ✅ Identified all scraper errors from logs
- ✅ Researched 100+ AI news RSS feeds
- ✅ Found 22 high-quality new sources to add
### 2. Planning Documents Created
- ✅ `SCRAPER-IMPROVEMENT-PLAN.md` - Detailed technical plan
- ✅ `BURMDDIT-TASKS.md` - Day-by-day task tracker
- ✅ `NEXT-STEPS.md` - This file (action plan)
### 3. Monitoring Scripts Created
- ✅ `scripts/check-pipeline-health.sh` - Quick health check
- ✅ `scripts/source-stats.py` - Source performance stats
- ✅ Updated `HEARTBEAT.md` - Auto-monitoring every 2 hours
---
## 🔥 What Needs to Happen Next (Priority Order)
### TODAY (Next 4 hours)
**1. Install dependencies** (5 min)
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean
```
**2. Create improved scraper** (2 hours)
- File: `backend/scraper_v2.py`
- Features:
- Multi-method extraction (newspaper → trafilatura → beautifulsoup)
- User agent rotation
- Better error handling
- Retry logic with exponential backoff
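The fallback chain and retry behavior listed above can be sketched as follows. This is a minimal illustration, not the actual `scraper_v2.py`: the extractor callables stand in for the real newspaper3k/trafilatura/BeautifulSoup wrappers, and the user-agent strings are placeholders (the real pool would come from `fake-useragent` or full browser UA strings):

```python
import random
import time

# Placeholder pool for user-agent rotation (assumption: real list would
# hold complete browser UA strings or come from fake-useragent).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def extract_with_fallback(html, extractors, min_length=200):
    """Try each (name, extractor) pair in order; accept the first
    result that looks like a real article body."""
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # this method failed; fall through to the next
        if text and len(text) >= min_length:
            return name, text
    return None, ""

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Retry a fetch with exponential backoff plus random jitter."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

The point of taking the extractors as injected callables is that each method can be unit-tested with stubs before wiring in the heavyweight libraries.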
**3. Test individual sources** (1 hour)
- Create `test_source.py` script
- Test each of 8 existing sources
- Identify which ones work
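A per-source harness along these lines would make step 3 mechanical. The `check_sources` name and the `(status, count)` result shape are assumptions for illustration, not the actual `test_source.py` interface:

```python
def check_sources(sources, scrape_one):
    """Run scrape_one(url) against every source and tally results.

    sources: dict of {name: feed_url}. scrape_one returns an article
    count or raises on failure; both are injected so the harness can
    be exercised against stubs without network access.
    """
    results = {}
    for name, url in sources.items():
        try:
            count = scrape_one(url)
            results[name] = ("OK" if count > 0 else "EMPTY", count)
        except Exception as exc:
            print(f"{name}: FAILED ({exc})")
            results[name] = ("FAIL", 0)
    return results
```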
**4. Update config** (10 min)
- Disable broken sources
- Keep only working ones
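Disabling broken sources is simplest if each entry in `backend/config.py` carries an on/off switch. The field names and example entries below are illustrative, not the file's actual schema:

```python
# Hypothetical schema: each source is a dict with an "enabled" flag,
# so broken sources can be switched off without deleting their config.
SOURCES = [
    {"name": "Working Source", "url": "https://example.com/working/feed", "enabled": True},
    {"name": "Broken Source", "url": "https://example.com/broken/feed", "enabled": False},
]

def active_sources(sources):
    """Return only the sources the pipeline should actually scrape."""
    return [s for s in sources if s["enabled"]]
```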
**5. Test run** (90 min)
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```
- Target: At least 10 articles scraped
- If successful → deploy for tomorrow's cron
### TOMORROW (Day 2)
**Morning:**
- Check overnight cron results
- Fix any new errors
**Afternoon:**
- Add 5 high-priority new sources:
- OpenAI Blog
- Anthropic Blog
- Hugging Face Blog
- Google AI Blog
- MarkTechPost
- Test evening run (target: 25+ articles)
### DAY 3
- Add remaining 17 new sources (30 total)
- Full test with all sources
- Verify monitoring works
### DAYS 4-7 (If time permits)
- Parallel scraping (reduce runtime 90min → 40min)
- Source health scoring
- Image extraction improvements
- Translation quality enhancements
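The parallel-scraping item can be sketched with the standard library's `ThreadPoolExecutor`; per-source failures are isolated so one bad feed cannot sink the whole run. The `scrape_all` name and signature are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(sources, scrape_one, max_workers=8):
    """Scrape sources concurrently. A failing source yields None in
    the result map instead of aborting the pipeline."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                results[src] = future.result()
            except Exception:
                results[src] = None  # record the failure, keep going
    return results
```

Threads (not processes) fit here because RSS scraping is I/O-bound, which is what makes the 90min to 40min target plausible.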
---
## 📋 Key Files to Review
### Planning Docs
1. **`SCRAPER-IMPROVEMENT-PLAN.md`** - Full technical plan
- Current issues explained
- 22 new RSS sources listed
- Implementation details
- Success metrics
2. **`BURMDDIT-TASKS.md`** - Task tracker
- Day-by-day breakdown
- Checkboxes for tracking progress
- Daily checklist
- Success criteria
### Code Files (To Be Created)
1. `backend/scraper_v2.py` - New scraper (URGENT)
2. `backend/test_source.py` - Source tester
3. `scripts/check-pipeline-health.sh` - Health monitor ✅ (done)
4. `scripts/source-stats.py` - Stats reporter ✅ (done)
### Config Files
1. `backend/config.py` - Source configuration
2. `backend/.env` - Environment variables (API keys)
---
## 🎯 Success Criteria
### Immediate (Today)
- ✅ At least 10 articles scraped in test run
- ✅ At least 3 sources working
- ✅ Pipeline completes without crashing
### Day 3
- ✅ 30+ sources configured
- ✅ 40+ articles scraped per run
- ✅ <5% error rate
### Week 1
- ✅ 30-40 articles published daily
- ✅ 25/30 sources active
- ✅ 95%+ pipeline success rate
- ✅ Automatic monitoring working
---
## 🚨 Critical Path
**BLOCKER:** Scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.
**Timeline:**
- Now → +2h: Build `scraper_v2.py`
- +2h → +3h: Test sources
- +3h → +4.5h: Full pipeline test
- +4.5h: Deploy if successful
If delayed, the website stays broken for another day, which means lost traffic.
---
## 📊 New Sources to Add (Top 10)
These are the highest-quality sources to prioritize:
1. **OpenAI Blog** - `https://openai.com/blog/rss/`
2. **Anthropic Blog** - `https://www.anthropic.com/rss`
3. **Hugging Face** - `https://huggingface.co/blog/feed.xml`
4. **Google AI** - `http://googleaiblog.blogspot.com/atom.xml`
5. **MarkTechPost** - `https://www.marktechpost.com/feed/`
6. **The Rundown AI** - `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
7. **Last Week in AI** - `https://lastweekin.ai/feed`
8. **Analytics India Magazine** - `https://analyticsindiamag.com/feed/`
9. **AI News** - `https://www.artificialintelligence-news.com/feed/rss/`
10. **KDnuggets** - `https://www.kdnuggets.com/feed`
(Full list of 22 sources in `SCRAPER-IMPROVEMENT-PLAN.md`)
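Note the list mixes RSS and Atom (Google AI serves `atom.xml`), so any quick validity check on these feeds has to handle both formats. A stdlib-only sketch of an entry counter, assuming the feed XML has already been downloaded:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def count_feed_entries(xml_text):
    """Count RSS <item> elements, falling back to Atom <entry>."""
    root = ET.fromstring(xml_text)
    items = root.findall(".//item")            # RSS 2.0
    if not items:
        items = root.findall(f".//{ATOM_NS}entry")  # Atom
    return len(items)
```

A feed that parses but returns zero entries is as broken as one that 404s, so the count (not just HTTP status) is the thing to check.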
---
## 🤖 Automatic Monitoring
**I've set up automatic health checks:**
- **Heartbeat monitoring** (every 2 hours)
- Runs `scripts/check-pipeline-health.sh`
- Alerts if: zero articles, high errors, or stale pipeline
- **Daily checklist** (9 AM Singapore time)
- Check overnight cron results
- Review errors
- Update task tracker
- Report status
**You'll be notified automatically if:**
- Pipeline fails
- Article count drops below 10
- Error count exceeds 50 in a run
- No run in 36+ hours
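The alert thresholds above can be restated as a small pure function. The real check lives in `scripts/check-pipeline-health.sh`; this Python version is only a sketch of the logic, with the thresholds taken from the list above:

```python
def health_alerts(articles_today, error_count, hours_since_last_run,
                  min_articles=10, max_errors=50, max_stale_hours=36):
    """Return a list of alert reasons; an empty list means healthy."""
    alerts = []
    if articles_today < min_articles:
        alerts.append(f"article count below {min_articles}")
    if error_count > max_errors:
        alerts.append(f"error count above {max_errors}")
    if hours_since_last_run >= max_stale_hours:
        alerts.append(f"no run in {max_stale_hours}+ hours")
    return alerts
```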
---
## 💬 Questions to Decide
1. **Should I start building `scraper_v2.py` now?**
- Or do you want to review the plan first?
2. **Do you want to add all 22 sources at once, or gradually?**
- Recommendation: Start with top 10, then expand
3. **Should I deploy the fix automatically or ask first?**
- Recommendation: Test first, then ask before deploying
4. **Priority: Speed or perfection?**
- Option A: Quick fix (2-4 hours, basic functionality)
- Option B: Proper rebuild (1-2 days, all optimizations)
---
## 📞 Contact
**Owner:** Zeya Phyo
**Developer:** Bob
**Deadline:** ASAP (ideally today)
**Current time:** 2026-02-26 08:30 UTC (4:30 PM Singapore)
---
## 🚀 Ready to Start?
**Recommended action:** Let me start building `scraper_v2.py` now.
**Command to kick off:**
```
Yes, start fixing the scraper now
```
Or if you want to review the plan first:
```
Show me the technical details of scraper_v2.py first
```
**All planning documents are ready. Just need your go-ahead to execute. 🎯**