Add web admin features + fix scraper & translator

Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping: from 0 working articles to 96+
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
Author: Zeya Phyo
Date: 2026-02-26 09:17:50 +00:00
parent 8bf5f342cd
commit f51ac4afa4
20 changed files with 4769 additions and 23 deletions

NEXT-STEPS.md (new file, 248 lines)

@@ -0,0 +1,248 @@
# 🚀 Burmddit: Next Steps (START HERE)
**Created:** 2026-02-26
**Priority:** 🔥 CRITICAL
**Status:** Action Required
---
## 🎯 The Problem
**burmddit.com is broken:**
- ❌ 0 articles scraped in the last 5 days
- ❌ Stuck at 87 articles (last update: Feb 21)
- ❌ All 8 news sources failing
- ❌ Pipeline runs daily but produces nothing
**Root cause:** `newspaper3k` library failures + scraping errors
---
## ✅ What I've Done (Last 30 minutes)
### 1. Research & Analysis
- ✅ Identified all scraper errors from logs
- ✅ Researched 100+ AI news RSS feeds
- ✅ Found 22 high-quality new sources to add
### 2. Planning Documents Created
- ✅ `SCRAPER-IMPROVEMENT-PLAN.md` - Detailed technical plan
- ✅ `BURMDDIT-TASKS.md` - Day-by-day task tracker
- ✅ `NEXT-STEPS.md` - This file (action plan)
### 3. Monitoring Scripts Created
- ✅ `scripts/check-pipeline-health.sh` - Quick health check
- ✅ `scripts/source-stats.py` - Source performance stats
- ✅ Updated `HEARTBEAT.md` - Auto-monitoring every 2 hours
---
## 🔥 What Needs to Happen Next (Priority Order)
### TODAY (Next 4 hours)
**1. Install dependencies** (5 min)
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean
```
**2. Create improved scraper** (2 hours)
- File: `backend/scraper_v2.py`
- Features:
- Multi-method extraction (newspaper → trafilatura → beautifulsoup)
- User agent rotation
- Better error handling
- Retry logic with exponential backoff
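The fallback chain and retry behavior listed above can be sketched as follows. This is a minimal illustration, not the actual `scraper_v2.py`: the extractor callables stand in for the real newspaper3k/trafilatura/BeautifulSoup wrappers, and the user-agent strings are placeholders (the real pool would come from `fake-useragent` or full browser UA strings):

```python
import random
import time

# Placeholder pool for user-agent rotation (assumption: real list would
# hold complete browser UA strings or come from fake-useragent).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def extract_with_fallback(html, extractors, min_length=200):
    """Try each (name, extractor) pair in order; accept the first
    result that looks like a real article body."""
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # this method failed; fall through to the next
        if text and len(text) >= min_length:
            return name, text
    return None, ""

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Retry a fetch with exponential backoff plus random jitter."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

The point of taking the extractors as injected callables is that each method can be unit-tested with stubs before wiring in the heavyweight libraries.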
**3. Test individual sources** (1 hour)
- Create `test_source.py` script
- Test each of 8 existing sources
- Identify which ones work
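A per-source harness along these lines would make step 3 mechanical. The `check_sources` name and the `(status, count)` result shape are assumptions for illustration, not the actual `test_source.py` interface:

```python
def check_sources(sources, scrape_one):
    """Run scrape_one(url) against every source and tally results.

    sources: dict of {name: feed_url}. scrape_one returns an article
    count or raises on failure; both are injected so the harness can
    be exercised against stubs without network access.
    """
    results = {}
    for name, url in sources.items():
        try:
            count = scrape_one(url)
            results[name] = ("OK" if count > 0 else "EMPTY", count)
        except Exception as exc:
            print(f"{name}: FAILED ({exc})")
            results[name] = ("FAIL", 0)
    return results
```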
**4. Update config** (10 min)
- Disable broken sources
- Keep only working ones
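Disabling broken sources is simplest if each entry in `backend/config.py` carries an on/off switch. The field names and example entries below are illustrative, not the file's actual schema:

```python
# Hypothetical schema: each source is a dict with an "enabled" flag,
# so broken sources can be switched off without deleting their config.
SOURCES = [
    {"name": "Working Source", "url": "https://example.com/working/feed", "enabled": True},
    {"name": "Broken Source", "url": "https://example.com/broken/feed", "enabled": False},
]

def active_sources(sources):
    """Return only the sources the pipeline should actually scrape."""
    return [s for s in sources if s["enabled"]]
```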
**5. Test run** (90 min)
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```
- Target: At least 10 articles scraped
- If successful → deploy for tomorrow's cron
### TOMORROW (Day 2)
**Morning:**
- Check overnight cron results
- Fix any new errors
**Afternoon:**
- Add 5 high-priority new sources:
- OpenAI Blog
- Anthropic Blog
- Hugging Face Blog
- Google AI Blog
- MarkTechPost
- Test evening run (target: 25+ articles)
### DAY 3
- Add remaining 17 new sources (30 total)
- Full test with all sources
- Verify monitoring works
### DAYS 4-7 (If time permits)
- Parallel scraping (reduce runtime 90min → 40min)
- Source health scoring
- Image extraction improvements
- Translation quality enhancements
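The parallel-scraping item can be sketched with the standard library's `ThreadPoolExecutor`; per-source failures are isolated so one bad feed cannot sink the whole run. The `scrape_all` name and signature are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(sources, scrape_one, max_workers=8):
    """Scrape sources concurrently. A failing source yields None in
    the result map instead of aborting the pipeline."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                results[src] = future.result()
            except Exception:
                results[src] = None  # record the failure, keep going
    return results
```

Threads (not processes) fit here because RSS scraping is I/O-bound, which is what makes the 90min to 40min target plausible.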
---
## 📋 Key Files to Review
### Planning Docs
1. **`SCRAPER-IMPROVEMENT-PLAN.md`** - Full technical plan
- Current issues explained
- 22 new RSS sources listed
- Implementation details
- Success metrics
2. **`BURMDDIT-TASKS.md`** - Task tracker
- Day-by-day breakdown
- Checkboxes for tracking progress
- Daily checklist
- Success criteria
### Code Files (To Be Created)
1. `backend/scraper_v2.py` - New scraper (URGENT)
2. `backend/test_source.py` - Source tester
3. `scripts/check-pipeline-health.sh` - Health monitor ✅ (done)
4. `scripts/source-stats.py` - Stats reporter ✅ (done)
### Config Files
1. `backend/config.py` - Source configuration
2. `backend/.env` - Environment variables (API keys)
---
## 🎯 Success Criteria
### Immediate (Today)
- ✅ At least 10 articles scraped in test run
- ✅ At least 3 sources working
- ✅ Pipeline completes without crashing
### Day 3
- ✅ 30+ sources configured
- ✅ 40+ articles scraped per run
- ✅ <5% error rate
### Week 1
- ✅ 30-40 articles published daily
- ✅ 25/30 sources active
- ✅ 95%+ pipeline success rate
- ✅ Automatic monitoring working
---
## 🚨 Critical Path
**BLOCKER:** Scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.
**Timeline:**
- Now → +2h: Build `scraper_v2.py`
- +2h → +3h: Test sources
- +3h → +4.5h: Full pipeline test
- +4.5h: Deploy if successful
If delayed, the website stays broken for another day, which means lost traffic.
---
## 📊 New Sources to Add (Top 10)
These are the highest-quality sources to prioritize:
1. **OpenAI Blog** - `https://openai.com/blog/rss/`
2. **Anthropic Blog** - `https://www.anthropic.com/rss`
3. **Hugging Face** - `https://huggingface.co/blog/feed.xml`
4. **Google AI** - `http://googleaiblog.blogspot.com/atom.xml`
5. **MarkTechPost** - `https://www.marktechpost.com/feed/`
6. **The Rundown AI** - `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
7. **Last Week in AI** - `https://lastweekin.ai/feed`
8. **Analytics India Magazine** - `https://analyticsindiamag.com/feed/`
9. **AI News** - `https://www.artificialintelligence-news.com/feed/rss/`
10. **KDnuggets** - `https://www.kdnuggets.com/feed`
(Full list of 22 sources in `SCRAPER-IMPROVEMENT-PLAN.md`)
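Note the list mixes RSS and Atom (Google AI serves `atom.xml`), so any quick validity check on these feeds has to handle both formats. A stdlib-only sketch of an entry counter, assuming the feed XML has already been downloaded:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def count_feed_entries(xml_text):
    """Count RSS <item> elements, falling back to Atom <entry>."""
    root = ET.fromstring(xml_text)
    items = root.findall(".//item")            # RSS 2.0
    if not items:
        items = root.findall(f".//{ATOM_NS}entry")  # Atom
    return len(items)
```

A feed that parses but returns zero entries is as broken as one that 404s, so the count (not just HTTP status) is the thing to check.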
---
## 🤖 Automatic Monitoring
**I've set up automatic health checks:**
- **Heartbeat monitoring** (every 2 hours)
- Runs `scripts/check-pipeline-health.sh`
- Alerts if: zero articles, high errors, or stale pipeline
- **Daily checklist** (9 AM Singapore time)
- Check overnight cron results
- Review errors
- Update task tracker
- Report status
**You'll be notified automatically if:**
- Pipeline fails
- Article count drops below 10
- Error count exceeds 50 in a run
- No run in 36+ hours
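The alert thresholds above can be restated as a small pure function. The real check lives in `scripts/check-pipeline-health.sh`; this Python version is only a sketch of the logic, with the thresholds taken from the list above:

```python
def health_alerts(articles_today, error_count, hours_since_last_run,
                  min_articles=10, max_errors=50, max_stale_hours=36):
    """Return a list of alert reasons; an empty list means healthy."""
    alerts = []
    if articles_today < min_articles:
        alerts.append(f"article count below {min_articles}")
    if error_count > max_errors:
        alerts.append(f"error count above {max_errors}")
    if hours_since_last_run >= max_stale_hours:
        alerts.append(f"no run in {max_stale_hours}+ hours")
    return alerts
```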
---
## 💬 Questions to Decide
1. **Should I start building `scraper_v2.py` now?**
- Or do you want to review the plan first?
2. **Do you want to add all 22 sources at once, or gradually?**
- Recommendation: Start with top 10, then expand
3. **Should I deploy the fix automatically or ask first?**
- Recommendation: Test first, then ask before deploying
4. **Priority: Speed or perfection?**
- Option A: Quick fix (2-4 hours, basic functionality)
- Option B: Proper rebuild (1-2 days, all optimizations)
---
## 📞 Contact
**Owner:** Zeya Phyo
**Developer:** Bob
**Deadline:** ASAP (ideally today)
**Current time:** 2026-02-26 08:30 UTC (4:30 PM Singapore)
---
## 🚀 Ready to Start?
**Recommended action:** Let me start building `scraper_v2.py` now.
**Command to kick off:**
```
Yes, start fixing the scraper now
```
Or if you want to review the plan first:
```
Show me the technical details of scraper_v2.py first
```
**All planning documents are ready. Just need your go-ahead to execute. 🎯**