# 🚀 Burmddit: Next Steps (START HERE)

**Created:** 2026-02-26
**Priority:** 🔥 CRITICAL
**Status:** Action Required

---
## 🎯 The Problem

**burmddit.com is broken:**

- ❌ 0 articles scraped in the last 5 days
- ❌ Stuck at 87 articles (last update: Feb 21)
- ❌ All 8 news sources failing
- ❌ Pipeline runs daily but produces nothing

**Root cause:** `newspaper3k` library failures + unhandled scraping errors

---
## ✅ What I've Done (Last 30 minutes)

### 1. Research & Analysis

- ✅ Identified all scraper errors from logs
- ✅ Researched 100+ AI news RSS feeds
- ✅ Found 22 high-quality new sources to add

### 2. Planning Documents Created

- ✅ `SCRAPER-IMPROVEMENT-PLAN.md` - Detailed technical plan
- ✅ `BURMDDIT-TASKS.md` - Day-by-day task tracker
- ✅ `NEXT-STEPS.md` - This file (action plan)

### 3. Monitoring Scripts Created

- ✅ `scripts/check-pipeline-health.sh` - Quick health check
- ✅ `scripts/source-stats.py` - Source performance stats
- ✅ Updated `HEARTBEAT.md` - Auto-monitoring every 2 hours

---
## 🔥 What Needs to Happen Next (Priority Order)

### TODAY (Next 4 hours)

**1. Install dependencies** (5 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean
```
**2. Create improved scraper** (2 hours)

- File: `backend/scraper_v2.py`
- Features (see the sketch after this list):
  - Multi-method extraction (newspaper → trafilatura → readability, matching the dependencies installed above)
  - User agent rotation
  - Better error handling
  - Retry logic with exponential backoff
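To make this concrete, here is a minimal sketch of the fallback chain, assuming the libraries from the install step above (`requests`, `fake-useragent`, `trafilatura`, `readability-lxml`). The function names and the 200-character "non-trivial" threshold are illustrative, not the final `scraper_v2.py` API:

```python
import time
from typing import Optional

import requests
from fake_useragent import UserAgent


def fetch_html(url: str, retries: int = 3) -> str:
    """Fetch a page with a rotated user agent and exponential backoff."""
    ua = UserAgent()
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers={"User-Agent": ua.random}, timeout=20)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, ...
    return ""  # unreachable; keeps type checkers happy


def extract_article(url: str) -> Optional[str]:
    """Try extractors in order; return the first non-trivial result."""
    html = fetch_html(url)

    # 1. newspaper3k (the current extractor)
    try:
        from newspaper import Article
        art = Article(url)
        art.download(input_html=html)  # reuse the HTML we already fetched
        art.parse()
        if len(art.text) > 200:
            return art.text
    except Exception:
        pass  # fall through to the next extractor

    # 2. trafilatura
    import trafilatura
    text = trafilatura.extract(html)
    if text and len(text) > 200:
        return text

    # 3. readability-lxml (last resort; returns cleaned HTML, not plain text)
    from readability import Document
    return Document(html).summary()
```

Each extractor only runs when the previous one fails or returns something too short to be a real article, so working sources pay no extra cost.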
**3. Test individual sources** (1 hour)

- Create a `test_source.py` script (see the sketch after this list)
- Test each of the 8 existing sources
- Identify which ones work
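A plausible shape for `test_source.py`, assuming the pipeline already uses `feedparser` for RSS and reusing the hypothetical `extract_article` from the sketch above:

```python
"""Probe one RSS source end to end: fetch the feed, extract a few articles."""
import sys

import feedparser

from scraper_v2 import extract_article  # hypothetical import, per the sketch above


def test_source(feed_url: str, limit: int = 3) -> None:
    feed = feedparser.parse(feed_url)
    print(f"{feed_url}: {len(feed.entries)} entries in feed")
    for entry in feed.entries[:limit]:
        try:
            text = extract_article(entry.link)
            status = "OK" if text and len(text) > 200 else "EMPTY"
        except Exception as exc:
            status = f"FAIL ({exc})"
        print(f"  [{status}] {entry.link}")


if __name__ == "__main__":
    test_source(sys.argv[1])
```

Run it once per configured feed, e.g. `python3 test_source.py https://www.marktechpost.com/feed/`.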
**4. Update config** (10 min)

- Disable broken sources (see the sketch after this list)
- Keep only the working ones
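The current structure of `backend/config.py` isn't shown here, but a per-source `enabled` flag is one simple way to switch feeds off without deleting them; a sketch under that assumption:

```python
# Hypothetical config.py shape: broken feeds stay listed, just disabled.
SOURCES = [
    {"name": "MarkTechPost", "rss": "https://www.marktechpost.com/feed/", "enabled": True},
    {"name": "SomeBrokenSource", "rss": "https://example.com/feed", "enabled": False},
]

# The pipeline would iterate only over sources that are switched on.
ACTIVE_SOURCES = [s for s in SOURCES if s["enabled"]]
```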
**5. Test run** (90 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```

- Target: at least 10 articles scraped
- If successful → deploy for tomorrow's cron
### TOMORROW (Day 2)

**Morning:**

- Check overnight cron results
- Fix any new errors

**Afternoon:**

- Add 5 high-priority new sources:
  - OpenAI Blog
  - Anthropic Blog
  - Hugging Face Blog
  - Google AI Blog
  - MarkTechPost
- Test evening run (target: 25+ articles)
### DAY 3

- Add the remaining 17 new sources (30 total)
- Full test with all sources
- Verify monitoring works
### DAYS 4-7 (If time permits)

- Parallel scraping (cut runtime from ~90 min to ~40 min; see the sketch after this list)
- Source health scoring
- Image extraction improvements
- Translation quality enhancements
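One way that runtime cut could work: a thread pool over article URLs, reusing the hypothetical `extract_article` from the earlier sketch. The worker count and error handling below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List

from scraper_v2 import extract_article  # hypothetical import


def scrape_all(urls: List[str], workers: int = 8) -> Dict[str, str]:
    """Scrape article URLs concurrently; a failed URL is skipped, not fatal."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_article, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
                if text:
                    results[url] = text
            except Exception as exc:
                print(f"[skip] {url}: {exc}")
    return results
```

Since scraping is I/O-bound, threads (rather than processes) should be enough to capture most of the speedup.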
---
## 📋 Key Files to Review

### Planning Docs

1. **`SCRAPER-IMPROVEMENT-PLAN.md`** - Full technical plan
   - Current issues explained
   - 22 new RSS sources listed
   - Implementation details
   - Success metrics

2. **`BURMDDIT-TASKS.md`** - Task tracker
   - Day-by-day breakdown
   - Checkboxes for tracking progress
   - Daily checklist
   - Success criteria
### Code Files

1. `backend/scraper_v2.py` - New scraper (URGENT, to be created)
2. `backend/test_source.py` - Source tester (to be created)
3. `scripts/check-pipeline-health.sh` - Health monitor ✅ (done)
4. `scripts/source-stats.py` - Stats reporter ✅ (done)

### Config Files

1. `backend/config.py` - Source configuration
2. `backend/.env` - Environment variables (API keys)

---
## 🎯 Success Criteria

### Immediate (Today)

- ✅ At least 10 articles scraped in a test run
- ✅ At least 3 sources working
- ✅ Pipeline completes without crashing

### Day 3

- ✅ 30+ sources configured
- ✅ 40+ articles scraped per run
- ✅ <5% error rate

### Week 1

- ✅ 30-40 articles published daily
- ✅ 25/30 sources active
- ✅ 95%+ pipeline success rate
- ✅ Automatic monitoring working

---
## 🚨 Critical Path

**BLOCKER:** The scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.

**Timeline:**

- Now → +2h: Build `scraper_v2.py`
- +2h → +3h: Test sources
- +3h → +4.5h: Full pipeline test
- +4.5h: Deploy if successful

If this slips, the site stays broken for another day, which means more lost traffic.

---
## 📊 New Sources to Add (Top 10)

These are the highest-quality sources to prioritize:

1. **OpenAI Blog** - `https://openai.com/blog/rss/`
2. **Anthropic Blog** - `https://www.anthropic.com/rss`
3. **Hugging Face** - `https://huggingface.co/blog/feed.xml`
4. **Google AI** - `http://googleaiblog.blogspot.com/atom.xml`
5. **MarkTechPost** - `https://www.marktechpost.com/feed/`
6. **The Rundown AI** - `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
7. **Last Week in AI** - `https://lastweekin.ai/feed`
8. **Analytics India Magazine** - `https://analyticsindiamag.com/feed/`
9. **AI News** - `https://www.artificialintelligence-news.com/feed/rss/`
10. **KDnuggets** - `https://www.kdnuggets.com/feed`

(Full list of 22 sources in `SCRAPER-IMPROVEMENT-PLAN.md`)

---
## 🤖 Automatic Monitoring

**I've set up automatic health checks:**

- **Heartbeat monitoring** (every 2 hours)
  - Runs `scripts/check-pipeline-health.sh`
  - Alerts if: zero articles, high errors, or a stale pipeline

- **Daily checklist** (9 AM Singapore time)
  - Check overnight cron results
  - Review errors
  - Update task tracker
  - Report status

**You'll be notified automatically if:**

- The pipeline fails
- Article count drops below 10
- Error rate exceeds 50%
- No run in 36+ hours
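The shipped health check is a bash script, but its alert conditions translate roughly to the following. The state-file path and field names here are assumptions for illustration, not the script's actual implementation:

```python
import json
import time
from pathlib import Path
from typing import List

STATE = Path("backend/pipeline_state.json")  # assumed location of run stats


def health_alerts() -> List[str]:
    """Return alert strings; an empty list means the pipeline looks healthy."""
    state = json.loads(STATE.read_text())
    alerts: List[str] = []
    if state["articles_scraped"] < 10:
        alerts.append("article count below 10")
    attempts = state["articles_scraped"] + state["errors"]
    if attempts and state["errors"] / attempts > 0.5:
        alerts.append("error rate above 50%")
    if time.time() - state["last_run_ts"] > 36 * 3600:
        alerts.append("no run in 36+ hours")
    return alerts
```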
---
## 💬 Questions to Decide

1. **Should I start building `scraper_v2.py` now?**
   - Or do you want to review the plan first?

2. **Do you want to add all 22 sources at once, or gradually?**
   - Recommendation: Start with the top 10, then expand

3. **Should I deploy the fix automatically or ask first?**
   - Recommendation: Test first, then ask before deploying

4. **Priority: speed or perfection?**
   - Option A: Quick fix (2-4 hours, basic functionality)
   - Option B: Proper rebuild (1-2 days, all optimizations)

---
## 📞 Contact

**Owner:** Zeya Phyo
**Developer:** Bob
**Deadline:** ASAP (ideally today)

**Current time:** 2026-02-26 08:30 UTC (4:30 PM Singapore)

---
## 🚀 Ready to Start?

**Recommended action:** Let me start building `scraper_v2.py` now.

**Command to kick off:**

```
Yes, start fixing the scraper now
```

Or if you want to review the plan first:

```
Show me the technical details of scraper_v2.py first
```

**All planning documents are ready. Just need your go-ahead to execute. 🎯**