forked from minzeyaphyo/burmddit
Add web admin features + fix scraper & translator
Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
NEXT-STEPS.md (new file, 248 lines)
@@ -0,0 +1,248 @@
# 🚀 Burmddit: Next Steps (START HERE)

**Created:** 2026-02-26
**Priority:** 🔥 CRITICAL
**Status:** Action Required

---

## 🎯 The Problem

**burmddit.com is broken:**

- ❌ 0 articles scraped in the last 5 days
- ❌ Stuck at 87 articles (last update: Feb 21)
- ❌ All 8 news sources failing
- ❌ Pipeline runs daily but produces nothing

**Root cause:** `newspaper3k` library failures + scraping errors

---

## ✅ What I've Done (Last 30 minutes)

### 1. Research & Analysis
- ✅ Identified all scraper errors from logs
- ✅ Researched 100+ AI news RSS feeds
- ✅ Found 22 high-quality new sources to add

### 2. Planning Documents Created
- ✅ `SCRAPER-IMPROVEMENT-PLAN.md` - Detailed technical plan
- ✅ `BURMDDIT-TASKS.md` - Day-by-day task tracker
- ✅ `NEXT-STEPS.md` - This file (action plan)

### 3. Monitoring Scripts Created
- ✅ `scripts/check-pipeline-health.sh` - Quick health check
- ✅ `scripts/source-stats.py` - Source performance stats
- ✅ Updated `HEARTBEAT.md` - Auto-monitoring every 2 hours

---

## 🔥 What Needs to Happen Next (Priority Order)

### TODAY (Next 4 hours)

**1. Install dependencies** (5 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean
```

**2. Create improved scraper** (2 hours)
- File: `backend/scraper_v2.py`
- Features:
  - Multi-method extraction (newspaper → trafilatura → beautifulsoup)
  - User agent rotation
  - Better error handling
  - Retry logic with exponential backoff
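
The fallback-chain and retry behavior above can be sketched as follows. This is a minimal, stdlib-only sketch of the intended shape of `scraper_v2.py`, not the final implementation: the user-agent list, the 50-word threshold, and the extractor names are illustrative assumptions (the real version would plug in newspaper3k, trafilatura, and BeautifulSoup as the extractor functions, and `fake-useragent` for the UA pool).

```python
import random
import time

# Hypothetical UA pool; the real scraper would use fake-useragent instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def pick_user_agent():
    """Rotate user agents to reduce the chance of being blocked."""
    return random.choice(USER_AGENTS)

def extract_with_fallback(html, extractors, min_words=50):
    """Try each (name, fn) extractor in order and return the first result
    that looks like real article text. Returns (None, None) if all fail."""
    for name, fn in extractors:
        try:
            text = fn(html)
        except Exception:
            continue  # this extractor crashed; fall through to the next one
        if text and len(text.split()) >= min_words:
            return name, text
    return None, None

def fetch_with_retry(fetch, url, attempts=3, base_delay=2.0):
    """Call fetch(url), retrying with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the pipeline
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The key design point is that extractor failures are swallowed per-method (one bad parser no longer kills the article), while fetch failures are retried and only raised once all attempts are exhausted.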

**3. Test individual sources** (1 hour)
- Create `test_source.py` script
- Test each of 8 existing sources
- Identify which ones work

**4. Update config** (10 min)
- Disable broken sources
- Keep only working ones
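
One low-risk way to disable sources without deleting their entries is an `enabled` flag in the config. The field names below are assumptions about the shape of `backend/config.py`, shown only as a sketch:

```python
# Hypothetical shape for backend/config.py; real field names may differ.
SOURCES = [
    {"name": "MarkTechPost", "rss": "https://www.marktechpost.com/feed/", "enabled": True},
    {"name": "Broken Source", "rss": "https://example.com/feed", "enabled": False},
]

def active_sources(sources):
    """Return only sources that are switched on. Entries without the flag
    default to enabled, so legacy config entries keep working."""
    return [s for s in sources if s.get("enabled", True)]
```

Keeping disabled entries in place makes it a one-character change to re-enable a source once its scraper is fixed.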

**5. Test run** (90 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```

- Target: At least 10 articles scraped
- If successful → deploy for tomorrow's cron

### TOMORROW (Day 2)

**Morning:**
- Check overnight cron results
- Fix any new errors

**Afternoon:**
- Add 5 high-priority new sources:
  - OpenAI Blog
  - Anthropic Blog
  - Hugging Face Blog
  - Google AI Blog
  - MarkTechPost
- Test evening run (target: 25+ articles)

### DAY 3

- Add remaining 17 new sources (30 total)
- Full test with all sources
- Verify monitoring works

### DAYS 4-7 (If time permits)

- Parallel scraping (reduce runtime 90min → 40min)
- Source health scoring
- Image extraction improvements
- Translation quality enhancements
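
For the parallel-scraping item, a thread pool is the usual fit, since scraping is I/O-bound. A sketch of the idea (function names here are placeholders, not the pipeline's real API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(sources, scrape_one, max_workers=8):
    """Scrape sources concurrently so one slow or failing source no longer
    blocks the rest. Returns {source: result-or-exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, src): src for src in sources}
        for fut in as_completed(futures):
            src = futures[fut]
            try:
                results[src] = fut.result()
            except Exception as exc:
                results[src] = exc  # record the failure instead of raising
    return results
```

With 30 sources and 8 workers, total runtime approaches the time of the slowest handful of sources rather than the sum of all of them, which is where the 90min → 40min estimate comes from.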

---

## 📋 Key Files to Review

### Planning Docs

1. **`SCRAPER-IMPROVEMENT-PLAN.md`** - Full technical plan
   - Current issues explained
   - 22 new RSS sources listed
   - Implementation details
   - Success metrics

2. **`BURMDDIT-TASKS.md`** - Task tracker
   - Day-by-day breakdown
   - Checkboxes for tracking progress
   - Daily checklist
   - Success criteria

### Code Files
1. `backend/scraper_v2.py` - New scraper (URGENT, to be created)
2. `backend/test_source.py` - Source tester (to be created)
3. `scripts/check-pipeline-health.sh` - Health monitor ✅ (done)
4. `scripts/source-stats.py` - Stats reporter ✅ (done)

### Config Files
1. `backend/config.py` - Source configuration
2. `backend/.env` - Environment variables (API keys)

---

## 🎯 Success Criteria

### Immediate (Today)
- ✅ At least 10 articles scraped in test run
- ✅ At least 3 sources working
- ✅ Pipeline completes without crashing

### Day 3
- ✅ 30+ sources configured
- ✅ 40+ articles scraped per run
- ✅ <5% error rate

### Week 1
- ✅ 30-40 articles published daily
- ✅ 25/30 sources active
- ✅ 95%+ pipeline success rate
- ✅ Automatic monitoring working

---

## 🚨 Critical Path

**BLOCKER:** The scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.

**Timeline:**
- Now → +2h: Build `scraper_v2.py`
- +2h → +3h: Test sources
- +3h → +4.5h: Full pipeline test
- +4.5h: Deploy if successful

If this slips, the website stays broken for another day, which means lost traffic.

---

## 📊 New Sources to Add (Top 10)

These are the highest-quality sources to prioritize:

1. **OpenAI Blog** - `https://openai.com/blog/rss/`
2. **Anthropic Blog** - `https://www.anthropic.com/rss`
3. **Hugging Face** - `https://huggingface.co/blog/feed.xml`
4. **Google AI** - `http://googleaiblog.blogspot.com/atom.xml`
5. **MarkTechPost** - `https://www.marktechpost.com/feed/`
6. **The Rundown AI** - `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
7. **Last Week in AI** - `https://lastweekin.ai/feed`
8. **Analytics India Magazine** - `https://analyticsindiamag.com/feed/`
9. **AI News** - `https://www.artificialintelligence-news.com/feed/rss/`
10. **KDnuggets** - `https://www.kdnuggets.com/feed`

(Full list of 22 sources in `SCRAPER-IMPROVEMENT-PLAN.md`)

---

## 🤖 Automatic Monitoring

**I've set up automatic health checks:**

- **Heartbeat monitoring** (every 2 hours)
  - Runs `scripts/check-pipeline-health.sh`
  - Alerts if: zero articles, high errors, or stale pipeline

- **Daily checklist** (9 AM Singapore time)
  - Check overnight cron results
  - Review errors
  - Update task tracker
  - Report status

**You'll be notified automatically if:**
- Pipeline fails
- Article count drops below 10
- Error count exceeds 50
- No run in 36+ hours
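
The threshold rules above reduce to a small predicate. A sketch in Python (the actual check lives in `scripts/check-pipeline-health.sh`; this only mirrors the three numeric thresholds stated above, while "pipeline fails" would be caught separately from the run's exit status):

```python
from datetime import datetime, timedelta

def alert_reasons(article_count, error_count, last_run, now=None):
    """Return the list of triggered alert conditions (empty list = healthy).
    Thresholds match the monitoring rules: <10 articles, >50 errors,
    no run in 36+ hours."""
    now = now or datetime.utcnow()
    reasons = []
    if article_count < 10:
        reasons.append("article count below 10")
    if error_count > 50:
        reasons.append("error count above 50")
    if now - last_run > timedelta(hours=36):
        reasons.append("no run in 36+ hours")
    return reasons
```

Returning the full list of reasons, rather than a single boolean, means one alert message can report every threshold that tripped.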

---

## 💬 Questions to Decide

1. **Should I start building `scraper_v2.py` now?**
   - Or do you want to review the plan first?

2. **Do you want to add all 22 sources at once, or gradually?**
   - Recommendation: Start with top 10, then expand

3. **Should I deploy the fix automatically or ask first?**
   - Recommendation: Test first, then ask before deploying

4. **Priority: Speed or perfection?**
   - Option A: Quick fix (2-4 hours, basic functionality)
   - Option B: Proper rebuild (1-2 days, all optimizations)

---

## 📞 Contact

**Owner:** Zeya Phyo
**Developer:** Bob
**Deadline:** ASAP (ideally today)

**Current time:** 2026-02-26 08:30 UTC (4:30 PM Singapore)

---

## 🚀 Ready to Start?

**Recommended action:** Let me start building `scraper_v2.py` now.

**Command to kick off:**

```
Yes, start fixing the scraper now
```

Or if you want to review the plan first:

```
Show me the technical details of scraper_v2.py first
```

**All planning documents are ready. Just need your go-ahead to execute. 🎯**