# 🚀 Burmddit: Next Steps (START HERE)

**Created:** 2026-02-26
**Priority:** 🔥 CRITICAL
**Status:** Action Required

---

## 🎯 The Problem

**burmddit.com is broken:**

- ❌ 0 articles scraped in the last 5 days
- ❌ Stuck at 87 articles (last update: Feb 21)
- ❌ All 8 news sources failing
- ❌ Pipeline runs daily but produces nothing

**Root cause:** `newspaper3k` library failures + scraping errors

---

## ✅ What I've Done (Last 30 Minutes)

### 1. Research & Analysis

- ✅ Identified all scraper errors from logs
- ✅ Researched 100+ AI news RSS feeds
- ✅ Found 22 high-quality new sources to add

### 2. Planning Documents Created

- ✅ `SCRAPER-IMPROVEMENT-PLAN.md` - Detailed technical plan
- ✅ `BURMDDIT-TASKS.md` - Day-by-day task tracker
- ✅ `NEXT-STEPS.md` - This file (action plan)

### 3. Monitoring Scripts Created

- ✅ `scripts/check-pipeline-health.sh` - Quick health check
- ✅ `scripts/source-stats.py` - Source performance stats
- ✅ Updated `HEARTBEAT.md` - Auto-monitoring every 2 hours

---

## 🔥 What Needs to Happen Next (Priority Order)

### TODAY (Next 4 hours)

**1. Install dependencies** (5 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean
```

**2. Create improved scraper** (2 hours)

- File: `backend/scraper_v2.py`
- Features:
  - Multi-method extraction (newspaper → trafilatura → beautifulsoup)
  - User agent rotation
  - Better error handling
  - Retry logic with exponential backoff

**3. Test individual sources** (1 hour)

- Create `test_source.py` script
- Test each of the 8 existing sources
- Identify which ones work

**4. Update config** (10 min)

- Disable broken sources
- Keep only working ones

**5. Test run** (90 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```

- Target: at least 10 articles scraped
- If successful → deploy for tomorrow's cron

### TOMORROW (Day 2)

**Morning:**

- Check overnight cron results
- Fix any new errors

**Afternoon:**

- Add 5 high-priority new sources:
  - OpenAI Blog
  - Anthropic Blog
  - Hugging Face Blog
  - Google AI Blog
  - MarkTechPost
- Test evening run (target: 25+ articles)

### DAY 3

- Add the remaining 17 new sources (30 total)
- Full test with all sources
- Verify monitoring works

### DAYS 4-7 (If time permits)

- Parallel scraping (reduce runtime from 90 min to 40 min)
- Source health scoring
- Image extraction improvements
- Translation quality enhancements

---

## 📋 Key Files to Review

### Planning Docs

1. **`SCRAPER-IMPROVEMENT-PLAN.md`** - Full technical plan
   - Current issues explained
   - 22 new RSS sources listed
   - Implementation details
   - Success metrics
2. **`BURMDDIT-TASKS.md`** - Task tracker
   - Day-by-day breakdown
   - Checkboxes for tracking progress
   - Daily checklist
   - Success criteria

### Code Files (To Be Created)

1. `backend/scraper_v2.py` - New scraper (URGENT)
2. `backend/test_source.py` - Source tester
3. `scripts/check-pipeline-health.sh` - Health monitor ✅ (done)
4. `scripts/source-stats.py` - Stats reporter ✅ (done)

### Config Files

1. `backend/config.py` - Source configuration
2. `backend/.env` - Environment variables (API keys)

---

## 🎯 Success Criteria

### Immediate (Today)

- ✅ At least 10 articles scraped in test run
- ✅ At least 3 sources working
- ✅ Pipeline completes without crashing

### Day 3

- ✅ 30+ sources configured
- ✅ 40+ articles scraped per run
- ✅ <5% error rate

### Week 1

- ✅ 30-40 articles published daily
- ✅ 25/30 sources active
- ✅ 95%+ pipeline success rate
- ✅ Automatic monitoring working

---

## 🚨 Critical Path

**BLOCKER:** The scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.
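The multi-method extraction and retry-with-backoff behavior planned for `scraper_v2.py` could be sketched roughly as below. This is a minimal sketch, not the final implementation: the function names, the `min_chars` sanity threshold, and the generic `Exception` handling are all assumptions. In the real scraper, the extractor list would be the chain named above (newspaper3k, then trafilatura, then a BeautifulSoup paragraph join).

```python
import time
from typing import Callable, Optional

def with_retries(fetch: Callable[[str], str], attempts: int = 3,
                 base_delay: float = 1.0) -> Callable[[str], str]:
    """Wrap a fetch function with retries and exponential backoff (1s, 2s, 4s...)."""
    def wrapper(url: str) -> str:
        for attempt in range(attempts):
            try:
                return fetch(url)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the last error
                time.sleep(base_delay * (2 ** attempt))
    return wrapper

def extract_with_fallbacks(html: str,
                           extractors: list[Callable[[str], Optional[str]]],
                           min_chars: int = 200) -> Optional[str]:
    """Try each extractor in order; accept the first result long enough
    to plausibly be a real article body (min_chars is an assumed cutoff)."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # a failing extractor just falls through to the next one
        if text and len(text.strip()) >= min_chars:
            return text.strip()
    return None
```

The design choice here is that each extractor failure (exception or too-short output) is non-fatal: a page only counts as unscrapeable when every method in the chain has been tried.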
**Timeline:**

- Now → +2h: Build `scraper_v2.py`
- +2h → +3h: Test sources
- +3h → +4.5h: Full pipeline test
- +4.5h: Deploy if successful

If delayed, the website stays broken for another day = lost traffic.

---

## 📊 New Sources to Add (Top 10)

These are the highest-quality sources to prioritize:

1. **OpenAI Blog** - `https://openai.com/blog/rss/`
2. **Anthropic Blog** - `https://www.anthropic.com/rss`
3. **Hugging Face** - `https://huggingface.co/blog/feed.xml`
4. **Google AI** - `http://googleaiblog.blogspot.com/atom.xml`
5. **MarkTechPost** - `https://www.marktechpost.com/feed/`
6. **The Rundown AI** - `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
7. **Last Week in AI** - `https://lastweekin.ai/feed`
8. **Analytics India Magazine** - `https://analyticsindiamag.com/feed/`
9. **AI News** - `https://www.artificialintelligence-news.com/feed/rss/`
10. **KDnuggets** - `https://www.kdnuggets.com/feed`

(Full list of 22 sources in `SCRAPER-IMPROVEMENT-PLAN.md`)

---

## 🤖 Automatic Monitoring

**I've set up automatic health checks:**

- **Heartbeat monitoring** (every 2 hours)
  - Runs `scripts/check-pipeline-health.sh`
  - Alerts if: zero articles, high errors, or stale pipeline
- **Daily checklist** (9 AM Singapore time)
  - Check overnight cron results
  - Review errors
  - Update task tracker
  - Report status

**You'll be notified automatically if:**

- Pipeline fails
- Article count drops below 10
- Error rate exceeds 50%
- No run in 36+ hours

---

## 💬 Questions to Decide

1. **Should I start building `scraper_v2.py` now?**
   - Or do you want to review the plan first?
2. **Do you want to add all 22 sources at once, or gradually?**
   - Recommendation: start with the top 10, then expand
3. **Should I deploy the fix automatically or ask first?**
   - Recommendation: test first, then ask before deploying
4. **Priority: Speed or perfection?**
   - Option A: quick fix (2-4 hours, basic functionality)
   - Option B: proper rebuild (1-2 days, all optimizations)

---

## 📞 Contact

**Owner:** Zeya Phyo
**Developer:** Bob
**Deadline:** ASAP (ideally today)
**Current time:** 2026-02-26 08:30 UTC (4:30 PM Singapore)

---

## 🚀 Ready to Start?

**Recommended action:** Let me start building `scraper_v2.py` now.

**Command to kick off:**

```
Yes, start fixing the scraper now
```

Or if you want to review the plan first:

```
Show me the technical details of scraper_v2.py first
```

**All planning documents are ready. Just need your go-ahead to execute. 🎯**
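---

## 📎 Appendix: Alert Thresholds as Code

The notification thresholds listed under Automatic Monitoring (fewer than 10 articles, error rate over 50%, no run in 36+ hours) could be expressed as a small pure check like this. It is a sketch only: the function name `pipeline_alerts` and its parameters are hypothetical, and in practice the inputs would come from `scripts/check-pipeline-health.sh` or the pipeline's own logs.

```python
import datetime as dt
from typing import Optional

# Thresholds from the Automatic Monitoring section above.
MIN_ARTICLES = 10
MAX_ERROR_RATE = 0.50
MAX_HOURS_SINCE_RUN = 36

def pipeline_alerts(articles: int, error_rate: float,
                    last_run: Optional[dt.datetime],
                    now: Optional[dt.datetime] = None) -> list[str]:
    """Return a list of alert messages; an empty list means the pipeline looks healthy."""
    now = now or dt.datetime.utcnow()
    alerts = []
    if articles < MIN_ARTICLES:
        alerts.append(f"article count {articles} is below {MIN_ARTICLES}")
    if error_rate > MAX_ERROR_RATE:
        alerts.append(f"error rate {error_rate:.0%} exceeds {MAX_ERROR_RATE:.0%}")
    if last_run is None or (now - last_run) > dt.timedelta(hours=MAX_HOURS_SINCE_RUN):
        alerts.append(f"no pipeline run in the last {MAX_HOURS_SINCE_RUN} hours")
    return alerts
```

Keeping the check pure (all inputs passed in, including `now`) makes these thresholds trivially testable without touching the live pipeline.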