# 🚀 Burmddit: Next Steps (START HERE)

**Created:** 2026-02-26
**Priority:** 🔥 CRITICAL
**Status:** Action Required

---
## 🎯 The Problem

**burmddit.com is broken:**

- ❌ 0 articles scraped in the last 5 days
- ❌ Stuck at 87 articles (last update: Feb 21)
- ❌ All 8 news sources failing
- ❌ Pipeline runs daily but produces nothing

**Root cause:** `newspaper3k` library failures + unhandled scraping errors

---
## ✅ What I've Done (Last 30 minutes)

### 1. Research & Analysis

- ✅ Identified all scraper errors from logs
- ✅ Researched 100+ AI news RSS feeds
- ✅ Found 22 high-quality new sources to add

### 2. Planning Documents Created

- ✅ `SCRAPER-IMPROVEMENT-PLAN.md` - Detailed technical plan
- ✅ `BURMDDIT-TASKS.md` - Day-by-day task tracker
- ✅ `NEXT-STEPS.md` - This file (action plan)

### 3. Monitoring Scripts Created

- ✅ `scripts/check-pipeline-health.sh` - Quick health check
- ✅ `scripts/source-stats.py` - Source performance stats
- ✅ Updated `HEARTBEAT.md` - Auto-monitoring every 2 hours

---
## 🔥 What Needs to Happen Next (Priority Order)

### TODAY (Next 4 hours)

**1. Install dependencies** (5 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean
```
**2. Create improved scraper** (2 hours)

- File: `backend/scraper_v2.py`
- Features (see the sketch after this list):
  - Multi-method extraction (newspaper → trafilatura → readability, matching the dependencies installed above)
  - User agent rotation
  - Better error handling
  - Retry logic with exponential backoff
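To make this concrete, here is a minimal sketch of the fallback chain, assuming the libraries from the install step above (`requests`, `fake-useragent`, `trafilatura`, `readability-lxml`). The function names and the 200-character "non-trivial" threshold are illustrative, not the final `scraper_v2.py` API:

```python
import time
from typing import Optional

import requests
from fake_useragent import UserAgent


def fetch_html(url: str, retries: int = 3) -> str:
    """Fetch a page with a rotated user agent and exponential backoff."""
    ua = UserAgent()
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers={"User-Agent": ua.random}, timeout=20)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, ...
    return ""  # unreachable; keeps type checkers happy


def extract_article(url: str) -> Optional[str]:
    """Try extractors in order; return the first non-trivial result."""
    html = fetch_html(url)

    # 1. newspaper3k (the current extractor)
    try:
        from newspaper import Article
        art = Article(url)
        art.download(input_html=html)  # reuse the HTML we already fetched
        art.parse()
        if len(art.text) > 200:
            return art.text
    except Exception:
        pass  # fall through to the next extractor

    # 2. trafilatura
    import trafilatura
    text = trafilatura.extract(html)
    if text and len(text) > 200:
        return text

    # 3. readability-lxml (last resort; returns cleaned HTML, not plain text)
    from readability import Document
    return Document(html).summary()
```

Each extractor only runs when the previous one fails or returns something too short to be a real article, so working sources pay no extra cost.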
**3. Test individual sources** (1 hour)

- Create a `test_source.py` script (see the sketch after this list)
- Test each of the 8 existing sources
- Identify which ones work
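A plausible shape for `test_source.py`, assuming the pipeline already uses `feedparser` for RSS and reusing the hypothetical `extract_article` from the sketch above:

```python
"""Probe one RSS source end to end: fetch the feed, extract a few articles."""
import sys

import feedparser

from scraper_v2 import extract_article  # hypothetical import, per the sketch above


def test_source(feed_url: str, limit: int = 3) -> None:
    feed = feedparser.parse(feed_url)
    print(f"{feed_url}: {len(feed.entries)} entries in feed")
    for entry in feed.entries[:limit]:
        try:
            text = extract_article(entry.link)
            status = "OK" if text and len(text) > 200 else "EMPTY"
        except Exception as exc:
            status = f"FAIL ({exc})"
        print(f"  [{status}] {entry.link}")


if __name__ == "__main__":
    test_source(sys.argv[1])
```

Run it once per configured feed, e.g. `python3 test_source.py https://www.marktechpost.com/feed/`.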
**4. Update config** (10 min)

- Disable broken sources (see the sketch after this list)
- Keep only the working ones
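The current structure of `backend/config.py` isn't shown here, but a per-source `enabled` flag is one simple way to switch feeds off without deleting them; a sketch under that assumption:

```python
# Hypothetical config.py shape: broken feeds stay listed, just disabled.
SOURCES = [
    {"name": "MarkTechPost", "rss": "https://www.marktechpost.com/feed/", "enabled": True},
    {"name": "SomeBrokenSource", "rss": "https://example.com/feed", "enabled": False},
]

# The pipeline would iterate only over sources that are switched on.
ACTIVE_SOURCES = [s for s in SOURCES if s["enabled"]]
```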
**5. Test run** (90 min)

```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```

- Target: at least 10 articles scraped
- If successful → deploy for tomorrow's cron
### TOMORROW (Day 2)

**Morning:**

- Check overnight cron results
- Fix any new errors

**Afternoon:**

- Add 5 high-priority new sources:
  - OpenAI Blog
  - Anthropic Blog
  - Hugging Face Blog
  - Google AI Blog
  - MarkTechPost
- Test evening run (target: 25+ articles)
### DAY 3

- Add the remaining 17 new sources (30 total)
- Full test with all sources
- Verify monitoring works
### DAYS 4-7 (If time permits)

- Parallel scraping (cut runtime from ~90 min to ~40 min; see the sketch after this list)
- Source health scoring
- Image extraction improvements
- Translation quality enhancements
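One way that runtime cut could work: a thread pool over article URLs, reusing the hypothetical `extract_article` from the earlier sketch. The worker count and error handling below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List

from scraper_v2 import extract_article  # hypothetical import


def scrape_all(urls: List[str], workers: int = 8) -> Dict[str, str]:
    """Scrape article URLs concurrently; a failed URL is skipped, not fatal."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_article, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
                if text:
                    results[url] = text
            except Exception as exc:
                print(f"[skip] {url}: {exc}")
    return results
```

Since scraping is I/O-bound, threads (rather than processes) should be enough to capture most of the speedup.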
---
## 📋 Key Files to Review

### Planning Docs

1. **`SCRAPER-IMPROVEMENT-PLAN.md`** - Full technical plan
   - Current issues explained
   - 22 new RSS sources listed
   - Implementation details
   - Success metrics

2. **`BURMDDIT-TASKS.md`** - Task tracker
   - Day-by-day breakdown
   - Checkboxes for tracking progress
   - Daily checklist
   - Success criteria
### Code Files

1. `backend/scraper_v2.py` - New scraper (URGENT, to be created)
2. `backend/test_source.py` - Source tester (to be created)
3. `scripts/check-pipeline-health.sh` - Health monitor ✅ (done)
4. `scripts/source-stats.py` - Stats reporter ✅ (done)

### Config Files

1. `backend/config.py` - Source configuration
2. `backend/.env` - Environment variables (API keys)

---
## 🎯 Success Criteria

### Immediate (Today)

- ✅ At least 10 articles scraped in a test run
- ✅ At least 3 sources working
- ✅ Pipeline completes without crashing

### Day 3

- ✅ 30+ sources configured
- ✅ 40+ articles scraped per run
- ✅ <5% error rate

### Week 1

- ✅ 30-40 articles published daily
- ✅ 25/30 sources active
- ✅ 95%+ pipeline success rate
- ✅ Automatic monitoring working

---
## 🚨 Critical Path

**BLOCKER:** The scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.

**Timeline:**

- Now → +2h: Build `scraper_v2.py`
- +2h → +3h: Test sources
- +3h → +4.5h: Full pipeline test
- +4.5h: Deploy if successful

If this slips, the site stays broken for another day, which means more lost traffic.

---
## 📊 New Sources to Add (Top 10)

These are the highest-quality sources to prioritize:

1. **OpenAI Blog** - `https://openai.com/blog/rss/`
2. **Anthropic Blog** - `https://www.anthropic.com/rss`
3. **Hugging Face** - `https://huggingface.co/blog/feed.xml`
4. **Google AI** - `http://googleaiblog.blogspot.com/atom.xml`
5. **MarkTechPost** - `https://www.marktechpost.com/feed/`
6. **The Rundown AI** - `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
7. **Last Week in AI** - `https://lastweekin.ai/feed`
8. **Analytics India Magazine** - `https://analyticsindiamag.com/feed/`
9. **AI News** - `https://www.artificialintelligence-news.com/feed/rss/`
10. **KDnuggets** - `https://www.kdnuggets.com/feed`

(Full list of 22 sources in `SCRAPER-IMPROVEMENT-PLAN.md`)

---
## 🤖 Automatic Monitoring

**I've set up automatic health checks:**

- **Heartbeat monitoring** (every 2 hours)
  - Runs `scripts/check-pipeline-health.sh`
  - Alerts if: zero articles, high errors, or a stale pipeline

- **Daily checklist** (9 AM Singapore time)
  - Check overnight cron results
  - Review errors
  - Update task tracker
  - Report status

**You'll be notified automatically if:**

- The pipeline fails
- Article count drops below 10
- Error rate exceeds 50%
- No run in 36+ hours
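The shipped health check is a bash script, but its alert conditions translate roughly to the following. The state-file path and field names here are assumptions for illustration, not the script's actual implementation:

```python
import json
import time
from pathlib import Path
from typing import List

STATE = Path("backend/pipeline_state.json")  # assumed location of run stats


def health_alerts() -> List[str]:
    """Return alert strings; an empty list means the pipeline looks healthy."""
    state = json.loads(STATE.read_text())
    alerts: List[str] = []
    if state["articles_scraped"] < 10:
        alerts.append("article count below 10")
    attempts = state["articles_scraped"] + state["errors"]
    if attempts and state["errors"] / attempts > 0.5:
        alerts.append("error rate above 50%")
    if time.time() - state["last_run_ts"] > 36 * 3600:
        alerts.append("no run in 36+ hours")
    return alerts
```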
---
## 💬 Questions to Decide

1. **Should I start building `scraper_v2.py` now?**
   - Or do you want to review the plan first?

2. **Do you want to add all 22 sources at once, or gradually?**
   - Recommendation: Start with the top 10, then expand

3. **Should I deploy the fix automatically or ask first?**
   - Recommendation: Test first, then ask before deploying

4. **Priority: speed or perfection?**
   - Option A: Quick fix (2-4 hours, basic functionality)
   - Option B: Proper rebuild (1-2 days, all optimizations)

---
## 📞 Contact

**Owner:** Zeya Phyo
**Developer:** Bob
**Deadline:** ASAP (ideally today)

**Current time:** 2026-02-26 08:30 UTC (4:30 PM Singapore)

---
## 🚀 Ready to Start?

**Recommended action:** Let me start building `scraper_v2.py` now.

**Command to kick off:**

```
Yes, start fixing the scraper now
```

Or if you want to review the plan first:

```
Show me the technical details of scraper_v2.py first
```

**All planning documents are ready. Just need your go-ahead to execute. 🎯**