# Burmddit Web Scraper Improvement Plan

**Date:** 2026-02-26

**Status:** 🚧 In Progress

**Goal:** Fix scraper errors & expand to 30+ reliable AI news sources

---

## 📊 Current Status

### Issues Identified

**Pipeline Status:**

- ✅ Running daily at 1:00 AM UTC (9 AM Singapore)
- ❌ **0 articles scraped** since Feb 21
- 📉 Stuck at 87 articles total
- ⏰ Last successful run: Feb 21, 2026

**Scraper Errors:**

1. **newspaper3k library failures:**
   - `You must download() an article first!`
   - Affects: Ars Technica and other sources

2. **Python exceptions:**
   - `'set' object is not subscriptable`
   - Affects: Hacker News and various sources

3. **Network errors:**
   - 403 Forbidden responses
   - Sites blocking bot user agents

### Current Sources (8)

1. ✅ Medium (8 AI tags)
2. ❌ TechCrunch AI
3. ❌ VentureBeat AI
4. ❌ MIT Tech Review
5. ❌ The Verge AI
6. ❌ Wired AI
7. ❌ Ars Technica
8. ❌ Hacker News

---

## 🎯 Goals

### Phase 1: Fix Existing Scraper (Week 1)

- [ ] Debug and fix `newspaper3k` errors
- [ ] Implement fallback scraping methods
- [ ] Add error handling and retries
- [ ] Test all 8 existing sources

### Phase 2: Expand Sources (Week 2)

- [ ] Add 22 new RSS feeds
- [ ] Test each source individually
- [ ] Implement source health monitoring
- [ ] Balance scraping load

### Phase 3: Improve Pipeline (Week 3)

- [ ] Optimize article clustering
- [ ] Improve translation quality
- [ ] Add automatic health checks
- [ ] Set up alerts for failures

---

## 🔧 Technical Improvements

### 1. Replace newspaper3k

**Problem:** newspaper3k is an unreliable, outdated library.

**Solution:** Multi-layer scraping approach:

```text
Priority order:
1. Try newspaper3k (fast, but unreliable)
2. Fall back to BeautifulSoup + trafilatura (more reliable)
3. Fall back to requests + custom extractors
4. Skip the article if all methods fail
```

### 2. Better Error Handling

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods in priority order."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]

    for method in methods:
        try:
            article = method(url)
            # Require a minimum amount of body text before accepting the result
            if article and len(article['content']) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue

    logger.warning(f"All methods failed for {url}")
    return None
```

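The extractor helpers named above are not defined in this plan. As one hedged sketch of the last-resort layer, a bare-bones extractor can be built on the standard library's `html.parser` (the function names and the returned text format here are illustrative assumptions, not the final API; the real `extract_with_beautifulsoup` would use BeautifulSoup and return a `{'title', 'content'}` dict):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer blocks."""
    SKIP = {'script', 'style', 'nav', 'footer'}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Strip tags from raw HTML, keeping only visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    return '\n'.join(parser.chunks)
```

For example, `extract_text('<p>Hello</p><script>x()</script>')` keeps only the paragraph text and drops the script body.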
### 3. Rate Limiting & Headers

```python
# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,    # max per domain per run
    'delay_between_requests': 3,  # seconds
    'timeout': 15,                # seconds
    'max_retries': 2,
}
```

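A minimal sketch of how these limits might be enforced per domain (the function name and module-level state are assumptions; the real scraper could equally track this on a class):

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

RATE_LIMITS = {'requests_per_domain': 10, 'delay_between_requests': 3}

_request_counts = defaultdict(int)  # requests made per domain this run
_last_request_at = {}               # last request timestamp per domain

def allow_request(url: str, now: float = None) -> bool:
    """Return True iff a request to this URL's domain is allowed right now."""
    now = time.monotonic() if now is None else now
    domain = urlparse(url).netloc
    if _request_counts[domain] >= RATE_LIMITS['requests_per_domain']:
        return False  # per-run budget for this domain is spent
    last = _last_request_at.get(domain)
    if last is not None and now - last < RATE_LIMITS['delay_between_requests']:
        return False  # too soon after the previous request to this domain
    _request_counts[domain] += 1
    _last_request_at[domain] = now
    return True
```

The caller would check `allow_request(url)` before each fetch and either sleep or move on to a different domain when it returns `False`.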
### 4. Health Monitoring

Create `monitor-pipeline.sh`:

```bash
#!/bin/bash
# Check whether the latest pipeline run is healthy

LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
# \K keeps only the digits after the marker, so timestamps elsewhere on the line can't match
ARTICLES_SCRAPED=$(grep -oP 'Total articles scraped:\s*\K\d+' "$LATEST_LOG" | tail -1)
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # default to 0 if the marker line is missing

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"
```

---

## 📰 New RSS Feed Sources (22 Added)

1. **OpenAI Blog**
   - URL: `https://openai.com/blog/rss/`
   - Quality: 🔥🔥🔥 (Official source)

2. **Anthropic Blog**
   - URL: `https://www.anthropic.com/rss`
   - Quality: 🔥🔥🔥

3. **Hugging Face Blog**
   - URL: `https://huggingface.co/blog/feed.xml`
   - Quality: 🔥🔥🔥

4. **Google AI Blog**
   - URL: `http://googleaiblog.blogspot.com/atom.xml`
   - Quality: 🔥🔥🔥

5. **The Rundown AI**
   - URL: `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
   - Quality: 🔥🔥 (Daily newsletter)

6. **Last Week in AI**
   - URL: `https://lastweekin.ai/feed`
   - Quality: 🔥🔥 (Weekly summary)

7. **MarkTechPost**
   - URL: `https://www.marktechpost.com/feed/`
   - Quality: 🔥🔥 (Daily AI news)

8. **Analytics India Magazine**
   - URL: `https://analyticsindiamag.com/feed/`
   - Quality: 🔥 (Multiple daily posts)

9. **AI News (AINews.com)**
   - URL: `https://www.artificialintelligence-news.com/feed/rss/`
   - Quality: 🔥🔥

10. **KDnuggets**
    - URL: `https://www.kdnuggets.com/feed`
    - Quality: 🔥🔥 (ML/AI tutorials)

### Secondary Sources (12 sources)

11. **Latent Space**
    - URL: `https://www.latent.space/feed`

12. **The Gradient**
    - URL: `https://thegradient.pub/rss/`

13. **The Algorithmic Bridge**
    - URL: `https://thealgorithmicbridge.substack.com/feed`

14. **Simon Willison's Weblog**
    - URL: `https://simonwillison.net/atom/everything/`

15. **Interconnects**
    - URL: `https://www.interconnects.ai/feed`

16. **THE DECODER**
    - URL: `https://the-decoder.com/feed/`

17. **AI Business**
    - URL: `https://aibusiness.com/rss.xml`

18. **Unite.AI**
    - URL: `https://www.unite.ai/feed/`

19. **ScienceDaily AI**
    - URL: `https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml`

20. **The Guardian AI**
    - URL: `https://www.theguardian.com/technology/artificialintelligenceai/rss`

21. **Reuters Technology**
    - URL: `https://www.reutersagency.com/feed/?best-topics=tech`

22. **IEEE Spectrum AI**
    - URL: `https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss`

---

## 📋 Implementation Tasks

### Phase 1: Emergency Fixes (Days 1-3)

- [ ] **Task 1.1:** Install the `trafilatura` library
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  pip3 install trafilatura readability-lxml
  ```

- [ ] **Task 1.2:** Create a new `scraper_v2.py` with fallback methods
  - [ ] Implement multi-method extraction
  - [ ] Add user agent rotation
  - [ ] Add better error handling
  - [ ] Add retry logic with exponential backoff

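The retry logic above could be sketched as a decorator, assuming defaults matching `RATE_LIMITS['max_retries']` (the decorator name and delay values are illustrative):

```python
import time
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def with_retries(max_retries: int = 2, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff (1s, 2s, 4s, ...)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise  # out of retries: surface the last error
                    delay = base_delay * (2 ** attempt)
                    logger.debug(f"{func.__name__} failed ({e}), retrying in {delay}s")
                    time.sleep(delay)
        return wrapper
    return decorator
```

Usage would be `@with_retries(max_retries=2)` on each network-facing extractor.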
- [ ] **Task 1.3:** Test each existing source manually
  - [ ] Medium
  - [ ] TechCrunch
  - [ ] VentureBeat
  - [ ] MIT Tech Review
  - [ ] The Verge
  - [ ] Wired
  - [ ] Ars Technica
  - [ ] Hacker News

- [ ] **Task 1.4:** Update `config.py` with working sources only

- [ ] **Task 1.5:** Run a test pipeline
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  python3 run_pipeline.py
  ```

### Phase 2: Add New Sources (Days 4-7)

- [ ] **Task 2.1:** Update `config.py` with the 22 new RSS feeds

- [ ] **Task 2.2:** Test each new source individually
  - [ ] Create a `test_source.py` script
  - [ ] Verify article quality
  - [ ] Check extraction success rate

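The core of a `test_source.py` could be a feed parser over the standard library (a rough sketch only: this handles plain RSS 2.0, while Atom feeds such as the Google AI Blog need namespace handling, which a library like `feedparser` would cover):

```python
import xml.etree.ElementTree as ET

def parse_rss_items(feed_xml: str):
    """Return (title, link) pairs from an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter('item'):
        title = item.findtext('title', default='').strip()
        link = item.findtext('link', default='').strip()
        if title and link:  # skip malformed entries
            items.append((title, link))
    return items
```

The script would fetch each configured feed URL, run this parser, then attempt full extraction on a handful of links to compute a per-source success rate.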
- [ ] **Task 2.3:** Categorize sources by reliability
  - [ ] Tier 1: Official blogs (OpenAI, Anthropic, Google)
  - [ ] Tier 2: News sites (TechCrunch, The Verge)
  - [ ] Tier 3: Aggregators (Reddit, Hacker News)

- [ ] **Task 2.4:** Implement source health scoring
  ```python
  # Track success rates per source
  source_health = {
      'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
      'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
  }
  ```

- [ ] **Task 2.5:** Auto-disable sources with a <30% success rate

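Tasks 2.4 and 2.5 together might look like the following sketch; the dict fields mirror the `source_health` shape above, while the `enabled` flag and function name are assumptions:

```python
MIN_SUCCESS_RATE = 0.30  # below this, the source is auto-disabled (Task 2.5)

def update_source_health(health: dict, source: str, succeeded: bool) -> dict:
    """Record one scrape attempt and recompute the source's score."""
    stats = health.setdefault(source, {'attempts': 0, 'success': 0, 'score': 1.0})
    stats['attempts'] += 1
    if succeeded:
        stats['success'] += 1
    stats['score'] = stats['success'] / stats['attempts']
    stats['enabled'] = stats['score'] >= MIN_SUCCESS_RATE
    return stats
```

The pipeline would call this after every article attempt and skip sources whose `enabled` flag has dropped to `False`.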
### Phase 3: Monitoring & Alerts (Days 8-10)

- [ ] **Task 3.1:** Create `monitor-pipeline.sh`
  - [ ] Check articles scraped > 10
  - [ ] Check pipeline runtime < 120 minutes
  - [ ] Check latest article age < 24 hours

- [ ] **Task 3.2:** Set up heartbeat monitoring
  - [ ] Add to `HEARTBEAT.md`
  - [ ] Alert if the pipeline fails 2 days in a row

- [ ] **Task 3.3:** Create a weekly health report cron job
  ```python
  # Weekly report: source stats, article counts, error rates
  ```

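A hypothetical core for that weekly report, assuming the `Total articles scraped: N` log line used by `monitor-pipeline.sh` (the function name and report fields are illustrative):

```python
import re

def summarize_logs(log_texts):
    """Aggregate 'Total articles scraped: N' across a week of pipeline logs."""
    counts = []
    for text in log_texts:
        matches = re.findall(r'Total articles scraped:\s*(\d+)', text)
        if matches:
            counts.append(int(matches[-1]))  # last occurrence = final run total
    return {
        'runs': len(counts),
        'total': sum(counts),
        'avg_per_run': sum(counts) / len(counts) if counts else 0.0,
    }
```

The cron job would read the last 7 `pipeline-*.log` files, run this, and mail or post the summary.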
- [ ] **Task 3.4:** Build a dashboard for source health
  - [ ] Show the last 7 days of scraping stats
  - [ ] Success rates per source
  - [ ] Articles published per day

### Phase 4: Optimization (Days 11-14)

- [ ] **Task 4.1:** Parallel scraping
  - [ ] Use `asyncio` or `multiprocessing`
  - [ ] Reduce pipeline time from 90 min → 30 min

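Since the work is I/O-bound, a thread pool is a third option alongside `asyncio` and `multiprocessing`; a sketch (the `scrape_fn` argument stands in for whatever per-URL scraper Task 1.2 produces):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, scrape_fn, max_workers: int = 8):
    """Scrape URLs concurrently; failed URLs map to None."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None  # leave failure handling to the caller
    return results
```

`max_workers` would need tuning against the per-domain rate limits so parallelism doesn't defeat the polite-scraping delays.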
- [ ] **Task 4.2:** Smart article selection
  - [ ] Prioritize trending topics
  - [ ] Avoid duplicate content
  - [ ] Better topic clustering

- [ ] **Task 4.3:** Image extraction improvements
  - [ ] Better image quality filtering
  - [ ] Fall back to AI-generated images
  - [ ] Optimize image loading

- [ ] **Task 4.4:** Translation quality improvements
  - [ ] A/B test different Claude prompts
  - [ ] Add human review for top articles
  - [ ] Build a glossary of technical terms

---

## 🔔 Monitoring Setup

### Daily Checks (via Heartbeat)

Add to `HEARTBEAT.md`:

```markdown
## Burmddit Pipeline Health

**Check every 2nd heartbeat (every ~1 hour):**

1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If the pipeline failed: Check logs and report the error
```

### Weekly Report (via Cron)

Already set up! Runs Wednesdays at 9 AM.

---

## 📈 Success Metrics

### Week 1 Targets

- ✅ 0 → 30+ articles scraped per day
- ✅ At least 5 of 8 existing sources working
- ✅ Pipeline completion success rate >80%

### Week 2 Targets

- ✅ 30 total sources active
- ✅ 50+ articles scraped per day
- ✅ Source health monitoring active

### Week 3 Targets

- ✅ 30-40 articles published per day
- ✅ Auto-recovery from errors
- ✅ Weekly reports sent automatically

### Month 1 Goals

- 🎯 1,200+ articles published (40/day average)
- 🎯 Google AdSense eligible (1,000+ articles)
- 🎯 10,000+ page views/month

---

## 🚨 Immediate Actions (Today)

1. **Install dependencies:**
   ```bash
   pip3 install trafilatura readability-lxml fake-useragent
   ```

2. **Create `scraper_v2.py`** (see next file)

3. **Test a manual scrape:**
   ```bash
   python3 test_scraper.py --source openai --limit 5
   ```

4. **Fix and deploy by tomorrow morning** (before the 1 AM UTC run)

---

## 📁 New Files to Create

1. `/backend/scraper_v2.py` - Improved scraper
2. `/backend/test_scraper.py` - Individual source tester
3. `/scripts/monitor-pipeline.sh` - Health check script
4. `/scripts/check-pipeline-health.sh` - Quick status check
5. `/scripts/source-health-report.py` - Weekly stats

---

**Next Step:** Create `scraper_v2.py` with robust fallback methods.