# Burmddit Web Scraper Improvement Plan

**Date:** 2026-02-26
**Status:** 🚧 In Progress
**Goal:** Fix scraper errors & expand to 30+ reliable AI news sources

---

## 📊 Current Status

### Issues Identified

**Pipeline Status:**
- ✅ Running daily at 1:00 AM UTC (9 AM Singapore)
- ❌ **0 articles scraped** since Feb 21
- 📉 Stuck at 87 articles total
- ⏰ Last successful run: Feb 21, 2026

**Scraper Errors:**
1. **newspaper3k library failures:**
   - `You must download() an article first!`
   - Affects: Ars Technica, other sources
2. **Python exceptions:**
   - `'set' object is not subscriptable`
   - Affects: Hacker News, various sources
3. **Network errors:**
   - 403 Forbidden responses
   - Sites blocking bot user agents

### Current Sources (8)
1. ✅ Medium (8 AI tags)
2. ❌ TechCrunch AI
3. ❌ VentureBeat AI
4. ❌ MIT Tech Review
5. ❌ The Verge AI
6. ❌ Wired AI
7. ❌ Ars Technica
8. ❌ Hacker News

---

## 🎯 Goals

### Phase 1: Fix Existing Scraper (Week 1)
- [ ] Debug and fix `newspaper3k` errors
- [ ] Implement fallback scraping methods
- [ ] Add error handling and retries
- [ ] Test all 8 existing sources

### Phase 2: Expand Sources (Week 2)
- [ ] Add 22 new RSS feeds
- [ ] Test each source individually
- [ ] Implement source health monitoring
- [ ] Balance scraping load

### Phase 3: Improve Pipeline (Week 3)
- [ ] Optimize article clustering
- [ ] Improve translation quality
- [ ] Add automatic health checks
- [ ] Set up alerts for failures

---

## 🔧 Technical Improvements

### 1. Replace newspaper3k

**Problem:** Unreliable, outdated library
**Solution:** Multi-layer scraping approach

```python
# Priority order:
# 1. Try newspaper3k (fast, but unreliable)
# 2. Fall back to BeautifulSoup + trafilatura (more reliable)
# 3. Fall back to requests + custom extractors
# 4. Skip the article if all methods fail
```
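Phase 1 also calls for retry logic with exponential backoff, which none of the snippets below spell out. A minimal stdlib-only sketch (the name `with_retries` and its defaults are illustrative, not existing code) that could wrap any of the extraction methods:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], retries: int = 2,
                 base_delay: float = 1.0, jitter: float = 0.5) -> T:
    """Call fn(); on failure, sleep base_delay * 2**attempt (plus jitter) and retry."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                # Exponential backoff: 1s, 2s, 4s, ... plus random jitter
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, jitter))
    raise last_exc
```

The `retries` argument would naturally come from the `max_retries` entry of the `RATE_LIMITS` config; jitter keeps repeated runs from hammering a domain at fixed intervals.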
### 2. Better Error Handling

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods in priority order."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]
    for method in methods:
        try:
            article = method(url)
            # Require a minimum body length to filter out boilerplate-only pages
            if article and len(article['content']) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue
    logger.warning(f"All methods failed for {url}")
    return None
```

### 3. Rate Limiting & Headers

```python
# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,   # max per domain per run
    'delay_between_requests': 3, # seconds
    'timeout': 15,               # seconds
    'max_retries': 2,
}
```

### 4. Health Monitoring

Create `monitor-pipeline.sh`:

```bash
#!/bin/bash
# Check if the pipeline is healthy
LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
ARTICLES_SCRAPED=$(grep "Total articles scraped:" "$LATEST_LOG" | tail -1 | grep -oP '\d+')

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"
```

---

## 📰 New RSS Feed Sources (22 Added)

### Top Priority (10 sources)

1. **OpenAI Blog**
   - URL: `https://openai.com/blog/rss/`
   - Quality: 🔥🔥🔥 (Official source)
2. **Anthropic Blog**
   - URL: `https://www.anthropic.com/rss`
   - Quality: 🔥🔥🔥
3. **Hugging Face Blog**
   - URL: `https://huggingface.co/blog/feed.xml`
   - Quality: 🔥🔥🔥
4. **Google AI Blog**
   - URL: `http://googleaiblog.blogspot.com/atom.xml`
   - Quality: 🔥🔥🔥
5. **The Rundown AI**
   - URL: `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
   - Quality: 🔥🔥 (Daily newsletter)
6. **Last Week in AI**
   - URL: `https://lastweekin.ai/feed`
   - Quality: 🔥🔥 (Weekly summary)
7. **MarkTechPost**
   - URL: `https://www.marktechpost.com/feed/`
   - Quality: 🔥🔥 (Daily AI news)
8. **Analytics India Magazine**
   - URL: `https://analyticsindiamag.com/feed/`
   - Quality: 🔥 (Multiple daily posts)
9. **AI News (AINews.com)**
   - URL: `https://www.artificialintelligence-news.com/feed/rss/`
   - Quality: 🔥🔥
10. **KDnuggets**
    - URL: `https://www.kdnuggets.com/feed`
    - Quality: 🔥🔥 (ML/AI tutorials)

### Secondary Sources (12 sources)

11. **Latent Space**
    - URL: `https://www.latent.space/feed`
12. **The Gradient**
    - URL: `https://thegradient.pub/rss/`
13. **The Algorithmic Bridge**
    - URL: `https://thealgorithmicbridge.substack.com/feed`
14. **Simon Willison's Weblog**
    - URL: `https://simonwillison.net/atom/everything/`
15. **Interconnects**
    - URL: `https://www.interconnects.ai/feed`
16. **THE DECODER**
    - URL: `https://the-decoder.com/feed/`
17. **AI Business**
    - URL: `https://aibusiness.com/rss.xml`
18. **Unite.AI**
    - URL: `https://www.unite.ai/feed/`
19. **ScienceDaily AI**
    - URL: `https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml`
20. **The Guardian AI**
    - URL: `https://www.theguardian.com/technology/artificialintelligenceai/rss`
21. **Reuters Technology**
    - URL: `https://www.reutersagency.com/feed/?best-topics=tech`
22. **IEEE Spectrum AI**
    - URL: `https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss`

---

## 📋 Implementation Tasks

### Phase 1: Emergency Fixes (Days 1-3)

- [ ] **Task 1.1:** Install `trafilatura` library
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  pip3 install trafilatura readability-lxml
  ```
- [ ] **Task 1.2:** Create new `scraper_v2.py` with fallback methods
  - [ ] Implement multi-method extraction
  - [ ] Add user agent rotation
  - [ ] Better error handling
  - [ ] Retry logic with exponential backoff
- [ ] **Task 1.3:** Test each existing source manually
  - [ ] Medium
  - [ ] TechCrunch
  - [ ] VentureBeat
  - [ ] MIT Tech Review
  - [ ] The Verge
  - [ ] Wired
  - [ ] Ars Technica
  - [ ] Hacker News
- [ ] **Task 1.4:** Update `config.py` with working sources only
- [ ] **Task 1.5:** Run test pipeline
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  python3 run_pipeline.py
  ```

### Phase 2: Add New Sources (Days 4-7)

- [ ] **Task 2.1:** Update `config.py` with 22 new RSS feeds
- [ ] **Task 2.2:** Test each new source individually
  - [ ] Create `test_source.py` script
  - [ ] Verify article quality
  - [ ] Check extraction success rate
- [ ] **Task 2.3:** Categorize sources by reliability
  - [ ] Tier 1: Official blogs (OpenAI, Anthropic, Google)
  - [ ] Tier 2: News sites (TechCrunch, Verge)
  - [ ] Tier 3: Aggregators (Reddit, HN)
- [ ] **Task 2.4:** Implement source health scoring
  ```python
  # Track success rates per source
  source_health = {
      'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
      'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
  }
  ```
- [ ] **Task 2.5:** Auto-disable sources with <30% success rate

### Phase 3: Monitoring & Alerts (Days 8-10)

- [ ] **Task 3.1:** Create `monitor-pipeline.sh`
  - [ ] Check articles scraped > 10
  - [ ] Check pipeline runtime < 120 minutes
  - [ ] Check latest article age < 24 hours
- [ ] **Task 3.2:** Set up heartbeat monitoring
  - [ ] Add to `HEARTBEAT.md`
  - [ ] Alert if pipeline fails 2 days in a row
- [ ] **Task 3.3:** Create weekly health report cron job
  ```python
  # Weekly report: source stats, article counts, error rates
  ```
- [ ] **Task 3.4:** Dashboard for source health
  - [ ] Show last 7 days of scraping stats
  - [ ] Success rates per source
  - [ ] Articles published per day

### Phase 4: Optimization (Days 11-14)

- [ ] **Task 4.1:** Parallel scraping
  - [ ] Use `asyncio` or `multiprocessing`
  - [ ] Reduce pipeline time from 90 min → 30 min
- [ ] **Task 4.2:** Smart article selection
  - [ ] Prioritize trending topics
  - [ ] Avoid duplicate content
  - [ ] Better topic clustering
- [ ] **Task 4.3:** Image extraction improvements
  - [ ] Better image quality filtering
  - [ ] Fallback to AI-generated images
  - [ ] Optimize image loading
- [ ] **Task 4.4:** Translation quality improvements
  - [ ] A/B test different Claude prompts
  - [ ] Add human review for top articles
  - [ ] Build glossary of technical terms

---

## 🔔 Monitoring Setup

### Daily Checks (via Heartbeat)

Add to `HEARTBEAT.md`:

```markdown
## Burmddit Pipeline Health

**Check every 2nd heartbeat (every ~1 hour):**
1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If pipeline failed: Check logs and report error
```

### Weekly Report (via Cron)

Already set up! Runs Wednesdays at 9 AM.

---

## 📈 Success Metrics

### Week 1 Targets
- ✅ 0 → 30+ articles scraped per day
- ✅ At least 5/8 existing sources working
- ✅ Pipeline completion success rate >80%

### Week 2 Targets
- ✅ 30 total sources active
- ✅ 50+ articles scraped per day
- ✅ Source health monitoring active

### Week 3 Targets
- ✅ 30-40 articles published per day
- ✅ Auto-recovery from errors
- ✅ Weekly reports sent automatically

### Month 1 Goals
- 🎯 1,200+ articles published (40/day avg)
- 🎯 Google AdSense eligible (1,000+ articles)
- 🎯 10,000+ page views/month

---

## 🚨 Immediate Actions (Today)
1. **Install dependencies:**
   ```bash
   pip3 install trafilatura readability-lxml fake-useragent
   ```
2. **Create `scraper_v2.py`** (see next file)
3. **Test manual scrape:**
   ```bash
   python3 test_scraper.py --source openai --limit 5
   ```
4. **Fix and deploy by tomorrow morning** (before the 1 AM UTC run)

---

## 📁 New Files to Create

1. `/backend/scraper_v2.py` - Improved scraper
2. `/backend/test_scraper.py` - Individual source tester
3. `/scripts/monitor-pipeline.sh` - Health check script
4. `/scripts/check-pipeline-health.sh` - Quick status check
5. `/scripts/source-health-report.py` - Weekly stats

---

**Next Step:** Create `scraper_v2.py` with robust fallback methods
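As a starting point for `source-health-report.py`, the per-source success tracking and the <30% auto-disable rule from Tasks 2.4–2.5 could be sketched as below. Function names are illustrative, and `min_attempts` is an added assumption so a brand-new source isn't disabled on a handful of samples:

```python
def record_attempt(source_health: dict, source: str, success: bool) -> None:
    """Update a source's attempt/success counts and recompute its score."""
    entry = source_health.setdefault(source, {'attempts': 0, 'success': 0, 'score': 0.0})
    entry['attempts'] += 1
    if success:
        entry['success'] += 1
    entry['score'] = entry['success'] / entry['attempts']

def enabled_sources(source_health: dict, threshold: float = 0.30,
                    min_attempts: int = 10) -> list:
    """Sources kept active: score >= threshold, or too few attempts to judge yet."""
    return sorted(s for s, e in source_health.items()
                  if e['attempts'] < min_attempts or e['score'] >= threshold)
```

With this shape, the `source_health` dict from Task 2.4 can be dumped to JSON after each run and fed straight into the weekly report and the dashboard in Task 3.4.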