burmddit/SCRAPER-IMPROVEMENT-PLAN.md
Zeya Phyo f51ac4afa4 Add web admin features + fix scraper & translator
Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
2026-02-26 09:17:50 +00:00


Burmddit Web Scraper Improvement Plan

Date: 2026-02-26
Status: 🚧 In Progress
Goal: Fix scraper errors & expand to 30+ reliable AI news sources


📊 Current Status

Issues Identified

Pipeline Status:

  • Running daily at 1:00 AM UTC (9 AM Singapore)
  • 0 articles scraped since Feb 21
  • 📉 Stuck at 87 articles total
  • Last successful run: Feb 21, 2026

Scraper Errors:

  1. newspaper3k library failures:

    • `You must download() an article first!`
    • Affects: Ars Technica, other sources
  2. Python exceptions:

    • `'set' object is not subscriptable`
    • Affects: Hacker News, various sources
  3. Network errors:

    • 403 Forbidden responses
    • Sites blocking bot user agents

Current Sources (8)

  1. Medium (8 AI tags)
  2. TechCrunch AI
  3. VentureBeat AI
  4. MIT Tech Review
  5. The Verge AI
  6. Wired AI
  7. Ars Technica
  8. Hacker News

🎯 Goals

Phase 1: Fix Existing Scraper (Week 1)

  • Debug and fix newspaper3k errors
  • Implement fallback scraping methods
  • Add error handling and retries
  • Test all 8 existing sources

Phase 2: Expand Sources (Week 2)

  • Add 22 new RSS feeds
  • Test each source individually
  • Implement source health monitoring
  • Balance scraping load

Phase 3: Improve Pipeline (Week 3)

  • Optimize article clustering
  • Improve translation quality
  • Add automatic health checks
  • Set up alerts for failures

🔧 Technical Improvements

1. Replace newspaper3k

Problem: Unreliable, outdated library

Solution: Multi-layer scraping approach

# Priority order:
1. Try newspaper3k (fast, but unreliable)
2. Fall back to trafilatura (more reliable extraction)
3. Fall back to BeautifulSoup + custom extractors
4. Skip the article if all methods fail

2. Better Error Handling

import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods in priority order."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]

    for method in methods:
        try:
            article = method(url)
            # Require a minimum of substantive text before accepting a result
            if article and len(article.get('content', '')) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue

    logger.warning(f"All methods failed for {url}")
    return None

3. Rate Limiting & Headers

# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,  # max per domain per run
    'delay_between_requests': 3,  # seconds
    'timeout': 15,  # seconds
    'max_retries': 2
}
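A minimal sketch of how these limits and the user-agent rotation might be enforced per domain. The `DomainRateLimiter` class and its method names are illustrative, not existing code:

```python
import random
import time
from collections import defaultdict

# Hypothetical helper enforcing the RATE_LIMITS settings above per domain.
class DomainRateLimiter:
    def __init__(self, requests_per_domain=10, delay_between_requests=3):
        self.max_requests = requests_per_domain
        self.delay = delay_between_requests
        self.counts = defaultdict(int)
        self.last_request = {}

    def allow(self, domain: str) -> bool:
        """Return False once a domain has used up its per-run budget."""
        return self.counts[domain] < self.max_requests

    def wait_and_record(self, domain: str) -> None:
        """Sleep to respect the inter-request delay, then count the request."""
        last = self.last_request.get(domain)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last_request[domain] = time.monotonic()
        self.counts[domain] += 1

def random_user_agent(agents):
    """Pick a user agent at random for each request."""
    return random.choice(agents)
```

The scraper would call `allow()` before each fetch and skip the domain once its budget is spent, keeping the run respectful to each site.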

4. Health Monitoring

Create monitor-pipeline.sh:

#!/bin/bash
# Check if pipeline is healthy

LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
ARTICLES_SCRAPED=$(grep "Total articles scraped:" "$LATEST_LOG" | tail -1 | grep -oP '\d+')
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # default to 0 if the log line is missing

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"

📰 New RSS Feed Sources (22 Added)

Top Priority (10 sources)

  1. OpenAI Blog

    • URL: https://openai.com/blog/rss/
    • Quality: 🔥🔥🔥 (Official source)
  2. Anthropic Blog

    • URL: https://www.anthropic.com/rss
    • Quality: 🔥🔥🔥
  3. Hugging Face Blog

    • URL: https://huggingface.co/blog/feed.xml
    • Quality: 🔥🔥🔥
  4. Google AI Blog

    • URL: http://googleaiblog.blogspot.com/atom.xml
    • Quality: 🔥🔥🔥
  5. The Rundown AI

    • URL: https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml
    • Quality: 🔥🔥 (Daily newsletter)
  6. Last Week in AI

    • URL: https://lastweekin.ai/feed
    • Quality: 🔥🔥 (Weekly summary)
  7. MarkTechPost

    • URL: https://www.marktechpost.com/feed/
    • Quality: 🔥🔥 (Daily AI news)
  8. Analytics India Magazine

    • URL: https://analyticsindiamag.com/feed/
    • Quality: 🔥 (Multiple daily posts)
  9. AI News (artificialintelligence-news.com)

    • URL: https://www.artificialintelligence-news.com/feed/rss/
    • Quality: 🔥🔥
  10. KDnuggets

    • URL: https://www.kdnuggets.com/feed
    • Quality: 🔥🔥 (ML/AI tutorials)

Secondary Sources (12 sources)

  1. Latent Space

    • URL: https://www.latent.space/feed
  2. The Gradient

    • URL: https://thegradient.pub/rss/
  3. The Algorithmic Bridge

    • URL: https://thealgorithmicbridge.substack.com/feed
  4. Simon Willison's Weblog

    • URL: https://simonwillison.net/atom/everything/
  5. Interconnects

    • URL: https://www.interconnects.ai/feed
  6. THE DECODER

    • URL: https://the-decoder.com/feed/
  7. AI Business

    • URL: https://aibusiness.com/rss.xml
  8. Unite.AI

    • URL: https://www.unite.ai/feed/
  9. ScienceDaily AI

    • URL: https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml
  10. The Guardian AI

    • URL: https://www.theguardian.com/technology/artificialintelligenceai/rss
  11. Reuters Technology

    • URL: https://www.reutersagency.com/feed/?best-topics=tech
  12. IEEE Spectrum AI

    • URL: https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss
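Each feed above can be smoke-tested before it goes into config.py. A stdlib-only sketch (the `parse_rss_items` helper is illustrative; the real pipeline may use a library such as feedparser instead, and note that Atom feeds like Google AI Blog use `<entry>` rather than `<item>`):

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text: str) -> list:
    """Extract title/link pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items

def fetch_feed(url: str, timeout: int = 15) -> list:
    """Download a feed and return its items (network call)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_rss_items(resp.read().decode("utf-8", errors="replace"))
```

A feed passes the smoke test if it parses cleanly and yields at least one item with a non-empty link.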

📋 Implementation Tasks

Phase 1: Emergency Fixes (Days 1-3)

  • Task 1.1: Install trafilatura library

    cd /home/ubuntu/.openclaw/workspace/burmddit/backend
    pip3 install trafilatura readability-lxml
    
  • Task 1.2: Create new scraper_v2.py with fallback methods

    • Implement multi-method extraction
    • Add user agent rotation
    • Better error handling
    • Retry logic with exponential backoff
  • Task 1.3: Test each existing source manually

    • Medium
    • TechCrunch
    • VentureBeat
    • MIT Tech Review
    • The Verge
    • Wired
    • Ars Technica
    • Hacker News
  • Task 1.4: Update config.py with working sources only

  • Task 1.5: Run test pipeline

    cd /home/ubuntu/.openclaw/workspace/burmddit/backend
    python3 run_pipeline.py
    
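Task 1.2's retry logic with exponential backoff could look like this sketch (the function name and delay schedule are assumptions, not existing code):

```python
import logging
import time

logger = logging.getLogger(__name__)

def retry_with_backoff(fn, url, max_retries=2, base_delay=1.0):
    """Call fn(url), retrying failures with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fn(url)
        except Exception as e:
            logger.debug("attempt %d for %s failed: %s", attempt + 1, url, e)
            if attempt == max_retries:
                raise  # out of retries; let the fallback chain handle it
            time.sleep(base_delay * (2 ** attempt))
```

With `max_retries: 2` from RATE_LIMITS, a flaky source gets three attempts (delays of 1 s then 2 s) before the scraper moves on to the next extraction method.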

Phase 2: Add New Sources (Days 4-7)

  • Task 2.1: Update config.py with 22 new RSS feeds

  • Task 2.2: Test each new source individually

    • Create test_source.py script
    • Verify article quality
    • Check extraction success rate
  • Task 2.3: Categorize sources by reliability

    • Tier 1: Official blogs (OpenAI, Anthropic, Google)
    • Tier 2: News sites (TechCrunch, Verge)
    • Tier 3: Aggregators (Reddit, HN)
  • Task 2.4: Implement source health scoring

    # Track success rates per source
    source_health = {
        'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
        'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
    }
    
  • Task 2.5: Auto-disable sources with <30% success rate
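Tasks 2.4 and 2.5 combined might look like this sketch (function names are illustrative):

```python
def health_score(stats: dict) -> float:
    """Success rate for one source; 0.0 when it has never been attempted."""
    attempts = stats.get("attempts", 0)
    return stats.get("success", 0) / attempts if attempts else 0.0

def active_sources(source_health: dict, min_score: float = 0.30) -> list:
    """Names of sources at or above the auto-disable threshold."""
    return [name for name, stats in source_health.items()
            if health_score(stats) >= min_score]
```

Sources falling below the 30% threshold are simply skipped on the next run rather than deleted, so they can recover if the site starts responding again.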

Phase 3: Monitoring & Alerts (Days 8-10)

  • Task 3.1: Create monitor-pipeline.sh

    • Check articles scraped > 10
    • Check pipeline runtime < 120 minutes
    • Check latest article age < 24 hours
  • Task 3.2: Set up heartbeat monitoring

    • Add to HEARTBEAT.md
    • Alert if pipeline fails 2 days in a row
  • Task 3.3: Create weekly health report cron job

    # Weekly report: source stats, article counts, error rates
    
  • Task 3.4: Dashboard for source health

    • Show last 7 days of scraping stats
    • Success rates per source
    • Articles published per day
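Task 3.1's three checks could also be expressed in Python for reuse by the dashboard; a sketch using the thresholds from the task list (the function name is an assumption):

```python
def pipeline_health_failures(articles_scraped: int,
                             runtime_minutes: float,
                             latest_article_age_hours: float) -> list:
    """Return failed checks; an empty list means the pipeline is healthy."""
    failures = []
    if articles_scraped <= 10:
        failures.append(f"only {articles_scraped} articles scraped")
    if runtime_minutes >= 120:
        failures.append(f"pipeline ran {runtime_minutes:.0f} min")
    if latest_article_age_hours >= 24:
        failures.append(f"latest article is {latest_article_age_hours:.0f}h old")
    return failures
```

Returning the list of failures, rather than a bare boolean, gives the heartbeat alert a ready-made message.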

Phase 4: Optimization (Days 11-14)

  • Task 4.1: Parallel scraping

    • Use asyncio or multiprocessing
    • Reduce pipeline time from 90min → 30min
  • Task 4.2: Smart article selection

    • Prioritize trending topics
    • Avoid duplicate content
    • Better topic clustering
  • Task 4.3: Image extraction improvements

    • Better image quality filtering
    • Fallback to AI-generated images
    • Optimize image loading
  • Task 4.4: Translation quality improvements

    • A/B test different Claude prompts
    • Add human review for top articles
    • Build glossary of technical terms
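Task 4.1's parallel scraping could be sketched with asyncio and a semaphore to cap concurrency (the coroutine names are illustrative; real fetching would use an async HTTP client such as aiohttp):

```python
import asyncio

async def scrape_all(urls, fetch, max_concurrent=10):
    """Scrape urls concurrently, at most max_concurrent in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Overlapping per-article network latency is what would shrink the 90-minute pipeline; the semaphore keeps the per-domain rate-limiting rules above intact.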

🔔 Monitoring Setup

Daily Checks (via Heartbeat)

Add to HEARTBEAT.md:

## Burmddit Pipeline Health

**Check every 2nd heartbeat (every ~1 hour):**

1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If pipeline failed: Check logs and report error

Weekly Report (via Cron)

Already set up! Runs Wednesdays at 9 AM.


📈 Success Metrics

Week 1 Targets

  • 0 → 30+ articles scraped per day
  • At least 5/8 existing sources working
  • Pipeline completion success rate >80%

Week 2 Targets

  • 30 total sources active
  • 50+ articles scraped per day
  • Source health monitoring active

Week 3 Targets

  • 30-40 articles published per day
  • Auto-recovery from errors
  • Weekly reports sent automatically

Month 1 Goals

  • 🎯 1,200+ articles published (40/day avg)
  • 🎯 Google AdSense eligible (1000+ articles)
  • 🎯 10,000+ page views/month

🚨 Immediate Actions (Today)

  1. Install dependencies:

    pip3 install trafilatura readability-lxml fake-useragent
    
  2. Create scraper_v2.py (see next file)

  3. Test manual scrape:

    python3 test_scraper.py --source openai --limit 5
    
  4. Fix and deploy by tomorrow morning (before 1 AM UTC run)


📁 New Files to Create

  1. /backend/scraper_v2.py - Improved scraper
  2. /backend/test_scraper.py - Individual source tester
  3. /scripts/monitor-pipeline.sh - Health check script
  4. /scripts/check-pipeline-health.sh - Quick status check
  5. /scripts/source-health-report.py - Weekly stats

Next Step: Create scraper_v2.py with robust fallback methods