burmddit/SCRAPER-IMPROVEMENT-PLAN.md
Zeya Phyo f51ac4afa4 Add web admin features + fix scraper & translator
Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
2026-02-26 09:17:50 +00:00


Burmddit Web Scraper Improvement Plan

Date: 2026-02-26
Status: 🚧 In Progress
Goal: Fix scraper errors & expand to 30+ reliable AI news sources


📊 Current Status

Issues Identified

Pipeline Status:

  • Running daily at 1:00 AM UTC (9 AM Singapore)
  • 0 articles scraped since Feb 21
  • 📉 Stuck at 87 articles total
  • Last successful run: Feb 21, 2026

Scraper Errors:

  1. newspaper3k library failures:

    • `You must download() an article first!`
    • Affects: Ars Technica, other sources
  2. Python exceptions:

    • `'set' object is not subscriptable`
    • Affects: Hacker News, various sources
  3. Network errors:

    • 403 Forbidden responses
    • Sites blocking bot user agents

Current Sources (8)

  1. Medium (8 AI tags)
  2. TechCrunch AI
  3. VentureBeat AI
  4. MIT Tech Review
  5. The Verge AI
  6. Wired AI
  7. Ars Technica
  8. Hacker News

🎯 Goals

Phase 1: Fix Existing Scraper (Week 1)

  • Debug and fix newspaper3k errors
  • Implement fallback scraping methods
  • Add error handling and retries
  • Test all 8 existing sources

Phase 2: Expand Sources (Week 2)

  • Add 22 new RSS feeds
  • Test each source individually
  • Implement source health monitoring
  • Balance scraping load

Phase 3: Improve Pipeline (Week 3)

  • Optimize article clustering
  • Improve translation quality
  • Add automatic health checks
  • Set up alerts for failures

🔧 Technical Improvements

1. Replace newspaper3k

Problem: Unreliable, outdated library

Solution: Multi-layer scraping approach

# Priority order:
1. Try newspaper3k (fast, but unreliable)
2. Fall back to trafilatura (more reliable extraction)
3. Fall back to BeautifulSoup + custom extractors
4. Skip the article if all methods fail

2. Better Error Handling

import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods in priority order."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]

    for method in methods:
        try:
            article = method(url)
            # Require a minimum of substantive text before accepting a result
            if article and len(article.get('content', '')) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue

    logger.warning(f"All methods failed for {url}")
    return None

3. Rate Limiting & Headers

# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,  # max per domain per run
    'delay_between_requests': 3,  # seconds
    'timeout': 15,  # seconds
    'max_retries': 2
}
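A minimal sketch of how these limits and the user-agent rotation might be enforced per domain. The `DomainRateLimiter` class and its method names are illustrative, not existing code:

```python
import random
import time
from collections import defaultdict

# Hypothetical helper enforcing the RATE_LIMITS settings above per domain.
class DomainRateLimiter:
    def __init__(self, requests_per_domain=10, delay_between_requests=3):
        self.max_requests = requests_per_domain
        self.delay = delay_between_requests
        self.counts = defaultdict(int)
        self.last_request = {}

    def allow(self, domain: str) -> bool:
        """Return False once a domain has used up its per-run budget."""
        return self.counts[domain] < self.max_requests

    def wait_and_record(self, domain: str) -> None:
        """Sleep to respect the inter-request delay, then count the request."""
        last = self.last_request.get(domain)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last_request[domain] = time.monotonic()
        self.counts[domain] += 1

def random_user_agent(agents):
    """Pick a user agent at random for each request."""
    return random.choice(agents)
```

The scraper would call `allow()` before each fetch and skip the domain once its budget is spent, keeping the run respectful to each site.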

4. Health Monitoring

Create monitor-pipeline.sh:

#!/bin/bash
# Check if pipeline is healthy

LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
ARTICLES_SCRAPED=$(grep "Total articles scraped:" "$LATEST_LOG" | tail -1 | grep -oP '\d+')
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # default to 0 if the log line is missing

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"

📰 New RSS Feed Sources (22 Added)

Top Priority (10 sources)

  1. OpenAI Blog

    • URL: https://openai.com/blog/rss/
    • Quality: 🔥🔥🔥 (Official source)
  2. Anthropic Blog

    • URL: https://www.anthropic.com/rss
    • Quality: 🔥🔥🔥
  3. Hugging Face Blog

    • URL: https://huggingface.co/blog/feed.xml
    • Quality: 🔥🔥🔥
  4. Google AI Blog

    • URL: http://googleaiblog.blogspot.com/atom.xml
    • Quality: 🔥🔥🔥
  5. The Rundown AI

    • URL: https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml
    • Quality: 🔥🔥 (Daily newsletter)
  6. Last Week in AI

    • URL: https://lastweekin.ai/feed
    • Quality: 🔥🔥 (Weekly summary)
  7. MarkTechPost

    • URL: https://www.marktechpost.com/feed/
    • Quality: 🔥🔥 (Daily AI news)
  8. Analytics India Magazine

    • URL: https://analyticsindiamag.com/feed/
    • Quality: 🔥 (Multiple daily posts)
  9. AI News (artificialintelligence-news.com)

    • URL: https://www.artificialintelligence-news.com/feed/rss/
    • Quality: 🔥🔥
  10. KDnuggets

    • URL: https://www.kdnuggets.com/feed
    • Quality: 🔥🔥 (ML/AI tutorials)

Secondary Sources (12 sources)

  1. Latent Space

    • URL: https://www.latent.space/feed
  2. The Gradient

    • URL: https://thegradient.pub/rss/
  3. The Algorithmic Bridge

    • URL: https://thealgorithmicbridge.substack.com/feed
  4. Simon Willison's Weblog

    • URL: https://simonwillison.net/atom/everything/
  5. Interconnects

    • URL: https://www.interconnects.ai/feed
  6. THE DECODER

    • URL: https://the-decoder.com/feed/
  7. AI Business

    • URL: https://aibusiness.com/rss.xml
  8. Unite.AI

    • URL: https://www.unite.ai/feed/
  9. ScienceDaily AI

    • URL: https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml
  10. The Guardian AI

    • URL: https://www.theguardian.com/technology/artificialintelligenceai/rss
  11. Reuters Technology

    • URL: https://www.reutersagency.com/feed/?best-topics=tech
  12. IEEE Spectrum AI

    • URL: https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss
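Each feed above can be smoke-tested before it goes into config.py. A stdlib-only sketch (the `parse_rss_items` helper is illustrative; the real pipeline may use a library such as feedparser instead, and note that Atom feeds like Google AI Blog use `<entry>` rather than `<item>`):

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text: str) -> list:
    """Extract title/link pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items

def fetch_feed(url: str, timeout: int = 15) -> list:
    """Download a feed and return its items (network call)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_rss_items(resp.read().decode("utf-8", errors="replace"))
```

A feed passes the smoke test if it parses cleanly and yields at least one item with a non-empty link.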

📋 Implementation Tasks

Phase 1: Emergency Fixes (Days 1-3)

  • Task 1.1: Install trafilatura library

    cd /home/ubuntu/.openclaw/workspace/burmddit/backend
    pip3 install trafilatura readability-lxml
    
  • Task 1.2: Create new scraper_v2.py with fallback methods

    • Implement multi-method extraction
    • Add user agent rotation
    • Better error handling
    • Retry logic with exponential backoff
  • Task 1.3: Test each existing source manually

    • Medium
    • TechCrunch
    • VentureBeat
    • MIT Tech Review
    • The Verge
    • Wired
    • Ars Technica
    • Hacker News
  • Task 1.4: Update config.py with working sources only

  • Task 1.5: Run test pipeline

    cd /home/ubuntu/.openclaw/workspace/burmddit/backend
    python3 run_pipeline.py
    
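Task 1.2's retry logic with exponential backoff could look like this sketch (the function name and delay schedule are assumptions, not existing code):

```python
import logging
import time

logger = logging.getLogger(__name__)

def retry_with_backoff(fn, url, max_retries=2, base_delay=1.0):
    """Call fn(url), retrying failures with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fn(url)
        except Exception as e:
            logger.debug("attempt %d for %s failed: %s", attempt + 1, url, e)
            if attempt == max_retries:
                raise  # out of retries; let the fallback chain handle it
            time.sleep(base_delay * (2 ** attempt))
```

With `max_retries: 2` from RATE_LIMITS, a flaky source gets three attempts (delays of 1 s then 2 s) before the scraper moves on to the next extraction method.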

Phase 2: Add New Sources (Days 4-7)

  • Task 2.1: Update config.py with 22 new RSS feeds

  • Task 2.2: Test each new source individually

    • Create test_source.py script
    • Verify article quality
    • Check extraction success rate
  • Task 2.3: Categorize sources by reliability

    • Tier 1: Official blogs (OpenAI, Anthropic, Google)
    • Tier 2: News sites (TechCrunch, Verge)
    • Tier 3: Aggregators (Reddit, HN)
  • Task 2.4: Implement source health scoring

    # Track success rates per source
    source_health = {
        'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
        'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
    }
    
  • Task 2.5: Auto-disable sources with <30% success rate
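Tasks 2.4 and 2.5 combined might look like this sketch (function names are illustrative):

```python
def health_score(stats: dict) -> float:
    """Success rate for one source; 0.0 when it has never been attempted."""
    attempts = stats.get("attempts", 0)
    return stats.get("success", 0) / attempts if attempts else 0.0

def active_sources(source_health: dict, min_score: float = 0.30) -> list:
    """Names of sources at or above the auto-disable threshold."""
    return [name for name, stats in source_health.items()
            if health_score(stats) >= min_score]
```

Sources falling below the 30% threshold are simply skipped on the next run rather than deleted, so they can recover if the site starts responding again.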

Phase 3: Monitoring & Alerts (Days 8-10)

  • Task 3.1: Create monitor-pipeline.sh

    • Check articles scraped > 10
    • Check pipeline runtime < 120 minutes
    • Check latest article age < 24 hours
  • Task 3.2: Set up heartbeat monitoring

    • Add to HEARTBEAT.md
    • Alert if pipeline fails 2 days in a row
  • Task 3.3: Create weekly health report cron job

    # Weekly report: source stats, article counts, error rates
    
  • Task 3.4: Dashboard for source health

    • Show last 7 days of scraping stats
    • Success rates per source
    • Articles published per day
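Task 3.1's three checks could also be expressed in Python for reuse by the dashboard; a sketch using the thresholds from the task list (the function name is an assumption):

```python
def pipeline_health_failures(articles_scraped: int,
                             runtime_minutes: float,
                             latest_article_age_hours: float) -> list:
    """Return failed checks; an empty list means the pipeline is healthy."""
    failures = []
    if articles_scraped <= 10:
        failures.append(f"only {articles_scraped} articles scraped")
    if runtime_minutes >= 120:
        failures.append(f"pipeline ran {runtime_minutes:.0f} min")
    if latest_article_age_hours >= 24:
        failures.append(f"latest article is {latest_article_age_hours:.0f}h old")
    return failures
```

Returning the list of failures, rather than a bare boolean, gives the heartbeat alert a ready-made message.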

Phase 4: Optimization (Days 11-14)

  • Task 4.1: Parallel scraping

    • Use asyncio or multiprocessing
    • Reduce pipeline time from 90min → 30min
  • Task 4.2: Smart article selection

    • Prioritize trending topics
    • Avoid duplicate content
    • Better topic clustering
  • Task 4.3: Image extraction improvements

    • Better image quality filtering
    • Fallback to AI-generated images
    • Optimize image loading
  • Task 4.4: Translation quality improvements

    • A/B test different Claude prompts
    • Add human review for top articles
    • Build glossary of technical terms
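Task 4.1's parallel scraping could be sketched with asyncio and a semaphore to cap concurrency (the coroutine names are illustrative; real fetching would use an async HTTP client such as aiohttp):

```python
import asyncio

async def scrape_all(urls, fetch, max_concurrent=10):
    """Scrape urls concurrently, at most max_concurrent in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Overlapping per-article network latency is what would shrink the 90-minute pipeline; the semaphore keeps the per-domain rate-limiting rules above intact.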

🔔 Monitoring Setup

Daily Checks (via Heartbeat)

Add to HEARTBEAT.md:

## Burmddit Pipeline Health

**Check every 2nd heartbeat (every ~1 hour):**

1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If pipeline failed: Check logs and report error

Weekly Report (via Cron)

Already set up! Runs Wednesdays at 9 AM.


📈 Success Metrics

Week 1 Targets

  • 0 → 30+ articles scraped per day
  • At least 5/8 existing sources working
  • Pipeline completion success rate >80%

Week 2 Targets

  • 30 total sources active
  • 50+ articles scraped per day
  • Source health monitoring active

Week 3 Targets

  • 30-40 articles published per day
  • Auto-recovery from errors
  • Weekly reports sent automatically

Month 1 Goals

  • 🎯 1,200+ articles published (40/day avg)
  • 🎯 Google AdSense eligible (1000+ articles)
  • 🎯 10,000+ page views/month

🚨 Immediate Actions (Today)

  1. Install dependencies:

    pip3 install trafilatura readability-lxml fake-useragent
    
  2. Create scraper_v2.py (see next file)

  3. Test manual scrape:

    python3 test_scraper.py --source openai --limit 5
    
  4. Fix and deploy by tomorrow morning (before 1 AM UTC run)


📁 New Files to Create

  1. /backend/scraper_v2.py - Improved scraper
  2. /backend/test_scraper.py - Individual source tester
  3. /scripts/monitor-pipeline.sh - Health check script
  4. /scripts/check-pipeline-health.sh - Quick status check
  5. /scripts/source-health-report.py - Weekly stats

Next Step: Create scraper_v2.py with robust fallback methods