Burmddit Web Scraper Improvement Plan
Date: 2026-02-26
Status: 🚧 In Progress
Goal: Fix scraper errors & expand to 30+ reliable AI news sources
📊 Current Status
Issues Identified
Pipeline Status:
- ✅ Running daily at 1:00 AM UTC (9 AM Singapore)
- ❌ 0 articles scraped since Feb 21
- 📉 Stuck at 87 articles total
- ⏰ Last successful run: Feb 21, 2026
Scraper Errors:
- `newspaper3k` library failures: `You must download() an article first!`
  - Affects: Ars Technica, other sources
- Python exceptions: `'set' object is not subscriptable`
  - Affects: Hacker News, various sources
- Network errors:
  - 403 Forbidden responses
  - Sites blocking bot user agents
Current Sources (8)
- ✅ Medium (8 AI tags)
- ❌ TechCrunch AI
- ❌ VentureBeat AI
- ❌ MIT Tech Review
- ❌ The Verge AI
- ❌ Wired AI
- ❌ Ars Technica
- ❌ Hacker News
🎯 Goals
Phase 1: Fix Existing Scraper (Week 1)
- Debug and fix `newspaper3k` errors
- Implement fallback scraping methods
- Add error handling and retries
- Test all 8 existing sources
Phase 2: Expand Sources (Week 2)
- Add 22 new RSS feeds
- Test each source individually
- Implement source health monitoring
- Balance scraping load
Phase 3: Improve Pipeline (Week 3)
- Optimize article clustering
- Improve translation quality
- Add automatic health checks
- Set up alerts for failures
🔧 Technical Improvements
1. Replace newspaper3k
Problem: Unreliable, outdated library
Solution: Multi-layer scraping approach
```text
# Priority order:
1. Try newspaper3k (fast, but unreliable)
2. Fall back to BeautifulSoup + trafilatura (more reliable)
3. Fall back to requests + custom extractors
4. Skip the article if all methods fail
```
2. Better Error Handling
```python
from typing import Dict, Optional
import logging

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods until one yields enough content."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]
    for method in methods:
        try:
            article = method(url)
            if article and len(article['content']) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue
    logger.warning(f"All methods failed for {url}")
    return None
```
3. Rate Limiting & Headers
```python
# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,   # max per domain per run
    'delay_between_requests': 3, # seconds
    'timeout': 15,               # seconds
    'max_retries': 2,
}
```
4. Health Monitoring
Create monitor-pipeline.sh:
```bash
#!/bin/bash
# Check if the pipeline is healthy
LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
ARTICLES_SCRAPED=$(grep "Total articles scraped:" "$LATEST_LOG" | tail -1 | grep -oP '\d+')
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # default to 0 if the log line is missing

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi
echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"
```
📰 New RSS Feed Sources (22 Added)
Top Priority (10 sources)
1. OpenAI Blog
   - URL: https://openai.com/blog/rss/
   - Quality: 🔥🔥🔥 (Official source)
2. Anthropic Blog
   - URL: https://www.anthropic.com/rss
   - Quality: 🔥🔥🔥
3. Hugging Face Blog
   - URL: https://huggingface.co/blog/feed.xml
   - Quality: 🔥🔥🔥
4. Google AI Blog
   - URL: http://googleaiblog.blogspot.com/atom.xml
   - Quality: 🔥🔥🔥
5. The Rundown AI
   - URL: https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml
   - Quality: 🔥🔥 (Daily newsletter)
6. Last Week in AI
   - URL: https://lastweekin.ai/feed
   - Quality: 🔥🔥 (Weekly summary)
7. MarkTechPost
   - URL: https://www.marktechpost.com/feed/
   - Quality: 🔥🔥 (Daily AI news)
8. Analytics India Magazine
   - URL: https://analyticsindiamag.com/feed/
   - Quality: 🔥 (Multiple daily posts)
9. AI News (AINews.com)
   - URL: https://www.artificialintelligence-news.com/feed/rss/
   - Quality: 🔥🔥
10. KDnuggets
    - URL: https://www.kdnuggets.com/feed
    - Quality: 🔥🔥 (ML/AI tutorials)
Secondary Sources (12 sources)
1. Latent Space
   - URL: https://www.latent.space/feed
2. The Gradient
   - URL: https://thegradient.pub/rss/
3. The Algorithmic Bridge
   - URL: https://thealgorithmicbridge.substack.com/feed
4. Simon Willison's Weblog
   - URL: https://simonwillison.net/atom/everything/
5. Interconnects
   - URL: https://www.interconnects.ai/feed
6. THE DECODER
   - URL: https://the-decoder.com/feed/
7. AI Business
   - URL: https://aibusiness.com/rss.xml
8. Unite.AI
   - URL: https://www.unite.ai/feed/
9. ScienceDaily AI
   - URL: https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml
10. The Guardian AI
    - URL: https://www.theguardian.com/technology/artificialintelligenceai/rss
11. Reuters Technology
    - URL: https://www.reutersagency.com/feed/?best-topics=tech
12. IEEE Spectrum AI
    - URL: https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss
📋 Implementation Tasks
Phase 1: Emergency Fixes (Days 1-3)
- Task 1.1: Install the `trafilatura` library:
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  pip3 install trafilatura readability-lxml
  ```
- Task 1.2: Create a new `scraper_v2.py` with fallback methods
  - Implement multi-method extraction
  - Add user agent rotation
  - Better error handling
  - Retry logic with exponential backoff
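The exponential-backoff retry from Task 1.2 could be a small decorator along these lines (a sketch, not the actual `scraper_v2.py` code; the names and jitter term are illustrative):

```python
import logging
import random
import time
from functools import wraps

logger = logging.getLogger(__name__)

def with_backoff(max_retries: int = 2, base_delay: float = 1.0):
    """Retry a flaky function with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise  # out of retries: let the caller decide
                    # 1s, 2s, 4s, ... plus a little jitter
                    delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                    logger.debug("%s failed (%s), retrying in %.1fs",
                                 fn.__name__, e, delay)
                    time.sleep(delay)
        return wrapper
    return decorator
```

Wrapping each `extract_with_*` method this way keeps the retry policy in one place instead of duplicating try/except loops per source.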
- Task 1.3: Test each existing source manually
  - Medium
  - TechCrunch
  - VentureBeat
  - MIT Tech Review
  - The Verge
  - Wired
  - Ars Technica
  - Hacker News
- Task 1.4: Update `config.py` with working sources only
- Task 1.5: Run a test pipeline:
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  python3 run_pipeline.py
  ```
Phase 2: Add New Sources (Days 4-7)
- Task 2.1: Update `config.py` with the 22 new RSS feeds
- Task 2.2: Test each new source individually
  - Create a `test_source.py` script
  - Verify article quality
  - Check extraction success rate
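The per-source summary that a `test_source.py` script could print reduces to a small function like this (a sketch; the tuple format is an assumption, and the 500-character cutoff mirrors the minimum-content check in the fallback scraper):

```python
def extraction_report(results):
    """Summarize extraction attempts for one source.

    `results` is a list of (url, extracted_ok, content_length) tuples,
    one per article tried.
    """
    attempts = len(results)
    # Count only extractions that succeeded AND yielded real content
    successes = sum(1 for _, ok, length in results if ok and length > 500)
    rate = successes / attempts if attempts else 0.0
    return {'attempts': attempts, 'successes': successes,
            'success_rate': round(rate, 2)}
```

Running this over, say, the 5 newest entries of each feed gives a quick pass/fail signal per source before it goes into `config.py`.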
- Task 2.3: Categorize sources by reliability
  - Tier 1: Official blogs (OpenAI, Anthropic, Google)
  - Tier 2: News sites (TechCrunch, Verge)
  - Tier 3: Aggregators (Reddit, HN)
- Task 2.4: Implement source health scoring:
  ```python
  # Track success rates per source
  source_health = {
      'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
      'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
  }
  ```
- Task 2.5: Auto-disable sources with <30% success rate
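Tasks 2.4 and 2.5 together might look like the following sketch (hypothetical helpers; the 30% cutoff comes from Task 2.5, while the `min_attempts` guard is an added assumption so a brand-new source isn't disabled on its first failure):

```python
MIN_SUCCESS_RATE = 0.30

def update_health(health: dict, source: str, success: bool) -> None:
    """Record one scrape attempt and refresh the source's score."""
    entry = health.setdefault(source, {'attempts': 0, 'success': 0, 'score': 0.0})
    entry['attempts'] += 1
    entry['success'] += int(success)
    entry['score'] = entry['success'] / entry['attempts']

def active_sources(health: dict, min_attempts: int = 20) -> list:
    """Auto-disable: drop sources below the success threshold, but only
    once they have enough attempts to be judged fairly."""
    return [name for name, entry in health.items()
            if entry['attempts'] < min_attempts
            or entry['score'] >= MIN_SUCCESS_RATE]
```

Persisting `health` between runs (e.g. as JSON next to the logs) would let scores accumulate across daily pipelines.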
Phase 3: Monitoring & Alerts (Days 8-10)
- Task 3.1: Create `monitor-pipeline.sh`
  - Check articles scraped > 10
  - Check pipeline runtime < 120 minutes
  - Check latest article age < 24 hours
- Task 3.2: Set up heartbeat monitoring
  - Add to `HEARTBEAT.md`
  - Alert if the pipeline fails 2 days in a row
- Task 3.3: Create a weekly health report cron job
  - Weekly report: source stats, article counts, error rates
- Task 3.4: Dashboard for source health
  - Show the last 7 days of scraping stats
  - Success rates per source
  - Articles published per day
Phase 4: Optimization (Days 11-14)
- Task 4.1: Parallel scraping
  - Use `asyncio` or `multiprocessing`
  - Reduce pipeline time from 90 min to 30 min
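For the `asyncio` route, one way to fan out over URLs while capping concurrency (so no host gets hammered) is a semaphore around a per-URL worker. This is a sketch; `worker` stands in for whatever async scraper function Task 4.1 ends up with:

```python
import asyncio

async def scrape_all(urls, worker, max_concurrent=10):
    """Run an async per-URL scraper over all URLs, at most
    `max_concurrent` in flight at once. Exceptions are returned
    in-place so one bad URL doesn't abort the batch."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await worker(url)

    return await asyncio.gather(*(bounded(u) for u in urls),
                                return_exceptions=True)
```

`return_exceptions=True` means the caller can feed failures straight into the source health scoring rather than crashing the run.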
- Task 4.2: Smart article selection
  - Prioritize trending topics
  - Avoid duplicate content
  - Better topic clustering
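Duplicate avoidance could start with something as simple as hashing normalized headlines (a sketch; real clustering would go further, e.g. embedding similarity, and the function names are illustrative):

```python
import hashlib
import re

def title_fingerprint(title: str) -> str:
    """Normalize a headline (lowercase, alphanumeric words only)
    and hash it so near-identical titles collide."""
    words = re.findall(r'[a-z0-9]+', title.lower())
    return hashlib.sha256(' '.join(words).encode()).hexdigest()

def drop_duplicates(articles):
    """Keep the first article seen per fingerprint.

    `articles` is a list of dicts with at least a 'title' key.
    """
    seen, unique = set(), []
    for article in articles:
        fp = title_fingerprint(article['title'])
        if fp not in seen:
            seen.add(fp)
            unique.append(article)
    return unique
```

Since many of the 30 feeds cover the same announcements, even this crude pass should cut the cluster size noticeably before topic clustering runs.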
- Task 4.3: Image extraction improvements
  - Better image quality filtering
  - Fallback to AI-generated images
  - Optimize image loading
- Task 4.4: Translation quality improvements
  - A/B test different Claude prompts
  - Add human review for top articles
  - Build a glossary of technical terms
🔔 Monitoring Setup
Daily Checks (via Heartbeat)
Add to HEARTBEAT.md:
```markdown
## Burmddit Pipeline Health

**Check every 2nd heartbeat (every ~1 hour):**
1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If pipeline failed: Check logs and report error
```
Weekly Report (via Cron)
Already set up! Runs Wednesdays at 9 AM.
📈 Success Metrics
Week 1 Targets
- ✅ 0 → 30+ articles scraped per day
- ✅ At least 5/8 existing sources working
- ✅ Pipeline completion success rate >80%
Week 2 Targets
- ✅ 30 total sources active
- ✅ 50+ articles scraped per day
- ✅ Source health monitoring active
Week 3 Targets
- ✅ 30-40 articles published per day
- ✅ Auto-recovery from errors
- ✅ Weekly reports sent automatically
Month 1 Goals
- 🎯 1,200+ articles published (40/day avg)
- 🎯 Google AdSense eligible (1000+ articles)
- 🎯 10,000+ page views/month
🚨 Immediate Actions (Today)
1. Install dependencies:
   ```bash
   pip3 install trafilatura readability-lxml fake-useragent
   ```
2. Create `scraper_v2.py` (see next file)
3. Test a manual scrape:
   ```bash
   python3 test_scraper.py --source openai --limit 5
   ```
4. Fix and deploy by tomorrow morning (before the 1 AM UTC run)
📁 New Files to Create
- `/backend/scraper_v2.py`: Improved scraper
- `/backend/test_scraper.py`: Individual source tester
- `/scripts/monitor-pipeline.sh`: Health check script
- `/scripts/check-pipeline-health.sh`: Quick status check
- `/scripts/source-health-report.py`: Weekly stats
Next Step: Create scraper_v2.py with robust fallback methods