Add web admin features + fix scraper & translator

Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
---
Author: Zeya Phyo
Date: 2026-02-26 09:17:50 +00:00
Commit: f51ac4afa4 (parent 8bf5f342cd)
20 changed files with 4769 additions and 23 deletions

---

SCRAPER-IMPROVEMENT-PLAN.md (new file, 411 lines)

# Burmddit Web Scraper Improvement Plan
**Date:** 2026-02-26
**Status:** 🚧 In Progress
**Goal:** Fix scraper errors & expand to 30+ reliable AI news sources
---
## 📊 Current Status
### Issues Identified
**Pipeline Status:**
- ✅ Running daily at 1:00 AM UTC (9 AM Singapore)
- ❌ **0 articles scraped** since Feb 21
- 📉 Stuck at 87 articles total
- ⏰ Last successful run: Feb 21, 2026
**Scraper Errors:**
1. **newspaper3k library failures:**
- `You must download() an article first!`
- Affects: ArsTechnica, other sources
2. **Python exceptions:**
- `'set' object is not subscriptable`
- Affects: HackerNews, various sources
3. **Network errors:**
- 403 Forbidden responses
- Sites blocking bot user agents
### Current Sources (8)
1. ✅ Medium (8 AI tags)
2. ❌ TechCrunch AI
3. ❌ VentureBeat AI
4. ❌ MIT Tech Review
5. ❌ The Verge AI
6. ❌ Wired AI
7. ❌ Ars Technica
8. ❌ Hacker News
---
## 🎯 Goals
### Phase 1: Fix Existing Scraper (Week 1)
- [ ] Debug and fix `newspaper3k` errors
- [ ] Implement fallback scraping methods
- [ ] Add error handling and retries
- [ ] Test all 8 existing sources
### Phase 2: Expand Sources (Week 2)
- [ ] Add 22 new RSS feeds
- [ ] Test each source individually
- [ ] Implement source health monitoring
- [ ] Balance scraping load
### Phase 3: Improve Pipeline (Week 3)
- [ ] Optimize article clustering
- [ ] Improve translation quality
- [ ] Add automatic health checks
- [ ] Set up alerts for failures
---
## 🔧 Technical Improvements
### 1. Replace newspaper3k
**Problem:** Unreliable, outdated library
**Solution:** Multi-layer scraping, in priority order:

1. Try newspaper3k (fast, but unreliable)
2. Fall back to BeautifulSoup + trafilatura (more reliable)
3. Fall back to requests + custom extractors
4. Skip the article if all methods fail
### 2. Better Error Handling
```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods until one yields usable content."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]
    for method in methods:
        try:
            article = method(url)
            # Require enough text to filter out stubs and paywall fragments
            if article and len(article['content']) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue
    logger.warning(f"All methods failed for {url}")
    return None
```
### 3. Rate Limiting & Headers
```python
# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,    # max per domain per run
    'delay_between_requests': 3,  # seconds
    'timeout': 15,                # seconds
    'max_retries': 2,
}
```
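As a sketch of how these limits might be enforced (the `DomainRateLimiter` class and its method names are illustrative, not existing code), the scraper can track the last request time and request count per domain:

```python
import time
from collections import defaultdict

# Mirrors the RATE_LIMITS config above (illustrative values)
RATE_LIMITS = {
    'requests_per_domain': 10,
    'delay_between_requests': 3,
}

class DomainRateLimiter:
    """Track per-domain request counts and enforce inter-request delays."""

    def __init__(self, limits=RATE_LIMITS, clock=time.monotonic):
        self.limits = limits
        self.clock = clock              # injectable for testing
        self.last_request = {}          # domain -> timestamp of last request
        self.counts = defaultdict(int)  # domain -> requests this run

    def allow(self, domain):
        """False once the per-domain cap for this run is reached."""
        return self.counts[domain] < self.limits['requests_per_domain']

    def wait_time(self, domain):
        """Seconds to sleep before the next request to this domain."""
        last = self.last_request.get(domain)
        if last is None:
            return 0.0
        elapsed = self.clock() - last
        return max(0.0, self.limits['delay_between_requests'] - elapsed)

    def record(self, domain):
        """Call after each request, successful or not."""
        self.last_request[domain] = self.clock()
        self.counts[domain] += 1
```

The scraper would call `wait_time()` before each request, `time.sleep()` for that long, then `record()` afterwards.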
### 4. Health Monitoring
Create `monitor-pipeline.sh`:
```bash
#!/bin/bash
# Check if the pipeline is healthy
LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
ARTICLES_SCRAPED=$(grep "Total articles scraped:" "$LATEST_LOG" | tail -1 | grep -oP '\d+')
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # guard against an empty grep result

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"
```
---
## 📰 New RSS Feed Sources (22 Added)
### Top Priority (10 sources)
1. **OpenAI Blog**
- URL: `https://openai.com/blog/rss/`
- Quality: 🔥🔥🔥 (Official source)
2. **Anthropic Blog**
- URL: `https://www.anthropic.com/rss`
- Quality: 🔥🔥🔥
3. **Hugging Face Blog**
- URL: `https://huggingface.co/blog/feed.xml`
- Quality: 🔥🔥🔥
4. **Google AI Blog**
- URL: `http://googleaiblog.blogspot.com/atom.xml`
- Quality: 🔥🔥🔥
5. **The Rundown AI**
- URL: `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
- Quality: 🔥🔥 (Daily newsletter)
6. **Last Week in AI**
- URL: `https://lastweekin.ai/feed`
- Quality: 🔥🔥 (Weekly summary)
7. **MarkTechPost**
- URL: `https://www.marktechpost.com/feed/`
- Quality: 🔥🔥 (Daily AI news)
8. **Analytics India Magazine**
- URL: `https://analyticsindiamag.com/feed/`
- Quality: 🔥 (Multiple daily posts)
9. **AI News (AINews.com)**
- URL: `https://www.artificialintelligence-news.com/feed/rss/`
- Quality: 🔥🔥
10. **KDnuggets**
- URL: `https://www.kdnuggets.com/feed`
- Quality: 🔥🔥 (ML/AI tutorials)
### Secondary Sources (12 sources)
11. **Latent Space**
- URL: `https://www.latent.space/feed`
12. **The Gradient**
- URL: `https://thegradient.pub/rss/`
13. **The Algorithmic Bridge**
- URL: `https://thealgorithmicbridge.substack.com/feed`
14. **Simon Willison's Weblog**
- URL: `https://simonwillison.net/atom/everything/`
15. **Interconnects**
- URL: `https://www.interconnects.ai/feed`
16. **THE DECODER**
- URL: `https://the-decoder.com/feed/`
17. **AI Business**
- URL: `https://aibusiness.com/rss.xml`
18. **Unite.AI**
- URL: `https://www.unite.ai/feed/`
19. **ScienceDaily AI**
- URL: `https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml`
20. **The Guardian AI**
- URL: `https://www.theguardian.com/technology/artificialintelligenceai/rss`
21. **Reuters Technology**
- URL: `https://www.reutersagency.com/feed/?best-topics=tech`
22. **IEEE Spectrum AI**
- URL: `https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss`
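Before wiring these feeds into `config.py`, each one can be sanity-checked offline. Production code would likely use `feedparser`, but a stdlib-only sketch (the `feed_titles` helper is illustrative, not existing code) that handles both RSS 2.0 and Atom is enough for a quick smoke test:

```python
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'

def feed_titles(xml_text, limit=5):
    """Extract entry titles from RSS 2.0 (<item>) or Atom (<entry>) XML."""
    root = ET.fromstring(xml_text)
    titles = []
    # RSS 2.0: <rss><channel><item><title>
    for item in root.iter('item'):
        title = item.findtext('title')
        if title:
            titles.append(title.strip())
    # Atom: <feed><entry><title> (namespaced)
    for entry in root.iter(f'{ATOM_NS}entry'):
        title = entry.findtext(f'{ATOM_NS}title')
        if title:
            titles.append(title.strip())
    return titles[:limit]
```

Fetch each feed URL with `requests.get`, pass the body to `feed_titles`, and flag any feed that returns zero titles.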
---
## 📋 Implementation Tasks
### Phase 1: Emergency Fixes (Days 1-3)
- [ ] **Task 1.1:** Install `trafilatura` library
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml
```
- [ ] **Task 1.2:** Create new `scraper_v2.py` with fallback methods
- [ ] Implement multi-method extraction
- [ ] Add user agent rotation
- [ ] Better error handling
- [ ] Retry logic with exponential backoff
- [ ] **Task 1.3:** Test each existing source manually
- [ ] Medium
- [ ] TechCrunch
- [ ] VentureBeat
- [ ] MIT Tech Review
- [ ] The Verge
- [ ] Wired
- [ ] Ars Technica
- [ ] Hacker News
- [ ] **Task 1.4:** Update `config.py` with working sources only
- [ ] **Task 1.5:** Run test pipeline
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```
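The retry logic from Task 1.2 could look like the following sketch (`with_retries` is an illustrative helper, not existing code); it wraps any fetch function and backs off exponentially between attempts:

```python
import time

def with_retries(fetch, url, max_retries=2, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The default `max_retries=2` matches the `RATE_LIMITS` config; `sleep` is injectable so tests don't actually wait.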
### Phase 2: Add New Sources (Days 4-7)
- [ ] **Task 2.1:** Update `config.py` with 22 new RSS feeds
- [ ] **Task 2.2:** Test each new source individually
- [ ] Create `test_source.py` script
- [ ] Verify article quality
- [ ] Check extraction success rate
- [ ] **Task 2.3:** Categorize sources by reliability
- [ ] Tier 1: Official blogs (OpenAI, Anthropic, Google)
- [ ] Tier 2: News sites (TechCrunch, Verge)
- [ ] Tier 3: Aggregators (Reddit, HN)
- [ ] **Task 2.4:** Implement source health scoring
```python
# Track success rates per source
source_health = {
    'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
    'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
}
```
- [ ] **Task 2.5:** Auto-disable sources with <30% success rate
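A minimal sketch of the scoring and auto-disable logic behind Tasks 2.4 and 2.5 (the function names are illustrative, not existing code):

```python
def update_health(health, source, ok):
    """Record one scrape attempt and recompute the source's score."""
    stats = health.setdefault(source, {'attempts': 0, 'success': 0, 'score': 0.0})
    stats['attempts'] += 1
    if ok:
        stats['success'] += 1
    stats['score'] = stats['success'] / stats['attempts']
    return stats

def active_sources(health, threshold=0.30, min_attempts=10):
    """Sources above the auto-disable threshold (Task 2.5).

    Sources with too few attempts get the benefit of the doubt
    so a new feed isn't disabled after one bad run.
    """
    return [s for s, st in health.items()
            if st['attempts'] < min_attempts or st['score'] >= threshold]
```

The `min_attempts` grace period is an assumption; tune it once real per-source volumes are known.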
### Phase 3: Monitoring & Alerts (Days 8-10)
- [ ] **Task 3.1:** Create `monitor-pipeline.sh`
- [ ] Check articles scraped > 10
- [ ] Check pipeline runtime < 120 minutes
- [ ] Check latest article age < 24 hours
- [ ] **Task 3.2:** Set up heartbeat monitoring
- [ ] Add to `HEARTBEAT.md`
- [ ] Alert if pipeline fails 2 days in a row
- [ ] **Task 3.3:** Create weekly health report cron job
```python
# Weekly report: source stats, article counts, error rates
```
- [ ] **Task 3.4:** Dashboard for source health
- [ ] Show last 7 days of scraping stats
- [ ] Success rates per source
- [ ] Articles published per day
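The three checks from Task 3.1 can also live in Python alongside the pipeline, which makes them unit-testable (the `pipeline_healthy` helper below is a sketch, not existing code):

```python
from datetime import datetime, timedelta, timezone

def pipeline_healthy(articles_scraped, runtime_minutes, latest_article_time,
                     now=None):
    """Apply the three Phase 3 health checks; return (ok, problems)."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if articles_scraped <= 10:  # Task 3.1: articles scraped > 10
        problems.append(f'only {articles_scraped} articles scraped')
    if runtime_minutes >= 120:  # Task 3.1: runtime < 120 minutes
        problems.append(f'pipeline took {runtime_minutes} minutes')
    if now - latest_article_time >= timedelta(hours=24):  # freshness check
        problems.append('latest article older than 24 hours')
    return (not problems, problems)
```

`monitor-pipeline.sh` could shell out to this, or the checks can run as the pipeline's final step.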
### Phase 4: Optimization (Days 11-14)
- [ ] **Task 4.1:** Parallel scraping
- [ ] Use `asyncio` or `multiprocessing`
- [ ] Reduce pipeline time from 90min → 30min
- [ ] **Task 4.2:** Smart article selection
- [ ] Prioritize trending topics
- [ ] Avoid duplicate content
- [ ] Better topic clustering
- [ ] **Task 4.3:** Image extraction improvements
- [ ] Better image quality filtering
- [ ] Fallback to AI-generated images
- [ ] Optimize image loading
- [ ] **Task 4.4:** Translation quality improvements
- [ ] A/B test different Claude prompts
- [ ] Add human review for top articles
- [ ] Build glossary of technical terms
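For Task 4.1, a thread pool is the simplest way to parallelize the network-bound scraping; the sketch below (`scrape_parallel` is illustrative, not existing code) collects per-URL failures instead of letting one bad source abort the run. An `asyncio` version would follow the same shape:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_parallel(urls, fetch, max_workers=8):
    """Scrape many URLs concurrently; failures are collected, not fatal."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as e:
                errors[url] = str(e)
    return results, errors
```

`fetch` would be `scrape_with_fallback` in practice; keep `max_workers` modest so the per-domain rate limits still hold.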
---
## 🔔 Monitoring Setup
### Daily Checks (via Heartbeat)
Add to `HEARTBEAT.md`:
```markdown
## Burmddit Pipeline Health
**Check every 2nd heartbeat (every ~1 hour):**
1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If pipeline failed: Check logs and report error
```
### Weekly Report (via Cron)
Already set up! Runs Wednesdays at 9 AM.
---
## 📈 Success Metrics
### Week 1 Targets
- ✅ 0 → 30+ articles scraped per day
- ✅ At least 5/8 existing sources working
- ✅ Pipeline completion success rate >80%
### Week 2 Targets
- ✅ 30 total sources active
- ✅ 50+ articles scraped per day
- ✅ Source health monitoring active
### Week 3 Targets
- ✅ 30-40 articles published per day
- ✅ Auto-recovery from errors
- ✅ Weekly reports sent automatically
### Month 1 Goals
- 🎯 1,200+ articles published (40/day avg)
- 🎯 Google AdSense eligible (1000+ articles)
- 🎯 10,000+ page views/month
---
## 🚨 Immediate Actions (Today)
1. **Install dependencies:**
```bash
pip3 install trafilatura readability-lxml fake-useragent
```
2. **Create scraper_v2.py** (see next file)
3. **Test manual scrape:**
```bash
python3 test_scraper.py --source openai --limit 5
```
4. **Fix and deploy by tomorrow morning** (before 1 AM UTC run)
---
## 📁 New Files to Create
1. `/backend/scraper_v2.py` - Improved scraper
2. `/backend/test_scraper.py` - Individual source tester
3. `/scripts/monitor-pipeline.sh` - Health check script
4. `/scripts/check-pipeline-health.sh` - Quick status check
5. `/scripts/source-health-report.py` - Weekly stats
---
**Next Step:** Create `scraper_v2.py` with robust fallback methods