# Burmddit Web Scraper Improvement Plan

**Date:** 2026-02-26

**Status:** 🚧 In Progress

**Goal:** Fix scraper errors & expand to 30+ reliable AI news sources

---

## 📊 Current Status

### Issues Identified

**Pipeline Status:**

- ✅ Running daily at 1:00 AM UTC (9 AM Singapore)
- ❌ **0 articles scraped** since Feb 21
- 📉 Stuck at 87 articles total
- ⏰ Last successful run: Feb 21, 2026

**Scraper Errors:**

1. **newspaper3k library failures:**
   - `You must download() an article first!`
   - Affects: Ars Technica and other sources

2. **Python exceptions:**
   - `'set' object is not subscriptable`
   - Affects: Hacker News and various sources

3. **Network errors:**
   - 403 Forbidden responses
   - Sites blocking bot user agents

### Current Sources (8)

1. ✅ Medium (8 AI tags)
2. ❌ TechCrunch AI
3. ❌ VentureBeat AI
4. ❌ MIT Tech Review
5. ❌ The Verge AI
6. ❌ Wired AI
7. ❌ Ars Technica
8. ❌ Hacker News

---

## 🎯 Goals

### Phase 1: Fix Existing Scraper (Week 1)

- [ ] Debug and fix `newspaper3k` errors
- [ ] Implement fallback scraping methods
- [ ] Add error handling and retries
- [ ] Test all 8 existing sources

### Phase 2: Expand Sources (Week 2)

- [ ] Add 22 new RSS feeds
- [ ] Test each source individually
- [ ] Implement source health monitoring
- [ ] Balance scraping load

### Phase 3: Improve Pipeline (Week 3)

- [ ] Optimize article clustering
- [ ] Improve translation quality
- [ ] Add automatic health checks
- [ ] Set up alerts for failures

---

## 🔧 Technical Improvements

### 1. Replace newspaper3k

**Problem:** newspaper3k is an unreliable, outdated library.

**Solution:** Multi-layer scraping approach:

```text
Priority order:
1. Try newspaper3k (fast, but unreliable)
2. Fall back to BeautifulSoup + trafilatura (more reliable)
3. Fall back to requests + custom extractors
4. Skip the article if all methods fail
```

### 2. Better Error Handling

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods in priority order."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]

    for method in methods:
        try:
            article = method(url)
            # Require a minimum amount of body text before accepting the result
            if article and len(article['content']) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue

    logger.warning(f"All methods failed for {url}")
    return None
```

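The extractor helpers named above are not defined in this plan. As one hedged sketch of the last-resort layer, a bare-bones extractor can be built on the standard library's `html.parser` (the function names and the returned text format here are illustrative assumptions, not the final API; the real `extract_with_beautifulsoup` would use BeautifulSoup and return a `{'title', 'content'}` dict):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer blocks."""
    SKIP = {'script', 'style', 'nav', 'footer'}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Strip tags from raw HTML, keeping only visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    return '\n'.join(parser.chunks)
```

For example, `extract_text('<p>Hello</p><script>x()</script>')` keeps only the paragraph text and drops the script body.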
### 3. Rate Limiting & Headers

```python
# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,    # max per domain per run
    'delay_between_requests': 3,  # seconds
    'timeout': 15,                # seconds
    'max_retries': 2,
}
```

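A minimal sketch of how these limits might be enforced per domain (the function name and module-level state are assumptions; the real scraper could equally track this on a class):

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

RATE_LIMITS = {'requests_per_domain': 10, 'delay_between_requests': 3}

_request_counts = defaultdict(int)  # requests made per domain this run
_last_request_at = {}               # last request timestamp per domain

def allow_request(url: str, now: float = None) -> bool:
    """Return True iff a request to this URL's domain is allowed right now."""
    now = time.monotonic() if now is None else now
    domain = urlparse(url).netloc
    if _request_counts[domain] >= RATE_LIMITS['requests_per_domain']:
        return False  # per-run budget for this domain is spent
    last = _last_request_at.get(domain)
    if last is not None and now - last < RATE_LIMITS['delay_between_requests']:
        return False  # too soon after the previous request to this domain
    _request_counts[domain] += 1
    _last_request_at[domain] = now
    return True
```

The caller would check `allow_request(url)` before each fetch and either sleep or move on to a different domain when it returns `False`.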
### 4. Health Monitoring

Create `monitor-pipeline.sh`:

```bash
#!/bin/bash
# Check whether the latest pipeline run is healthy

LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
# \K keeps only the digits after the marker, so timestamps elsewhere on the line can't match
ARTICLES_SCRAPED=$(grep -oP 'Total articles scraped:\s*\K\d+' "$LATEST_LOG" | tail -1)
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # default to 0 if the marker line is missing

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"
```

---

## 📰 New RSS Feed Sources (22 Added)

1. **OpenAI Blog**
   - URL: `https://openai.com/blog/rss/`
   - Quality: 🔥🔥🔥 (Official source)

2. **Anthropic Blog**
   - URL: `https://www.anthropic.com/rss`
   - Quality: 🔥🔥🔥

3. **Hugging Face Blog**
   - URL: `https://huggingface.co/blog/feed.xml`
   - Quality: 🔥🔥🔥

4. **Google AI Blog**
   - URL: `http://googleaiblog.blogspot.com/atom.xml`
   - Quality: 🔥🔥🔥

5. **The Rundown AI**
   - URL: `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
   - Quality: 🔥🔥 (Daily newsletter)

6. **Last Week in AI**
   - URL: `https://lastweekin.ai/feed`
   - Quality: 🔥🔥 (Weekly summary)

7. **MarkTechPost**
   - URL: `https://www.marktechpost.com/feed/`
   - Quality: 🔥🔥 (Daily AI news)

8. **Analytics India Magazine**
   - URL: `https://analyticsindiamag.com/feed/`
   - Quality: 🔥 (Multiple daily posts)

9. **AI News (AINews.com)**
   - URL: `https://www.artificialintelligence-news.com/feed/rss/`
   - Quality: 🔥🔥

10. **KDnuggets**
    - URL: `https://www.kdnuggets.com/feed`
    - Quality: 🔥🔥 (ML/AI tutorials)

### Secondary Sources (12 sources)

11. **Latent Space**
    - URL: `https://www.latent.space/feed`

12. **The Gradient**
    - URL: `https://thegradient.pub/rss/`

13. **The Algorithmic Bridge**
    - URL: `https://thealgorithmicbridge.substack.com/feed`

14. **Simon Willison's Weblog**
    - URL: `https://simonwillison.net/atom/everything/`

15. **Interconnects**
    - URL: `https://www.interconnects.ai/feed`

16. **THE DECODER**
    - URL: `https://the-decoder.com/feed/`

17. **AI Business**
    - URL: `https://aibusiness.com/rss.xml`

18. **Unite.AI**
    - URL: `https://www.unite.ai/feed/`

19. **ScienceDaily AI**
    - URL: `https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml`

20. **The Guardian AI**
    - URL: `https://www.theguardian.com/technology/artificialintelligenceai/rss`

21. **Reuters Technology**
    - URL: `https://www.reutersagency.com/feed/?best-topics=tech`

22. **IEEE Spectrum AI**
    - URL: `https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss`

---

## 📋 Implementation Tasks

### Phase 1: Emergency Fixes (Days 1-3)

- [ ] **Task 1.1:** Install the `trafilatura` library
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  pip3 install trafilatura readability-lxml
  ```

- [ ] **Task 1.2:** Create a new `scraper_v2.py` with fallback methods
  - [ ] Implement multi-method extraction
  - [ ] Add user agent rotation
  - [ ] Add better error handling
  - [ ] Add retry logic with exponential backoff

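The retry logic above could be sketched as a decorator, assuming defaults matching `RATE_LIMITS['max_retries']` (the decorator name and delay values are illustrative):

```python
import time
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def with_retries(max_retries: int = 2, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff (1s, 2s, 4s, ...)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise  # out of retries: surface the last error
                    delay = base_delay * (2 ** attempt)
                    logger.debug(f"{func.__name__} failed ({e}), retrying in {delay}s")
                    time.sleep(delay)
        return wrapper
    return decorator
```

Usage would be `@with_retries(max_retries=2)` on each network-facing extractor.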
- [ ] **Task 1.3:** Test each existing source manually
  - [ ] Medium
  - [ ] TechCrunch
  - [ ] VentureBeat
  - [ ] MIT Tech Review
  - [ ] The Verge
  - [ ] Wired
  - [ ] Ars Technica
  - [ ] Hacker News

- [ ] **Task 1.4:** Update `config.py` with working sources only

- [ ] **Task 1.5:** Run a test pipeline
  ```bash
  cd /home/ubuntu/.openclaw/workspace/burmddit/backend
  python3 run_pipeline.py
  ```

### Phase 2: Add New Sources (Days 4-7)

- [ ] **Task 2.1:** Update `config.py` with the 22 new RSS feeds

- [ ] **Task 2.2:** Test each new source individually
  - [ ] Create a `test_source.py` script
  - [ ] Verify article quality
  - [ ] Check extraction success rate

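The core of a `test_source.py` could be a feed parser over the standard library (a rough sketch only: this handles plain RSS 2.0, while Atom feeds such as the Google AI Blog need namespace handling, which a library like `feedparser` would cover):

```python
import xml.etree.ElementTree as ET

def parse_rss_items(feed_xml: str):
    """Return (title, link) pairs from an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter('item'):
        title = item.findtext('title', default='').strip()
        link = item.findtext('link', default='').strip()
        if title and link:  # skip malformed entries
            items.append((title, link))
    return items
```

The script would fetch each configured feed URL, run this parser, then attempt full extraction on a handful of links to compute a per-source success rate.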
- [ ] **Task 2.3:** Categorize sources by reliability
  - [ ] Tier 1: Official blogs (OpenAI, Anthropic, Google)
  - [ ] Tier 2: News sites (TechCrunch, The Verge)
  - [ ] Tier 3: Aggregators (Reddit, Hacker News)

- [ ] **Task 2.4:** Implement source health scoring
  ```python
  # Track success rates per source
  source_health = {
      'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
      'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
  }
  ```

- [ ] **Task 2.5:** Auto-disable sources with a <30% success rate

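Tasks 2.4 and 2.5 together might look like the following sketch; the dict fields mirror the `source_health` shape above, while the `enabled` flag and function name are assumptions:

```python
MIN_SUCCESS_RATE = 0.30  # below this, the source is auto-disabled (Task 2.5)

def update_source_health(health: dict, source: str, succeeded: bool) -> dict:
    """Record one scrape attempt and recompute the source's score."""
    stats = health.setdefault(source, {'attempts': 0, 'success': 0, 'score': 1.0})
    stats['attempts'] += 1
    if succeeded:
        stats['success'] += 1
    stats['score'] = stats['success'] / stats['attempts']
    stats['enabled'] = stats['score'] >= MIN_SUCCESS_RATE
    return stats
```

The pipeline would call this after every article attempt and skip sources whose `enabled` flag has dropped to `False`.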
### Phase 3: Monitoring & Alerts (Days 8-10)

- [ ] **Task 3.1:** Create `monitor-pipeline.sh`
  - [ ] Check articles scraped > 10
  - [ ] Check pipeline runtime < 120 minutes
  - [ ] Check latest article age < 24 hours

- [ ] **Task 3.2:** Set up heartbeat monitoring
  - [ ] Add to `HEARTBEAT.md`
  - [ ] Alert if the pipeline fails 2 days in a row

- [ ] **Task 3.3:** Create a weekly health report cron job
  ```python
  # Weekly report: source stats, article counts, error rates
  ```

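A hypothetical core for that weekly report, assuming the `Total articles scraped: N` log line used by `monitor-pipeline.sh` (the function name and report fields are illustrative):

```python
import re

def summarize_logs(log_texts):
    """Aggregate 'Total articles scraped: N' across a week of pipeline logs."""
    counts = []
    for text in log_texts:
        matches = re.findall(r'Total articles scraped:\s*(\d+)', text)
        if matches:
            counts.append(int(matches[-1]))  # last occurrence = final run total
    return {
        'runs': len(counts),
        'total': sum(counts),
        'avg_per_run': sum(counts) / len(counts) if counts else 0.0,
    }
```

The cron job would read the last 7 `pipeline-*.log` files, run this, and mail or post the summary.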
- [ ] **Task 3.4:** Build a dashboard for source health
  - [ ] Show the last 7 days of scraping stats
  - [ ] Success rates per source
  - [ ] Articles published per day

### Phase 4: Optimization (Days 11-14)

- [ ] **Task 4.1:** Parallel scraping
  - [ ] Use `asyncio` or `multiprocessing`
  - [ ] Reduce pipeline time from 90 min → 30 min

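Since the work is I/O-bound, a thread pool is a third option alongside `asyncio` and `multiprocessing`; a sketch (the `scrape_fn` argument stands in for whatever per-URL scraper Task 1.2 produces):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, scrape_fn, max_workers: int = 8):
    """Scrape URLs concurrently; failed URLs map to None."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None  # leave failure handling to the caller
    return results
```

`max_workers` would need tuning against the per-domain rate limits so parallelism doesn't defeat the polite-scraping delays.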
- [ ] **Task 4.2:** Smart article selection
  - [ ] Prioritize trending topics
  - [ ] Avoid duplicate content
  - [ ] Better topic clustering

- [ ] **Task 4.3:** Image extraction improvements
  - [ ] Better image quality filtering
  - [ ] Fall back to AI-generated images
  - [ ] Optimize image loading

- [ ] **Task 4.4:** Translation quality improvements
  - [ ] A/B test different Claude prompts
  - [ ] Add human review for top articles
  - [ ] Build a glossary of technical terms

---

## 🔔 Monitoring Setup

### Daily Checks (via Heartbeat)

Add to `HEARTBEAT.md`:

```markdown
## Burmddit Pipeline Health

**Check every 2nd heartbeat (every ~1 hour):**

1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If the pipeline failed: Check logs and report the error
```

### Weekly Report (via Cron)

Already set up! Runs Wednesdays at 9 AM.

---

## 📈 Success Metrics

### Week 1 Targets

- ✅ 0 → 30+ articles scraped per day
- ✅ At least 5 of 8 existing sources working
- ✅ Pipeline completion success rate >80%

### Week 2 Targets

- ✅ 30 total sources active
- ✅ 50+ articles scraped per day
- ✅ Source health monitoring active

### Week 3 Targets

- ✅ 30-40 articles published per day
- ✅ Auto-recovery from errors
- ✅ Weekly reports sent automatically

### Month 1 Goals

- 🎯 1,200+ articles published (40/day average)
- 🎯 Google AdSense eligible (1,000+ articles)
- 🎯 10,000+ page views/month

---

## 🚨 Immediate Actions (Today)

1. **Install dependencies:**
   ```bash
   pip3 install trafilatura readability-lxml fake-useragent
   ```

2. **Create `scraper_v2.py`** (see next file)

3. **Test a manual scrape:**
   ```bash
   python3 test_scraper.py --source openai --limit 5
   ```

4. **Fix and deploy by tomorrow morning** (before the 1 AM UTC run)

---

## 📁 New Files to Create

1. `/backend/scraper_v2.py` - Improved scraper
2. `/backend/test_scraper.py` - Individual source tester
3. `/scripts/monitor-pipeline.sh` - Health check script
4. `/scripts/check-pipeline-health.sh` - Quick status check
5. `/scripts/source-health-report.py` - Weekly stats

---

**Next Step:** Create `scraper_v2.py` with robust fallback methods.