Add web admin features + fix scraper & translator

Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
---
Author: Zeya Phyo
Date: 2026-02-26 09:17:50 +00:00
Commit: f51ac4afa4 (parent 8bf5f342cd)
20 changed files with 4769 additions and 23 deletions

---

SCRAPER-IMPROVEMENT-PLAN.md (new file, 411 lines)

# Burmddit Web Scraper Improvement Plan
**Date:** 2026-02-26
**Status:** 🚧 In Progress
**Goal:** Fix scraper errors & expand to 30+ reliable AI news sources
---
## 📊 Current Status
### Issues Identified
**Pipeline Status:**
- ✅ Running daily at 1:00 AM UTC (9 AM Singapore)
- ❌ **0 articles scraped** since Feb 21
- 📉 Stuck at 87 articles total
- ⏰ Last successful run: Feb 21, 2026
**Scraper Errors:**
1. **newspaper3k library failures:**
- `You must download() an article first!`
- Affects: ArsTechnica, other sources
2. **Python exceptions:**
- `'set' object is not subscriptable`
- Affects: HackerNews, various sources
3. **Network errors:**
- 403 Forbidden responses
- Sites blocking bot user agents
### Current Sources (8)
1. ✅ Medium (8 AI tags)
2. ❌ TechCrunch AI
3. ❌ VentureBeat AI
4. ❌ MIT Tech Review
5. ❌ The Verge AI
6. ❌ Wired AI
7. ❌ Ars Technica
8. ❌ Hacker News
---
## 🎯 Goals
### Phase 1: Fix Existing Scraper (Week 1)
- [ ] Debug and fix `newspaper3k` errors
- [ ] Implement fallback scraping methods
- [ ] Add error handling and retries
- [ ] Test all 8 existing sources
### Phase 2: Expand Sources (Week 2)
- [ ] Add 22 new RSS feeds
- [ ] Test each source individually
- [ ] Implement source health monitoring
- [ ] Balance scraping load
### Phase 3: Improve Pipeline (Week 3)
- [ ] Optimize article clustering
- [ ] Improve translation quality
- [ ] Add automatic health checks
- [ ] Set up alerts for failures
---
## 🔧 Technical Improvements
### 1. Replace newspaper3k
**Problem:** Unreliable, outdated library
**Solution:** Multi-layer scraping, in priority order:

1. Try newspaper3k (fast, but unreliable)
2. Fall back to BeautifulSoup + trafilatura (more reliable)
3. Fall back to requests + custom extractors
4. Skip the article if all methods fail
### 2. Better Error Handling
```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

def scrape_with_fallback(url: str) -> Optional[Dict]:
    """Try multiple extraction methods until one yields usable content."""
    methods = [
        extract_with_newspaper,
        extract_with_trafilatura,
        extract_with_beautifulsoup,
    ]
    for method in methods:
        try:
            article = method(url)
            # Require enough text to filter out stubs and paywall fragments
            if article and len(article['content']) > 500:
                return article
        except Exception as e:
            logger.debug(f"{method.__name__} failed: {e}")
            continue
    logger.warning(f"All methods failed for {url}")
    return None
```
### 3. Rate Limiting & Headers
```python
# Better user agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    # ... more agents
]

# Respectful scraping
RATE_LIMITS = {
    'requests_per_domain': 10,    # max per domain per run
    'delay_between_requests': 3,  # seconds
    'timeout': 15,                # seconds
    'max_retries': 2,
}
```
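As a sketch of how these limits might be enforced (the `DomainRateLimiter` class and its method names are illustrative, not existing code), the scraper can track the last request time and request count per domain:

```python
import time
from collections import defaultdict

# Mirrors the RATE_LIMITS config above (illustrative values)
RATE_LIMITS = {
    'requests_per_domain': 10,
    'delay_between_requests': 3,
}

class DomainRateLimiter:
    """Track per-domain request counts and enforce inter-request delays."""

    def __init__(self, limits=RATE_LIMITS, clock=time.monotonic):
        self.limits = limits
        self.clock = clock              # injectable for testing
        self.last_request = {}          # domain -> timestamp of last request
        self.counts = defaultdict(int)  # domain -> requests this run

    def allow(self, domain):
        """False once the per-domain cap for this run is reached."""
        return self.counts[domain] < self.limits['requests_per_domain']

    def wait_time(self, domain):
        """Seconds to sleep before the next request to this domain."""
        last = self.last_request.get(domain)
        if last is None:
            return 0.0
        elapsed = self.clock() - last
        return max(0.0, self.limits['delay_between_requests'] - elapsed)

    def record(self, domain):
        """Call after each request, successful or not."""
        self.last_request[domain] = self.clock()
        self.counts[domain] += 1
```

The scraper would call `wait_time()` before each request, `time.sleep()` for that long, then `record()` afterwards.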
### 4. Health Monitoring
Create `monitor-pipeline.sh`:
```bash
#!/bin/bash
# Check if the pipeline is healthy
LATEST_LOG=$(ls -t /home/ubuntu/.openclaw/workspace/burmddit/logs/pipeline-*.log | head -1)
ARTICLES_SCRAPED=$(grep "Total articles scraped:" "$LATEST_LOG" | tail -1 | grep -oP '\d+')
ARTICLES_SCRAPED=${ARTICLES_SCRAPED:-0}  # guard against an empty grep result

if [ "$ARTICLES_SCRAPED" -lt 10 ]; then
    echo "⚠️ WARNING: Only $ARTICLES_SCRAPED articles scraped!"
    echo "Check logs: $LATEST_LOG"
    exit 1
fi

echo "✅ Pipeline healthy: $ARTICLES_SCRAPED articles scraped"
```
---
## 📰 New RSS Feed Sources (22 Added)
### Top Priority (10 sources)
1. **OpenAI Blog**
- URL: `https://openai.com/blog/rss/`
- Quality: 🔥🔥🔥 (Official source)
2. **Anthropic Blog**
- URL: `https://www.anthropic.com/rss`
- Quality: 🔥🔥🔥
3. **Hugging Face Blog**
- URL: `https://huggingface.co/blog/feed.xml`
- Quality: 🔥🔥🔥
4. **Google AI Blog**
- URL: `http://googleaiblog.blogspot.com/atom.xml`
- Quality: 🔥🔥🔥
5. **The Rundown AI**
- URL: `https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml`
- Quality: 🔥🔥 (Daily newsletter)
6. **Last Week in AI**
- URL: `https://lastweekin.ai/feed`
- Quality: 🔥🔥 (Weekly summary)
7. **MarkTechPost**
- URL: `https://www.marktechpost.com/feed/`
- Quality: 🔥🔥 (Daily AI news)
8. **Analytics India Magazine**
- URL: `https://analyticsindiamag.com/feed/`
- Quality: 🔥 (Multiple daily posts)
9. **AI News (AINews.com)**
- URL: `https://www.artificialintelligence-news.com/feed/rss/`
- Quality: 🔥🔥
10. **KDnuggets**
- URL: `https://www.kdnuggets.com/feed`
- Quality: 🔥🔥 (ML/AI tutorials)
### Secondary Sources (12 sources)
11. **Latent Space**
- URL: `https://www.latent.space/feed`
12. **The Gradient**
- URL: `https://thegradient.pub/rss/`
13. **The Algorithmic Bridge**
- URL: `https://thealgorithmicbridge.substack.com/feed`
14. **Simon Willison's Weblog**
- URL: `https://simonwillison.net/atom/everything/`
15. **Interconnects**
- URL: `https://www.interconnects.ai/feed`
16. **THE DECODER**
- URL: `https://the-decoder.com/feed/`
17. **AI Business**
- URL: `https://aibusiness.com/rss.xml`
18. **Unite.AI**
- URL: `https://www.unite.ai/feed/`
19. **ScienceDaily AI**
- URL: `https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml`
20. **The Guardian AI**
- URL: `https://www.theguardian.com/technology/artificialintelligenceai/rss`
21. **Reuters Technology**
- URL: `https://www.reutersagency.com/feed/?best-topics=tech`
22. **IEEE Spectrum AI**
- URL: `https://spectrum.ieee.org/feeds/topic/artificial-intelligence.rss`
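Before wiring these feeds into `config.py`, each one can be sanity-checked offline. Production code would likely use `feedparser`, but a stdlib-only sketch (the `feed_titles` helper is illustrative, not existing code) that handles both RSS 2.0 and Atom is enough for a quick smoke test:

```python
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'

def feed_titles(xml_text, limit=5):
    """Extract entry titles from RSS 2.0 (<item>) or Atom (<entry>) XML."""
    root = ET.fromstring(xml_text)
    titles = []
    # RSS 2.0: <rss><channel><item><title>
    for item in root.iter('item'):
        title = item.findtext('title')
        if title:
            titles.append(title.strip())
    # Atom: <feed><entry><title> (namespaced)
    for entry in root.iter(f'{ATOM_NS}entry'):
        title = entry.findtext(f'{ATOM_NS}title')
        if title:
            titles.append(title.strip())
    return titles[:limit]
```

Fetch each feed URL with `requests.get`, pass the body to `feed_titles`, and flag any feed that returns zero titles.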
---
## 📋 Implementation Tasks
### Phase 1: Emergency Fixes (Days 1-3)
- [ ] **Task 1.1:** Install `trafilatura` library
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml
```
- [ ] **Task 1.2:** Create new `scraper_v2.py` with fallback methods
- [ ] Implement multi-method extraction
- [ ] Add user agent rotation
- [ ] Better error handling
- [ ] Retry logic with exponential backoff
- [ ] **Task 1.3:** Test each existing source manually
- [ ] Medium
- [ ] TechCrunch
- [ ] VentureBeat
- [ ] MIT Tech Review
- [ ] The Verge
- [ ] Wired
- [ ] Ars Technica
- [ ] Hacker News
- [ ] **Task 1.4:** Update `config.py` with working sources only
- [ ] **Task 1.5:** Run test pipeline
```bash
cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
```
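The retry logic from Task 1.2 could look like the following sketch (`with_retries` is an illustrative helper, not existing code); it wraps any fetch function and backs off exponentially between attempts:

```python
import time

def with_retries(fetch, url, max_retries=2, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The default `max_retries=2` matches the `RATE_LIMITS` config; `sleep` is injectable so tests don't actually wait.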
### Phase 2: Add New Sources (Days 4-7)
- [ ] **Task 2.1:** Update `config.py` with 22 new RSS feeds
- [ ] **Task 2.2:** Test each new source individually
- [ ] Create `test_source.py` script
- [ ] Verify article quality
- [ ] Check extraction success rate
- [ ] **Task 2.3:** Categorize sources by reliability
- [ ] Tier 1: Official blogs (OpenAI, Anthropic, Google)
- [ ] Tier 2: News sites (TechCrunch, Verge)
- [ ] Tier 3: Aggregators (Reddit, HN)
- [ ] **Task 2.4:** Implement source health scoring
```python
# Track success rates per source
source_health = {
    'openai': {'attempts': 100, 'success': 98, 'score': 0.98},
    'medium': {'attempts': 100, 'success': 45, 'score': 0.45},
}
```
- [ ] **Task 2.5:** Auto-disable sources with <30% success rate
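A minimal sketch of the scoring and auto-disable logic behind Tasks 2.4 and 2.5 (the function names are illustrative, not existing code):

```python
def update_health(health, source, ok):
    """Record one scrape attempt and recompute the source's score."""
    stats = health.setdefault(source, {'attempts': 0, 'success': 0, 'score': 0.0})
    stats['attempts'] += 1
    if ok:
        stats['success'] += 1
    stats['score'] = stats['success'] / stats['attempts']
    return stats

def active_sources(health, threshold=0.30, min_attempts=10):
    """Sources above the auto-disable threshold (Task 2.5).

    Sources with too few attempts get the benefit of the doubt
    so a new feed isn't disabled after one bad run.
    """
    return [s for s, st in health.items()
            if st['attempts'] < min_attempts or st['score'] >= threshold]
```

The `min_attempts` grace period is an assumption; tune it once real per-source volumes are known.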
### Phase 3: Monitoring & Alerts (Days 8-10)
- [ ] **Task 3.1:** Create `monitor-pipeline.sh`
- [ ] Check articles scraped > 10
- [ ] Check pipeline runtime < 120 minutes
- [ ] Check latest article age < 24 hours
- [ ] **Task 3.2:** Set up heartbeat monitoring
- [ ] Add to `HEARTBEAT.md`
- [ ] Alert if pipeline fails 2 days in a row
- [ ] **Task 3.3:** Create weekly health report cron job
```python
# Weekly report: source stats, article counts, error rates
```
- [ ] **Task 3.4:** Dashboard for source health
- [ ] Show last 7 days of scraping stats
- [ ] Success rates per source
- [ ] Articles published per day
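The three checks from Task 3.1 can also live in Python alongside the pipeline, which makes them unit-testable (the `pipeline_healthy` helper below is a sketch, not existing code):

```python
from datetime import datetime, timedelta, timezone

def pipeline_healthy(articles_scraped, runtime_minutes, latest_article_time,
                     now=None):
    """Apply the three Phase 3 health checks; return (ok, problems)."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if articles_scraped <= 10:  # Task 3.1: articles scraped > 10
        problems.append(f'only {articles_scraped} articles scraped')
    if runtime_minutes >= 120:  # Task 3.1: runtime < 120 minutes
        problems.append(f'pipeline took {runtime_minutes} minutes')
    if now - latest_article_time >= timedelta(hours=24):  # freshness check
        problems.append('latest article older than 24 hours')
    return (not problems, problems)
```

`monitor-pipeline.sh` could shell out to this, or the checks can run as the pipeline's final step.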
### Phase 4: Optimization (Days 11-14)
- [ ] **Task 4.1:** Parallel scraping
- [ ] Use `asyncio` or `multiprocessing`
- [ ] Reduce pipeline time from 90min → 30min
- [ ] **Task 4.2:** Smart article selection
- [ ] Prioritize trending topics
- [ ] Avoid duplicate content
- [ ] Better topic clustering
- [ ] **Task 4.3:** Image extraction improvements
- [ ] Better image quality filtering
- [ ] Fallback to AI-generated images
- [ ] Optimize image loading
- [ ] **Task 4.4:** Translation quality improvements
- [ ] A/B test different Claude prompts
- [ ] Add human review for top articles
- [ ] Build glossary of technical terms
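For Task 4.1, a thread pool is the simplest way to parallelize the network-bound scraping; the sketch below (`scrape_parallel` is illustrative, not existing code) collects per-URL failures instead of letting one bad source abort the run. An `asyncio` version would follow the same shape:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_parallel(urls, fetch, max_workers=8):
    """Scrape many URLs concurrently; failures are collected, not fatal."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as e:
                errors[url] = str(e)
    return results, errors
```

`fetch` would be `scrape_with_fallback` in practice; keep `max_workers` modest so the per-domain rate limits still hold.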
---
## 🔔 Monitoring Setup
### Daily Checks (via Heartbeat)
Add to `HEARTBEAT.md`:
```markdown
## Burmddit Pipeline Health
**Check every 2nd heartbeat (every ~1 hour):**
1. Run: `/home/ubuntu/.openclaw/workspace/burmddit/scripts/check-pipeline-health.sh`
2. If articles_scraped < 10: Alert immediately
3. If pipeline failed: Check logs and report error
```
### Weekly Report (via Cron)
Already set up! Runs Wednesdays at 9 AM.
---
## 📈 Success Metrics
### Week 1 Targets
- ✅ 0 → 30+ articles scraped per day
- ✅ At least 5/8 existing sources working
- ✅ Pipeline completion success rate >80%
### Week 2 Targets
- ✅ 30 total sources active
- ✅ 50+ articles scraped per day
- ✅ Source health monitoring active
### Week 3 Targets
- ✅ 30-40 articles published per day
- ✅ Auto-recovery from errors
- ✅ Weekly reports sent automatically
### Month 1 Goals
- 🎯 1,200+ articles published (40/day avg)
- 🎯 Google AdSense eligible (1000+ articles)
- 🎯 10,000+ page views/month
---
## 🚨 Immediate Actions (Today)
1. **Install dependencies:**
```bash
pip3 install trafilatura readability-lxml fake-useragent
```
2. **Create scraper_v2.py** (see next file)
3. **Test manual scrape:**
```bash
python3 test_scraper.py --source openai --limit 5
```
4. **Fix and deploy by tomorrow morning** (before 1 AM UTC run)
---
## 📁 New Files to Create
1. `/backend/scraper_v2.py` - Improved scraper
2. `/backend/test_scraper.py` - Individual source tester
3. `/scripts/monitor-pipeline.sh` - Health check script
4. `/scripts/check-pipeline-health.sh` - Quick status check
5. `/scripts/source-health-report.py` - Weekly stats
---
**Next Step:** Create `scraper_v2.py` with robust fallback methods