burmddit/NEXT-STEPS.md
Zeya Phyo f51ac4afa4 Add web admin features + fix scraper & translator
Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
2026-02-26 09:17:50 +00:00


🚀 Burmddit: Next Steps (START HERE)

Created: 2026-02-26
Priority: 🔥 CRITICAL
Status: Action Required


🎯 The Problem

burmddit.com is broken:

  • 0 articles scraped in the last 5 days
  • Stuck at 87 articles (last update: Feb 21)
  • All 8 news sources failing
  • Pipeline runs daily but produces nothing

Root cause: newspaper3k library failures + scraping errors


What I've Done (Last 30 minutes)

1. Research & Analysis

  • Identified all scraper errors from logs
  • Researched 100+ AI news RSS feeds
  • Found 22 high-quality new sources to add

2. Planning Documents Created

  • SCRAPER-IMPROVEMENT-PLAN.md - Detailed technical plan
  • BURMDDIT-TASKS.md - Day-by-day task tracker
  • NEXT-STEPS.md - This file (action plan)

3. Monitoring Scripts Created

  • scripts/check-pipeline-health.sh - Quick health check
  • scripts/source-stats.py - Source performance stats
  • Updated HEARTBEAT.md - Auto-monitoring every 2 hours

🔥 What Needs to Happen Next (Priority Order)

TODAY (Next 4 hours)

1. Install dependencies (5 min)

cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean

2. Create improved scraper (2 hours)

  • File: backend/scraper_v2.py
  • Features:
    • Multi-method extraction (newspaper → trafilatura → beautifulsoup)
    • User agent rotation
    • Better error handling
    • Retry logic with exponential backoff
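The multi-method extraction and retry ideas above can be sketched as a single fallback chain. This is a hypothetical illustration, not the final scraper_v2.py: each extractor here is an injected callable (in the real scraper these would wrap newspaper3k, trafilatura, and BeautifulSoup), and the 200-character floor for a "usable" article body is an assumed threshold.

```python
import random
import time
from typing import Callable, List, Optional

def extract_with_fallback(
    url: str,
    extractors: List[Callable[[str], Optional[str]]],
    retries: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Try each extractor in order; retry transient failures with backoff."""
    for extractor in extractors:
        for attempt in range(retries):
            try:
                text = extractor(url)
                # Assumed heuristic: treat very short bodies as failed extractions.
                if text and len(text.strip()) > 200:
                    return text
                break  # extractor ran but found nothing useful; try the next one
            except Exception:
                # Exponential backoff with jitter before retrying this extractor.
                time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
    return None
```

Because the extractors are passed in, each backend can be tested in isolation before being wired into the chain.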

3. Test individual sources (1 hour)

  • Create test_source.py script
  • Test each of 8 existing sources
  • Identify which ones work
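A possible core for test_source.py, sketched with an injected `fetch_entries` callable standing in for the real RSS fetch (likely feedparser) so the pass/fail logic can be exercised without network access. The function and field names here are assumptions, not existing code.

```python
from typing import Callable, Dict, List

def probe_sources(
    sources: Dict[str, str],                        # name -> feed URL
    fetch_entries: Callable[[str], List[dict]],
) -> Dict[str, dict]:
    """Probe each configured source and record whether it returns articles."""
    results = {}
    for name, url in sources.items():
        try:
            entries = fetch_entries(url)
            results[name] = {"ok": len(entries) > 0, "entries": len(entries), "error": None}
        except Exception as exc:
            results[name] = {"ok": False, "entries": 0, "error": str(exc)}
    return results

def summarize(results: Dict[str, dict]) -> str:
    """One-line report of which sources still work."""
    working = [n for n, r in results.items() if r["ok"]]
    return f"{len(working)}/{len(results)} sources working: {', '.join(sorted(working))}"
```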

4. Update config (10 min)

  • Disable broken sources
  • Keep only working ones
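One way to make "disable broken sources" a config flip rather than a deletion is an `enabled` flag per source. This is a hypothetical shape for backend/config.py, not its current contents; the broken-source entry is a placeholder.

```python
# Hypothetical config shape: each source carries an `enabled` flag so broken
# feeds can be switched off without losing their configuration.
SOURCES = [
    {"name": "OpenAI Blog", "rss": "https://openai.com/blog/rss/", "enabled": True},
    {"name": "Example Broken Source", "rss": "https://example.com/feed", "enabled": False},
]

def active_sources(sources):
    """Return only the sources the pipeline should actually scrape."""
    return [s for s in sources if s["enabled"]]
```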

5. Test run (90 min)

cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
  • Target: At least 10 articles scraped
  • If successful → deploy for tomorrow's cron

TOMORROW (Day 2)

Morning:

  • Check overnight cron results
  • Fix any new errors

Afternoon:

  • Add 5 high-priority new sources:
    • OpenAI Blog
    • Anthropic Blog
    • Hugging Face Blog
    • Google AI Blog
    • MarkTechPost
  • Test evening run (target: 25+ articles)

DAY 3

  • Add remaining 17 new sources (30 total)
  • Full test with all sources
  • Verify monitoring works

DAYS 4-7 (If time permits)

  • Parallel scraping (reduce runtime 90min → 40min)
  • Source health scoring
  • Image extraction improvements
  • Translation quality enhancements
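The parallel-scraping item above could take the following shape: scrape sources concurrently with a thread pool instead of sequentially, which is where most of the 90min → 40min saving would come from since the work is I/O-bound. `scrape_one` is a placeholder for the real per-source job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all_parallel(sources, scrape_one, max_workers=8):
    """Run scrape_one over all sources concurrently; never let one failure
    abort the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, s): s for s in sources}
        for fut in as_completed(futures):
            source = futures[fut]
            try:
                results[source] = fut.result()
            except Exception as exc:
                results[source] = f"error: {exc}"
    return results
```

The worker count would need tuning against the politeness limits of the slower feeds.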

📋 Key Files to Review

Planning Docs

  1. SCRAPER-IMPROVEMENT-PLAN.md - Full technical plan

    • Current issues explained
    • 22 new RSS sources listed
    • Implementation details
    • Success metrics
  2. BURMDDIT-TASKS.md - Task tracker

    • Day-by-day breakdown
    • Checkboxes for tracking progress
    • Daily checklist
    • Success criteria

Code Files (To Be Created)

  1. backend/scraper_v2.py - New scraper (URGENT)
  2. backend/test_source.py - Source tester
  3. scripts/check-pipeline-health.sh - Health monitor (done)
  4. scripts/source-stats.py - Stats reporter (done)

Config Files

  1. backend/config.py - Source configuration
  2. backend/.env - Environment variables (API keys)

🎯 Success Criteria

Immediate (Today)

  • At least 10 articles scraped in test run
  • At least 3 sources working
  • Pipeline completes without crashing

Day 3

  • 30+ sources configured
  • 40+ articles scraped per run
  • <5% error rate

Week 1

  • 30-40 articles published daily
  • 25/30 sources active
  • 95%+ pipeline success rate
  • Automatic monitoring working

🚨 Critical Path

BLOCKER: Scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.

Timeline:

  • Now → +2h: Build scraper_v2.py
  • +2h → +3h: Test sources
  • +3h → +4.5h: Full pipeline test
  • +4.5h: Deploy if successful

If this slips, the website stays broken for another day, which means lost traffic.


📊 New Sources to Add (Top 10)

These are the highest-quality sources to prioritize:

  1. OpenAI Blog - https://openai.com/blog/rss/
  2. Anthropic Blog - https://www.anthropic.com/rss
  3. Hugging Face - https://huggingface.co/blog/feed.xml
  4. Google AI - http://googleaiblog.blogspot.com/atom.xml
  5. MarkTechPost - https://www.marktechpost.com/feed/
  6. The Rundown AI - https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml
  7. Last Week in AI - https://lastweekin.ai/feed
  8. Analytics India Magazine - https://analyticsindiamag.com/feed/
  9. AI News - https://www.artificialintelligence-news.com/feed/rss/
  10. KDnuggets - https://www.kdnuggets.com/feed

(Full list of 22 sources in SCRAPER-IMPROVEMENT-PLAN.md)
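Before wiring any of these feeds in, it's worth sanity-checking that each one actually returns entries. A minimal stdlib sketch of pulling titles out of an RSS 2.0 payload is below; the real pipeline would more likely use feedparser, which also handles Atom feeds such as the Google AI one above.

```python
import xml.etree.ElementTree as ET

def rss_titles(xml_text: str) -> list:
    """Extract <item> titles from an RSS 2.0 document string."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title", "") for item in root.iter("item")]
```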


🤖 Automatic Monitoring

I've set up automatic health checks:

  • Heartbeat monitoring (every 2 hours)

    • Runs scripts/check-pipeline-health.sh
    • Alerts if: zero articles, high errors, or stale pipeline
  • Daily checklist (9 AM Singapore time)

    • Check overnight cron results
    • Review errors
    • Update task tracker
    • Report status

You'll be notified automatically if:

  • Pipeline fails
  • Article count drops below 10
  • Error rate exceeds 50%
  • No run in 36+ hours
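The four notification triggers above reduce to a small predicate, sketched here. The thresholds mirror the bullets (pipeline failure, fewer than 10 articles, error rate above 50%, no run in 36+ hours); the function name and signature are assumptions, not existing code in check-pipeline-health.sh.

```python
def check_alerts(pipeline_ok, article_count, error_rate_pct, hours_since_last_run):
    """Return the list of alert conditions currently triggered (empty = healthy)."""
    alerts = []
    if not pipeline_ok:
        alerts.append("pipeline failed")
    if article_count < 10:
        alerts.append("article count below 10")
    if error_rate_pct > 50:
        alerts.append("error rate above 50%")
    if hours_since_last_run >= 36:
        alerts.append("no run in 36+ hours")
    return alerts
```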

💬 Questions to Decide

  1. Should I start building scraper_v2.py now?

    • Or do you want to review the plan first?
  2. Do you want to add all 22 sources at once, or gradually?

    • Recommendation: Start with top 10, then expand
  3. Should I deploy the fix automatically or ask first?

    • Recommendation: Test first, then ask before deploying
  4. Priority: Speed or perfection?

    • Option A: Quick fix (2-4 hours, basic functionality)
    • Option B: Proper rebuild (1-2 days, all optimizations)

📞 Contact

Owner: Zeya Phyo
Developer: Bob
Deadline: ASAP (ideally today)

Current time: 2026-02-26 08:30 UTC (4:30 PM Singapore)


🚀 Ready to Start?

Recommended action: Let me start building scraper_v2.py now.

Command to kick off:

Yes, start fixing the scraper now

Or if you want to review the plan first:

Show me the technical details of scraper_v2.py first

All planning documents are ready. Just need your go-ahead to execute. 🎯