burmddit/NEXT-STEPS.md
Zeya Phyo f51ac4afa4 Add web admin features + fix scraper & translator
Frontend changes:
- Add /admin dashboard for article management
- Add AdminButton component (Alt+Shift+A on articles)
- Add /api/admin/article API endpoints

Backend improvements:
- scraper_v2.py: Multi-layer fallback extraction (newspaper → trafilatura → readability)
- translator_v2.py: Better chunking, repetition detection, validation
- admin_tools.py: CLI admin commands
- test_scraper.py: Individual source testing

Docs:
- WEB-ADMIN-GUIDE.md: Web admin usage
- ADMIN-GUIDE.md: CLI admin usage
- SCRAPER-IMPROVEMENT-PLAN.md: Scraper fixes details
- TRANSLATION-FIX.md: Translation improvements
- ADMIN-FEATURES-SUMMARY.md: Implementation summary

Fixes:
- Article scraping from 0 → 96+ articles working
- Translation quality issues (repetition, truncation)
- Added 13 new RSS sources
2026-02-26 09:17:50 +00:00


🚀 Burmddit: Next Steps (START HERE)

Created: 2026-02-26
Priority: 🔥 CRITICAL
Status: Action Required


🎯 The Problem

burmddit.com is broken:

  • 0 articles scraped in the last 5 days
  • Stuck at 87 articles (last update: Feb 21)
  • All 8 news sources failing
  • Pipeline runs daily but produces nothing

Root cause: newspaper3k library failures + scraping errors


What I've Done (Last 30 minutes)

1. Research & Analysis

  • Identified all scraper errors from logs
  • Researched 100+ AI news RSS feeds
  • Found 22 high-quality new sources to add

2. Planning Documents Created

  • SCRAPER-IMPROVEMENT-PLAN.md - Detailed technical plan
  • BURMDDIT-TASKS.md - Day-by-day task tracker
  • NEXT-STEPS.md - This file (action plan)

3. Monitoring Scripts Created

  • scripts/check-pipeline-health.sh - Quick health check
  • scripts/source-stats.py - Source performance stats
  • Updated HEARTBEAT.md - Auto-monitoring every 2 hours

🔥 What Needs to Happen Next (Priority Order)

TODAY (Next 4 hours)

1. Install dependencies (5 min)

cd /home/ubuntu/.openclaw/workspace/burmddit/backend
pip3 install trafilatura readability-lxml fake-useragent lxml_html_clean

2. Create improved scraper (2 hours)

  • File: backend/scraper_v2.py
  • Features:
    • Multi-method extraction (newspaper → trafilatura → beautifulsoup)
    • User agent rotation
    • Better error handling
    • Retry logic with exponential backoff
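The multi-method extraction and retry ideas above can be sketched as a single fallback chain. This is a hypothetical illustration, not the final scraper_v2.py: each extractor here is an injected callable (in the real scraper these would wrap newspaper3k, trafilatura, and BeautifulSoup), and the 200-character floor for a "usable" article body is an assumed threshold.

```python
import random
import time
from typing import Callable, List, Optional

def extract_with_fallback(
    url: str,
    extractors: List[Callable[[str], Optional[str]]],
    retries: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Try each extractor in order; retry transient failures with backoff."""
    for extractor in extractors:
        for attempt in range(retries):
            try:
                text = extractor(url)
                # Assumed heuristic: treat very short bodies as failed extractions.
                if text and len(text.strip()) > 200:
                    return text
                break  # extractor ran but found nothing useful; try the next one
            except Exception:
                # Exponential backoff with jitter before retrying this extractor.
                time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
    return None
```

Because the extractors are passed in, each backend can be tested in isolation before being wired into the chain.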

3. Test individual sources (1 hour)

  • Create test_source.py script
  • Test each of 8 existing sources
  • Identify which ones work
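A possible core for test_source.py, sketched with an injected `fetch_entries` callable standing in for the real RSS fetch (likely feedparser) so the pass/fail logic can be exercised without network access. The function and field names here are assumptions, not existing code.

```python
from typing import Callable, Dict, List

def probe_sources(
    sources: Dict[str, str],                        # name -> feed URL
    fetch_entries: Callable[[str], List[dict]],
) -> Dict[str, dict]:
    """Probe each configured source and record whether it returns articles."""
    results = {}
    for name, url in sources.items():
        try:
            entries = fetch_entries(url)
            results[name] = {"ok": len(entries) > 0, "entries": len(entries), "error": None}
        except Exception as exc:
            results[name] = {"ok": False, "entries": 0, "error": str(exc)}
    return results

def summarize(results: Dict[str, dict]) -> str:
    """One-line report of which sources still work."""
    working = [n for n, r in results.items() if r["ok"]]
    return f"{len(working)}/{len(results)} sources working: {', '.join(sorted(working))}"
```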

4. Update config (10 min)

  • Disable broken sources
  • Keep only working ones
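One way to make "disable broken sources" a config flip rather than a deletion is an `enabled` flag per source. This is a hypothetical shape for backend/config.py, not its current contents; the broken-source entry is a placeholder.

```python
# Hypothetical config shape: each source carries an `enabled` flag so broken
# feeds can be switched off without losing their configuration.
SOURCES = [
    {"name": "OpenAI Blog", "rss": "https://openai.com/blog/rss/", "enabled": True},
    {"name": "Example Broken Source", "rss": "https://example.com/feed", "enabled": False},
]

def active_sources(sources):
    """Return only the sources the pipeline should actually scrape."""
    return [s for s in sources if s["enabled"]]
```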

5. Test run (90 min)

cd /home/ubuntu/.openclaw/workspace/burmddit/backend
python3 run_pipeline.py
  • Target: At least 10 articles scraped
  • If successful → deploy for tomorrow's cron

TOMORROW (Day 2)

Morning:

  • Check overnight cron results
  • Fix any new errors

Afternoon:

  • Add 5 high-priority new sources:
    • OpenAI Blog
    • Anthropic Blog
    • Hugging Face Blog
    • Google AI Blog
    • MarkTechPost
  • Test evening run (target: 25+ articles)

DAY 3

  • Add remaining 17 new sources (30 total)
  • Full test with all sources
  • Verify monitoring works

DAYS 4-7 (If time permits)

  • Parallel scraping (reduce runtime 90min → 40min)
  • Source health scoring
  • Image extraction improvements
  • Translation quality enhancements
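The parallel-scraping item above could take the following shape: scrape sources concurrently with a thread pool instead of sequentially, which is where most of the 90min → 40min saving would come from since the work is I/O-bound. `scrape_one` is a placeholder for the real per-source job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all_parallel(sources, scrape_one, max_workers=8):
    """Run scrape_one over all sources concurrently; never let one failure
    abort the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, s): s for s in sources}
        for fut in as_completed(futures):
            source = futures[fut]
            try:
                results[source] = fut.result()
            except Exception as exc:
                results[source] = f"error: {exc}"
    return results
```

The worker count would need tuning against the politeness limits of the slower feeds.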

📋 Key Files to Review

Planning Docs

  1. SCRAPER-IMPROVEMENT-PLAN.md - Full technical plan

    • Current issues explained
    • 22 new RSS sources listed
    • Implementation details
    • Success metrics
  2. BURMDDIT-TASKS.md - Task tracker

    • Day-by-day breakdown
    • Checkboxes for tracking progress
    • Daily checklist
    • Success criteria

Code Files (To Be Created)

  1. backend/scraper_v2.py - New scraper (URGENT)
  2. backend/test_source.py - Source tester
  3. scripts/check-pipeline-health.sh - Health monitor (done)
  4. scripts/source-stats.py - Stats reporter (done)

Config Files

  1. backend/config.py - Source configuration
  2. backend/.env - Environment variables (API keys)

🎯 Success Criteria

Immediate (Today)

  • At least 10 articles scraped in test run
  • At least 3 sources working
  • Pipeline completes without crashing

Day 3

  • 30+ sources configured
  • 40+ articles scraped per run
  • <5% error rate

Week 1

  • 30-40 articles published daily
  • 25/30 sources active
  • 95%+ pipeline success rate
  • Automatic monitoring working

🚨 Critical Path

BLOCKER: Scraper must be fixed TODAY for tomorrow's 1 AM UTC cron run.

Timeline:

  • Now → +2h: Build scraper_v2.py
  • +2h → +3h: Test sources
  • +3h → +4.5h: Full pipeline test
  • +4.5h: Deploy if successful

If this slips, the website stays broken for another day, which means lost traffic.


📊 New Sources to Add (Top 10)

These are the highest-quality sources to prioritize:

  1. OpenAI Blog - https://openai.com/blog/rss/
  2. Anthropic Blog - https://www.anthropic.com/rss
  3. Hugging Face - https://huggingface.co/blog/feed.xml
  4. Google AI - http://googleaiblog.blogspot.com/atom.xml
  5. MarkTechPost - https://www.marktechpost.com/feed/
  6. The Rundown AI - https://rss.beehiiv.com/feeds/2R3C6Bt5wj.xml
  7. Last Week in AI - https://lastweekin.ai/feed
  8. Analytics India Magazine - https://analyticsindiamag.com/feed/
  9. AI News - https://www.artificialintelligence-news.com/feed/rss/
  10. KDnuggets - https://www.kdnuggets.com/feed

(Full list of 22 sources in SCRAPER-IMPROVEMENT-PLAN.md)
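Before wiring any of these feeds in, it's worth sanity-checking that each one actually returns entries. A minimal stdlib sketch of pulling titles out of an RSS 2.0 payload is below; the real pipeline would more likely use feedparser, which also handles Atom feeds such as the Google AI one above.

```python
import xml.etree.ElementTree as ET

def rss_titles(xml_text: str) -> list:
    """Extract <item> titles from an RSS 2.0 document string."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title", "") for item in root.iter("item")]
```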


🤖 Automatic Monitoring

I've set up automatic health checks:

  • Heartbeat monitoring (every 2 hours)

    • Runs scripts/check-pipeline-health.sh
    • Alerts if: zero articles, high errors, or stale pipeline
  • Daily checklist (9 AM Singapore time)

    • Check overnight cron results
    • Review errors
    • Update task tracker
    • Report status

You'll be notified automatically if:

  • Pipeline fails
  • Article count drops below 10
  • Error rate exceeds 50%
  • No run in 36+ hours
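The four notification triggers above reduce to a small predicate, sketched here. The thresholds mirror the bullets (pipeline failure, fewer than 10 articles, error rate above 50%, no run in 36+ hours); the function name and signature are assumptions, not existing code in check-pipeline-health.sh.

```python
def check_alerts(pipeline_ok, article_count, error_rate_pct, hours_since_last_run):
    """Return the list of alert conditions currently triggered (empty = healthy)."""
    alerts = []
    if not pipeline_ok:
        alerts.append("pipeline failed")
    if article_count < 10:
        alerts.append("article count below 10")
    if error_rate_pct > 50:
        alerts.append("error rate above 50%")
    if hours_since_last_run >= 36:
        alerts.append("no run in 36+ hours")
    return alerts
```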

💬 Questions to Decide

  1. Should I start building scraper_v2.py now?

    • Or do you want to review the plan first?
  2. Do you want to add all 22 sources at once, or gradually?

    • Recommendation: Start with top 10, then expand
  3. Should I deploy the fix automatically or ask first?

    • Recommendation: Test first, then ask before deploying
  4. Priority: Speed or perfection?

    • Option A: Quick fix (2-4 hours, basic functionality)
    • Option B: Proper rebuild (1-2 days, all optimizations)

📞 Contact

Owner: Zeya Phyo
Developer: Bob
Deadline: ASAP (ideally today)

Current time: 2026-02-26 08:30 UTC (4:30 PM Singapore)


🚀 Ready to Start?

Recommended action: Let me start building scraper_v2.py now.

Command to kick off:

Yes, start fixing the scraper now

Or if you want to review the plan first:

Show me the technical details of scraper_v2.py first

All planning documents are ready. Just need your go-ahead to execute. 🎯