
# Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite

December 2024 | By Sanjeev Kumar

## TL;DR

I built a complete Reddit scraper suite that requires zero API keys. It comes with a beautiful Streamlit dashboard, a REST API for integration with tools like Grafana and Metabase, a plugin system for post-processing, scheduled scraping, notifications, and much more. Best of all, it's completely open source.

🔗 GitHub: reddit-universal-scraper

## The Problem

If you've ever tried to scrape Reddit data for analysis, research, or just personal projects, you know the pain:

- Reddit's API is heavily rate-limited (especially after the 2023 API changes)
- API keys require approval and are increasingly restricted
- Existing scrapers are often single-purpose: they scrape posts OR comments, not both
- There's no easy way to visualize or analyze the data after scraping
- Running scrapes manually is tedious; you want automation

I decided to solve all of these problems at once.

---

## The Solution: Universal Reddit Scraper Suite

After weeks of development, I created a full-featured scraper:

| Feature | What It Does |
| --- | --- |
| 📊 Full Scraping | Posts, comments, images, videos, galleries: everything |
| 🚫 No API Keys | Uses Reddit's public JSON endpoints and mirrors |
| 📈 Web Dashboard | Beautiful 7-tab Streamlit UI for analysis |
| 🚀 REST API | Connect Metabase, Grafana, DuckDB, and more |
| 🔌 Plugin System | Extensible post-processing (sentiment analysis, deduplication, keywords) |
| 📅 Scheduled Scraping | Cron-style automation |
| 📧 Notifications | Discord & Telegram alerts when scrapes complete |
| 🐳 Docker Ready | One command to deploy anywhere |

---

## Architecture Deep Dive

### How It Works Without API Keys

The secret sauce is in the approach.

Instead of using Reddit's official (and restricted) API, I leverage:

- Reddit's public JSON endpoints: every Reddit page has a .json suffix that returns structured data
- Multiple mirror fallbacks: when one source is rate-limited, the scraper automatically rotates through alternatives like Redlib instances
- Smart rate limiting: built-in delays and cool-down periods to stay under the radar

```python
MIRRORS = [
    "https://old.reddit.com",
    "https://redlib.catsarch.com",
    "https://redlib.vsls.cz",
    "https://r.nf",
    "https://libreddit.northboot.xyz",
    "https://redlib.tux.pizza",
]
```

When one source fails, the scraper automatically tries the next. No manual intervention needed.
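The post doesn't show the fallback logic itself, so here is a minimal sketch of how rotation with a cooldown might work. The `fetch_json` helper and its parameters are illustrative, not the project's actual internals:

```python
# Hypothetical sketch of mirror rotation with a fixed cooldown; the function
# and its parameters are illustrative, not the project's internals.
import time
import requests

MIRRORS = [
    "https://old.reddit.com",
    "https://redlib.catsarch.com",
    "https://redlib.vsls.cz",
]

def fetch_json(path, cooldown=3):
    """Try each mirror in turn until one returns valid JSON."""
    for mirror in MIRRORS:
        try:
            resp = requests.get(
                f"{mirror}{path}",
                headers={"User-Agent": "reddit-scraper-demo"},
                timeout=10,
            )
            if resp.status_code == 200:
                return resp.json()
        except requests.RequestException:
            pass  # network error or rate limit: fall through to the next mirror
        time.sleep(cooldown)  # stay under the radar between attempts
    raise RuntimeError("all mirrors failed")

# Usage: the .json suffix works on any Reddit listing page.
data = fetch_json("/r/python/new.json?limit=25")
```

Note the ordering: the official old.reddit.com endpoint is tried first, and the Redlib mirrors only absorb traffic once it starts failing.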

### The Core Scraping Engine

The scraper operates in three modes:

**Full Mode**: the complete package.

```bash
python main.py --mode full --limit 100
```

This scrapes posts, downloads all media (images, videos, galleries), and fetches comments with their full thread hierarchy.

**History Mode**: fast, metadata-only.

```bash
python main.py --mode history --limit 500
```

Perfect for quickly building a dataset of post metadata without the overhead of media downloads.

**Monitor Mode**: live watching.

```bash
python main.py --mode monitor
```

Continuously checks for new posts every 5 minutes.
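As a rough illustration of what monitor mode does under the hood, here is a polling-loop sketch. The 5-minute interval and the listing shape match Reddit's public JSON; the function itself is my assumption, not the project's code:

```python
# Illustrative polling loop for monitor mode; not the project's actual code.
import time
import requests

def monitor(subreddit, interval=300):
    """Print newly seen posts every `interval` seconds (300s = 5 minutes)."""
    seen = set()
    while True:
        resp = requests.get(
            f"https://old.reddit.com/r/{subreddit}/new.json?limit=25",
            headers={"User-Agent": "reddit-scraper-demo"},
            timeout=10,
        )
        # Reddit listings nest each post under data -> children -> data
        for child in resp.json()["data"]["children"]:
            post = child["data"]
            if post["id"] not in seen:
                seen.add(post["id"])
                print(f"new post: {post['title']}")
        time.sleep(interval)

# monitor("python")  # runs until interrupted
```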

## The Dashboard Experience

One of the standout features is the 7-tab Streamlit dashboard that makes data exploration a joy:

### 📊 Overview Tab

At a glance, see:

- Total posts and comments
- Cumulative score across all posts
- Media post breakdown
- Posts-over-time chart
- Top 10 posts by score

### 📈 Analytics Tab

This is where it gets interesting:

- Sentiment Analysis: run VADER-based sentiment scoring on your entire dataset
- Keyword Cloud: see the most frequently used terms
- Best Posting Times: data-driven insights on when posts get the most engagement

### 🔍 Search Tab

Full-text search across all scraped data, with filters for:

- Minimum score
- Post type (text, image, video, gallery, link)
- Author
- Custom sorting

### 💬 Comments Analysis

- View top-scoring comments
- See who the most active commenters are
- Track comment patterns over time

### ⚙️ Scraper Controls

Start new scrapes right from the dashboard!

Configure:

- Target subreddit/user
- Post limits
- Mode (full/history)
- Media and comment toggles

### 📋 Job History

Full observability into every scrape job:

- Status tracking (running, completed, failed)
- Duration metrics
- Post/comment/media counts
- Error logging

### 🔌 Integrations

Pre-configured instructions for connecting:

- Metabase
- Grafana
- DreamFactory
- DuckDB
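To give a flavor of how a dashboard like this is wired together, here is a rough Streamlit sketch of an Overview-style tab. It assumes posts are loaded from a CSV matching the schema described later in this post; the path is hypothetical, and this is not the actual dashboard code:

```python
# Rough sketch of an Overview-style tab in Streamlit; the CSV path is
# hypothetical and the epoch-seconds timestamp format is an assumption.
import pandas as pd
import streamlit as st

posts = pd.read_csv("data/r_python/posts.csv")  # hypothetical path

# Headline metrics
col1, col2, col3 = st.columns(3)
col1.metric("Total posts", len(posts))
col2.metric("Cumulative score", int(posts["score"].sum()))
col3.metric("Media posts", int(posts["post_type"].isin(["image", "video", "gallery"]).sum()))

# Posts-over-time chart
posts["created_utc"] = pd.to_datetime(posts["created_utc"], unit="s")
st.line_chart(posts.set_index("created_utc").resample("D").size())

# Top 10 posts by score
st.dataframe(posts.nlargest(10, "score")[["title", "score", "num_comments"]])
```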

## The Plugin Architecture

I designed a plugin system to allow extensible post-processing. The architecture is simple but powerful:

```python
class Plugin:
    """Base class for all plugins."""
    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments
```

### Built-in Plugins

**Sentiment Tagger**: analyzes the emotional tone of every post and comment using VADER sentiment analysis:

```python
class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts
```

**Deduplicator**: removes duplicate posts that may appear across multiple scraping sessions.
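The post doesn't include the Deduplicator's code, but a minimal version is easy to picture given the base class above. This sketch assumes each post dict carries the unique `id` field from the data schema; it is not the project's actual implementation:

```python
# Hypothetical sketch of a dedup plugin; assumes posts carry a unique "id".
from plugins import Plugin

class Deduplicator(Plugin):
    name = "deduplicator"
    description = "Drops posts already seen in this batch"

    def process_posts(self, posts):
        seen = set()
        unique = []
        for post in posts:
            if post["id"] not in seen:
                seen.add(post["id"])
                unique.append(post)
        return unique
```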

## Creating Your Own Plugin

Drop a new Python file in the plugins/ directory:

```python
from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts
```

Enable plugins during scraping:

```bash
python main.py --mode full --plugins
```

## REST API for External Integrations

The REST API opens up the scraper to a whole ecosystem of tools:

```bash
python main.py --api
# API at http://localhost:8000
# Docs at http://localhost:8000/docs
```

### Key Endpoints

| Endpoint | Description |
| --- | --- |
| GET /posts | List posts with filters (subreddit, limit, offset) |
| GET /comments | List comments |
| GET /subreddits | All scraped subreddits |
| GET /jobs | Job history |
| GET /query?sql=... | Raw SQL queries for power users |
| GET /grafana/query | Grafana-compatible time-series data |
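Consuming the API from a script is straightforward. In this example the endpoint paths come from the table above, while the exact query-parameter names are my assumption:

```python
# Querying the REST API from Python; endpoint paths are from the table above,
# but the query-parameter names are assumptions.
import requests

BASE = "http://localhost:8000"

# Fetch the 50 most recent posts for one subreddit
posts = requests.get(
    f"{BASE}/posts", params={"subreddit": "python", "limit": 50}, timeout=10
).json()

# Power users can run raw SQL through /query
top = requests.get(
    f"{BASE}/query",
    params={"sql": "SELECT title, score FROM posts ORDER BY score DESC LIMIT 10"},
    timeout=10,
).json()
```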

### Real-World Integration: Grafana Dashboard

1. Install the "JSON API" or "Infinity" plugin in Grafana
2. Add a datasource pointing to http://localhost:8000
3. Use the /grafana/query endpoint for time-series panels

```sql
SELECT date(created_utc) AS time, COUNT(*) AS posts
FROM posts
GROUP BY date(created_utc)
```

Now you have a real-time dashboard tracking Reddit activity!

---

## Scheduled Scraping & Notifications

### Automation Made Easy

Set up recurring scrapes with cron-style scheduling:

```bash
# Scrape every 60 minutes
python main.py --schedule delhi --every 60

# With custom options
python main.py --schedule delhi --every 30 --mode full --limit 50
```

### Get Notified

Configure Discord or Telegram alerts for when scrapes complete:

```bash
# Environment variables
export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"
```

Now you get scrape summaries delivered directly to your preferred platform.
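For a sense of how little a completion alert takes, here is a sketch of the Discord side. The webhook payload shape (`{"content": ...}`) is Discord's documented format; the helper and summary text are illustrative:

```python
# Minimal completion alert via a Discord webhook; the payload shape is
# Discord's documented format, the helper and summary text are illustrative.
import os
import requests

def notify_discord(summary):
    url = os.environ.get("DISCORD_WEBHOOK_URL")
    if not url:
        return  # notifications are optional
    requests.post(url, json={"content": summary}, timeout=10)

notify_discord("✅ Scrape complete: 100 posts, 245 comments from r/delhi")
```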

## Dry Run Mode: Test Before You Commit

One of my favorite features is dry run mode. It simulates the entire scrape without saving any data:

```bash
python main.py --mode full --limit 100 --dry-run
```

Output:

```
🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would scrape: 245 comments
```

Perfect for:

- Testing your scrape configuration
- Estimating data volume before committing
- Debugging without cluttering your dataset
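The pattern behind a flag like this is worth copying into your own tools: thread the flag down to the persistence layer and count instead of write. A sketch of the idea, not the project's actual code:

```python
# Sketch of the dry-run pattern: count everything, write nothing.
def save_posts(posts, dry_run=False):
    if dry_run:
        print(f"🧪 Would scrape: {len(posts)} posts")
        return 0
    write_to_database(posts)  # hypothetical persistence helper
    return len(posts)
```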

## Docker Deployment

### Quick Start

```bash
# Build
docker build -t reddit-scraper .

# Run a scrape
docker run -v ./data:/app/data reddit-scraper python main.py --limit 100

# Run with plugins
docker run -v ./data:/app/data reddit-scraper python main.py --plugins
```

### Full Stack with Docker Compose

```bash
docker-compose up -d
```

This spins up:

- Dashboard at http://localhost:8501
- REST API at http://localhost:8000

### Deploy to Any VPS

```bash
ssh user@your-server-ip
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper
docker-compose up -d
```

Open the firewall:

```bash
sudo ufw allow 8000
sudo ufw allow 8501
```

You now have a production-ready Reddit scraping platform!

## Data Export Options

### CSV (Default)

All scraped data is saved as CSV files:

- data/r_/posts.csv
- data/r_/comments.csv

### Parquet (Analytics-Optimized)

Export to a columnar format for analytics tools:

```bash
python main.py --export-parquet
```

Query it directly with DuckDB:

```python
import duckdb
duckdb.query("SELECT * FROM 'data/parquet/*.parquet'").df()
```

### Database Maintenance

```bash
# Backup
python main.py --backup

# Optimize/vacuum
python main.py --vacuum

# View job history
python main.py --job-history
```

## Data Schema

### Posts Table

| Column | Description |
| --- | --- |
| id | Reddit post ID |
| title | Post title |
| author | Username |
| score | Net upvotes |
| num_comments | Comment count |
| post_type | text/image/video/gallery/link |
| selftext | Post body (for text posts) |
| created_utc | Timestamp |
| permalink | Reddit URL |
| is_nsfw | NSFW flag |
| flair | Post flair |
| sentiment_score | -1.0 to 1.0 (with plugins) |

### Comments Table

| Column | Description |
| --- | --- |
| comment_id | Comment ID |
| post_permalink | Parent post URL |
| author | Username |
| body | Comment text |
| score | Upvotes |
| depth | Nesting level |
| is_submitter | Whether the author is the OP |

A short analysis example using this schema appears at the end of this section.

## Use Cases

**Academic Research**

- Analyze subreddit community dynamics
- Track sentiment over time during events
- Study user engagement patterns

**Market Research**

- Monitor brand mentions
- Track product feedback
- Identify emerging trends

**Content Creation**

- Find popular topics in your niche
- Analyze what makes posts go viral
- Discover optimal posting times

**Data Journalism**

- Archive discussions around breaking news
- Analyze public sentiment during events
- Track narrative evolution

**Personal Projects**

- Build a dataset for ML training
- Create Reddit-based recommendation systems
- Archive communities you care about

---

## Performance Considerations

### Respect Reddit's Servers

The scraper includes built-in delays:

- A 3-second cooldown between API requests
- A 30-second wait if all mirrors fail
- Automatic mirror rotation to distribute load

### Optimize Your Scrapes

- Use --mode history for faster, metadata-only scrapes
- Use --no-media if you don't need images/videos
- Use --no-comments for post-only data

### Handle Large Datasets

- Parquet export for analytics queries
- SQLite database for structured storage
- Automatic deduplication to avoid bloat
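To close the loop on what the exported data is good for, here is a small pandas example over the posts schema documented above. The CSV path is hypothetical, and the epoch-seconds timestamp format is an assumption:

```python
# Example analysis over the documented posts schema; the CSV path is
# hypothetical and epoch-seconds timestamps are an assumption.
import pandas as pd

posts = pd.read_csv("data/r_python/posts.csv")
posts["created_utc"] = pd.to_datetime(posts["created_utc"], unit="s")

# Best posting times: average score by hour of day
by_hour = posts.groupby(posts["created_utc"].dt.hour)["score"].mean()
print(by_hour.sort_values(ascending=False).head(5))

# Sentiment distribution (only present if the sentiment_tagger plugin ran)
if "sentiment_label" in posts.columns:
    print(posts["sentiment_label"].value_counts())
```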

---

## What's Next?

### Roadmap

I'm actively developing new features:

- ☐ Async scraping for even faster data collection
- ☐ Multi-subreddit monitoring in a single command
- ☐ Email notifications in addition to Discord/Telegram
- ☐ Cloud deployment templates (AWS, GCP, Azure)
- ☐ Web-based scraper configuration (no CLI needed)

---

## Getting Started

### Prerequisites

- Python 3.10+
- pip

### Installation

```bash
# Clone the repo
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper

# Install dependencies
pip install -r requirements.txt

# Your first scrape
python main.py --mode full --limit 50

# Launch the dashboard
python main.py --dashboard
```

That's it!

You're now scraping Reddit like a pro.

## Contributing

This is an open-source project and contributions are welcome, whether it's:

- Bug fixes
- New plugins
- Documentation improvements
- Feature suggestions

Open an issue or submit a PR on GitHub. If you found this useful, consider giving the project a ⭐ on GitHub!

## Connect

- GitHub: @ksanjeev284
- Project: reddit-universal-scraper
