
Reddit Data Scraping with Python: A Simple Approach

5 min read

Introduction

As of March 2022, Reddit ranks as the 9th most-visited website globally and the 6th most-visited in the US, according to Wikipedia.

What makes Reddit particularly interesting is its community-driven structure with subreddits catering to diverse interests—from programming and technology to funny and countless other topics.

There are several ways to obtain data from Reddit. This post demonstrates one of the simplest: a method that doesn’t require API keys.

The Basic Method: JSON Endpoints

The easiest way to extract Reddit data is to append .json to any Reddit URL. For example, instead of visiting https://www.reddit.com/r/programming, you can request https://www.reddit.com/r/programming.json to get the data in JSON format.
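
To see what this returns, here is a minimal sketch in Python (the User-Agent string is just an illustrative value; Reddit often rejects the default one sent by the requests library):

import requests

# Minimal sketch: fetch r/programming as JSON by appending ".json" to the URL.
response = requests.get(
    "https://www.reddit.com/r/programming.json",
    headers={"User-Agent": "reddit-scraper-demo/0.1"},  # illustrative value
    timeout=10,
)
response.raise_for_status()
listing = response.json()
print(listing["kind"])                   # "Listing"
print(len(listing["data"]["children"]))  # number of posts in the listing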

This approach has several advantages:

- No API key, OAuth setup, or registration is required.
- It works for virtually any public Reddit URL, including subreddit listings and individual posts.
- The response is structured JSON, which is straightforward to parse programmatically.

However, be aware that you may encounter rate limiting if you make excessive requests.

Here’s what the JSON response looks like:

Figure 1: JSON output from a simple request using the .json extension

Building a Python Scraper

With this knowledge, we can create a Python script to automate the data extraction. While you could use any language, Python is particularly well-suited for this task due to its excellent HTTP libraries and JSON handling.

Here’s a complete implementation:

import json

import requests

REDDIT_URL: str = "https://www.reddit.com/r/programming.json?limit=100"


def process():
    # Create headers to mimic a real browser
    headers = {
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-User": "?1",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9"
    }

    # Create a session with our headers
    session = requests.Session()
    session.headers = headers

    try:
        response = session.get(REDDIT_URL)
        response.raise_for_status()  # Raise an exception for bad status codes
    except requests.RequestException as e:
        print(f"Error fetching data: {e}")
        return

    # Extract posts from the JSON response
    # The structure is: response.json()['data']['children']
    posts = response.json()['data']['children']

    results = []
    urls = []  # We'll use these later to fetch comments

    print(f"Processing {len(posts)} posts...")
    for index, post in enumerate(posts, 1):
        print(f"Processing post {index} of {len(posts)}")
        post_data = post['data']
        result = {
            "title": post_data['title'],
            "permalink": post_data['permalink'],
            "name": post_data['name'],
            "created": post_data['created_utc'],
            "selftext": post_data['selftext']
        }
        # Build the URL for fetching comments later
        comment_url = f"https://www.reddit.com{post_data['permalink']}.json"
        urls.append(comment_url)
        results.append(result)

    print(json.dumps(results, indent=4))
    return results, urls


if __name__ == "__main__":
    process()

Sample Output

When you run this script, you’ll get output like this:

[
    {
        "title": "NVIDIA Security Team: \"What if we just stopped using C?\" (This is not about Rust)",
        "permalink": "/r/programming/comments/yoisjn/nvidia_security_team_what_if_we_just_stopped/",
        "name": "t3_yoisjn",
        "created": 1667817127.0,
        "selftext": ""
    }
]
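
The created field is a Unix timestamp (seconds since the epoch, in UTC). Converting it to a readable date is straightforward with the standard library:

from datetime import datetime, timezone

# Convert Reddit's created_utc value (Unix epoch seconds) to a UTC datetime.
created = datetime.fromtimestamp(1667817127.0, tz=timezone.utc)
print(created.isoformat())  # 2022-11-07T10:32:07+00:00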

Fetching Comments

The main listing provides permalinks to individual posts. You can use these to fetch comments by making additional requests. Here’s how to extend the script to fetch comments asynchronously:

from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed


def fetch_comments(urls, headers):
    submissions = []
    with FuturesSession(max_workers=30) as session:
        session.headers = headers
        futures = [session.get(url) for url in urls]
        for future in as_completed(futures):
            try:
                response = future.result()
                response.raise_for_status()

                # Extract the post title
                post_data = response.json()[0]["data"]["children"][0]["data"]
                title = post_data["title"]

                # Extract the comments
                comments_data = response.json()[1]["data"]["children"]
                replies = []
                for comment in comments_data:
                    if comment["kind"] == "t1":  # Regular comment
                        body = comment["data"]["body"]
                        replies.append(body)

                submission = {
                    "title": title,
                    "replies": replies
                }
                submissions.append(submission)
            except Exception as e:
                print(f"Error processing comment: {e}")
                continue
    return submissions
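
To tie the two pieces together, the URLs collected by process() can be passed straight into fetch_comments(). Here is a minimal sketch of that wiring, assuming both functions live in the same module (the headers dict below is a simplified stand-in for the browser-like headers used earlier):

import json

if __name__ == "__main__":
    headers = {"User-Agent": "reddit-scraper-demo/0.1"}  # simplified stand-in
    fetched = process()
    if fetched is not None:  # process() returns None if the request failed
        results, urls = fetched
        submissions = fetch_comments(urls, headers)
        print(json.dumps(submissions, indent=4))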

Sample Comment Output

Here’s what the comment data looks like:

[
    {
        "title": "NVIDIA Security Team: \"What if we just stopped using C?\" (This is not about Rust)",
        "replies": [
            "That's pretty cool.\n\nThough I find it fascinating they didn't go for the low-hanging fruit of the way they do UI<->driver interactions and how many layers of vulnerabilities come from that, nevermind how \"heavyweight\" it all is.\n\nBut as far as the backend goes, that's a damn cool change, especially that it was accepted so well.",
            "> What if we just stopped using C?\n\n> #504 Gateway Time-out\n\nis this some elaborate shitpost that is flying over my head?",
            "I think reddit hugged it to death.",
            "For those mentioning Rust. Nvidia do know about Rust, just in case you thought that they hadn't heard all about it. Some considerations like lacking integer overflow protection is talked about here.\n\nhttps://youtu.be/TcIaZ9LW1WE\n\nOn a personal note. I love Ada's readability. Some see readability as meaning brevity. I do not."
        ]
    }
]

Important Considerations

Rate Limiting

This method doesn’t require an API key, but Reddit may rate-limit you if you make too many requests too quickly. Consider:

- Adding a short delay between requests instead of firing them all at once.
- Sending a descriptive User-Agent header so your traffic is identifiable.
- Backing off and retrying when you receive an HTTP 429 (Too Many Requests) response, as sketched below.
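
A minimal back-off sketch (not part of the original script; the retry count and delay values are arbitrary):

import time

import requests


def get_with_backoff(session: requests.Session, url: str, retries: int = 3, delay: float = 2.0):
    # Retry a GET request, waiting progressively longer whenever Reddit answers with HTTP 429.
    for attempt in range(retries):
        response = session.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(delay * (attempt + 1))  # back off a little more after each 429
    response.raise_for_status()  # every attempt was rate-limited; surface the error
    return response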

Data Storage

For production use, consider:

- Persisting results to a file or database instead of printing them to stdout, as sketched below.
- Deduplicating posts by their name field (e.g. t3_yoisjn), which uniquely identifies each submission.
- Recording when each post was fetched so stale data can be refreshed later.
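
As a starting point, the scraped posts could be written to disk instead of printed. A small sketch (the file name is arbitrary):

import json


def save_results(results, path: str = "reddit_posts.json"):
    # Persist the list of post dicts produced by process() to a JSON file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=4, ensure_ascii=False)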

Complete Implementation

You can find the complete, production-ready implementation in this GitHub Gist.

Conclusion

This method is one of the simplest ways to extract Reddit data without requiring API keys. While it is fine for basic needs and learning purposes, consider using Reddit’s official API for production applications that require reliability and higher rate limits.

Remember to be respectful of Reddit’s resources and always follow their terms of service when scraping data.

