Reddit is a forum-style social media platform used for content discovery, making friends, career advice, news aggregation, and more. The platform holds a trove of valuable data, from niche interests to industry-specific insights, which makes it very useful for researchers, analysts, and marketers: they can extract Reddit data to gain competitive insights and make their businesses more sustainable. In this comprehensive blog post, you will learn how to extract data from Reddit.
Reddit hosts a diverse range of interest-based communities covering almost every subject. Scraped Reddit data can be used for the following purposes:
Reddit is packed with ideas and inspiration for creating content. It is a hidden gem that content writers can leverage by scraping data: they can surface trending topics and discussions and turn them into concise, engaging content.
Reddit is also a place where people share their opinions and emotions about countless topics. You can scrape subreddits and run sentiment analysis to learn whether the feelings about your services, products, or brand are positive, negative, or neutral.
Scraping Reddit data provides comprehensive insights into customer preferences and needs, giving a clear picture of the current market. It can also support competitive analysis by extracting competitors’ data.
You can extract numerous types of data from Reddit, including posts, comments, subreddits, and user profiles, and there are several ways to collect it:
Reddit offers a robust API that lets developers pull the data they want programmatically. Accessed with OAuth credentials, Reddit’s API is one of the most reliable and safe ways to use its data.
The Python Reddit API Wrapper (PRAW) is a Python library that provides easy access to Reddit’s API, making it simple to scrape data and handle a large volume of queries.
Web scraping suits those who need customized data collection. With web scraping, extracting comments, subreddits, posts, and profiles becomes straightforward. However, it is good practice to follow Reddit’s terms and conditions.
Organizations that don’t wish to build their own infrastructure from scratch can use professional data providers, which offer ready-to-use Reddit data scrapers.
There are many ways to extract Reddit posts, subreddits, and profiles, but we will do it with Python, a widely used programming language with excellent libraries for web scraping. We will use PRAW to scrape Reddit data.
First of all, install PRAW by running the command below at the command prompt:
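pip install praw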
Next, create a Reddit app from your Reddit account’s app preferences page. Take note of the app’s client_id, client_secret, and user_agent values; we will use them with Python and PRAW to connect to Reddit.
Now it’s time to create a PRAW instance and connect it to Reddit. There are two types of PRAW instances: 1) a read-only instance, which can only fetch publicly available data, and 2) an authorized instance, which can also act on behalf of your account (for example, posting or voting).
Write the following code in a Python editor:
import praw

# Read-only instance: enough for fetching public data
reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # a descriptive user agent

# Authorized instance: also allows actions on behalf of the account
reddit_authorized = praw.Reddit(client_id="",      # your client id
                                client_secret="",  # your client secret
                                user_agent="",     # a descriptive user agent
                                username="",       # your Reddit username
                                password="")       # your Reddit password
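To confirm which mode an instance is in, PRAW exposes a read_only flag; a quick optional check looks like this:

print(reddit_read_only.read_only)   # True when no username/password is supplied
print(reddit_authorized.read_only)  # False once valid account credentials are set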
Great! We have now created PRAW instances. The next step is to utilize Reddit’s API to extract data. In this blog, we will use the read-only instance.
There are many ways to extract Reddit subreddits; the one that we are using will categorize subreddit posts as new, top, controversial, hot, and more.
Write the following code:
import praw
import pandas as pd

reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # a descriptive user agent

subreddit = reddit_read_only.subreddit("redditdev")

print("Display Name:", subreddit.display_name)
print("Title:", subreddit.title)
print("Description:", subreddit.description)
Now, we will extract 3 posts from the subreddit:
subreddit = reddit_read_only.subreddit("Python")

for post in subreddit.hot(limit=3):
    print(post.title)
    print()
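The same read-only instance supports the other listings mentioned earlier; for example, a minimal sketch for the new and controversial feeds (these are standard PRAW listing methods):

for post in subreddit.new(limit=3):            # newest submissions
    print(post.title)

for post in subreddit.controversial(time_filter="week", limit=3):  # most controversial this week
    print(post.title)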
Once that works, we will save the posts in a pandas DataFrame:
posts = subreddit.top(time_filter="month")

posts_dict = {"Title": [], "Post Text": [],
              "ID": [], "Score": [],
              "Total Comments": [], "Post URL": []
              }

for post in posts:
    posts_dict["Title"].append(post.title)
    posts_dict["Post Text"].append(post.selftext)
    posts_dict["ID"].append(post.id)
    posts_dict["Score"].append(post.score)
    posts_dict["Total Comments"].append(post.num_comments)
    posts_dict["Post URL"].append(post.url)

top_posts = pd.DataFrame(posts_dict)
top_posts
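From here the DataFrame can be inspected or exported like any other; for example, writing it to a CSV file (the filename is only an illustration):

top_posts.to_csv("top_posts.csv", index=False)  # export the scraped posts for later analysis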
To extract comments from a Reddit post, you will need the post URL. After getting the URL, create a submission object as shown in the following code snippet.
import praw
import pandas as pd

reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # a descriptive user agent

url = "https://www.reddit.com/r/IAmA/comments/m8n4vt/im_bill_gates_cochair_of_the_bill_and_melinda/"
submission = reddit_read_only.submission(url=url)
Here we will extract the comments from the chosen post using the MoreComments object available in the PRAW module. We iterate over the submission’s comments with a for loop and append each comment body to the post_comments list. Some items in a comment tree are not comments at all but MoreComments objects, which stand in for the "load more comments" links on the page; the if statement inside the loop skips them so that only actual comments are collected. Finally, we convert the extracted list into a pandas DataFrame.
from praw.models import MoreComments

post_comments = []

for comment in submission.comments:
    if isinstance(comment, MoreComments):
        continue
    post_comments.append(comment.body)

comments_df = pd.DataFrame(post_comments, columns=['comment'])
comments_df
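Note that skipping MoreComments objects drops the replies hidden behind "load more comments" links. If you need the full comment tree, PRAW’s replace_more method can expand those stubs first; a minimal sketch (this may trigger many extra API requests on large threads):

submission.comments.replace_more(limit=None)                   # expand every "load more comments" stub
all_comments = [c.body for c in submission.comments.list()]    # flattened list of all comment bodies
print(len(all_comments))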
In this section, we will see how Reddit profile pages can be scraped. For extracting posts from profile pages, we will use old.reddit.com, as it has a simpler HTML structure and fewer ads, which makes the profile scraping task easier.
We will use the following URL as an example:
https://old.reddit.com/user/scraping_intelligence/submitted?count=25&after=t3_191n6zm
The count parameter in the URL above controls how many results are rendered on the HTML page, while the after parameter is the pagination cursor: the ID (fullname) of the post to start after.
Let’s integrate our logic into Python code:
import json
import asyncio
from typing import List, Dict, Literal, Optional
from datetime import datetime

from httpx import AsyncClient, Response
from loguru import logger as log
from parsel import Selector

# shared async HTTP client; a browser-like User-Agent helps avoid basic blocking
client = AsyncClient(
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
    follow_redirects=True,
)
def parse_user_posts(response: Response) -> Dict:
    """parse user posts from a user profile page"""
    selector = Selector(response.text)
    data = []
    for box in selector.xpath("//div[@id='siteTable']/div[contains(@class, 'thing')]"):
        author = box.xpath("./@data-author").get()
        link = box.xpath("./@data-permalink").get()
        publishing_date = box.xpath("./@data-timestamp").get()
        publishing_date = datetime.fromtimestamp(int(publishing_date) / 1000.0).strftime('%Y-%m-%dT%H:%M:%S.%f%z') if publishing_date else None
        comment_count = box.xpath("./@data-comments-count").get()
        post_score = box.xpath("./@data-score").get()
        data.append({
            "authorId": box.xpath("./@data-author-fullname").get(),
            "author": author,
            "authorProfile": "https://www.reddit.com/user/" + author if author else None,
            "postId": box.xpath("./@data-fullname").get(),
            "postLink": "https://www.reddit.com" + link if link else None,
            "postTitle": box.xpath(".//p[@class='title']/a/text()").get(),
            "postSubreddit": box.xpath("./@data-subreddit-prefixed").get(),
            "publishingDate": publishing_date,
            "commentCount": int(comment_count) if comment_count else None,
            "postScore": int(post_score) if post_score else None,
            "attachmentType": box.xpath("./@data-type").get(),
            "attachmentLink": box.xpath("./@data-url").get(),
        })
    next_page_url = selector.xpath("//span[@class='next-button']/a/@href").get()
    return {"data": data, "url": next_page_url}
async def scrape_user_posts(username: str, sort: Literal["new", "top", "controversial"], max_pages: Optional[int] = None) -> List[Dict]:
    """scrape a user's submitted posts, following pagination"""
    url = f"https://old.reddit.com/user/{username}/submitted/?sort={sort}"
    response = await client.get(url)
    data = parse_user_posts(response)
    post_data, next_page_url = data["data"], data["url"]
    while next_page_url and (max_pages is None or max_pages > 0):
        response = await client.get(next_page_url)
        data = parse_user_posts(response)
        next_page_url = data["url"]
        post_data.extend(data["data"])
        if max_pages is not None:
            max_pages -= 1
    log.success(f"scraped {len(post_data)} posts from the {username} reddit profile")
    return post_data
Here, we define two functions for extracting Reddit profile posts: 1) parse_user_posts() and 2) scrape_user_posts().
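To run the scraper, drive the async function with asyncio; a minimal sketch (the username comes from the example URL above, and the page limit is arbitrary):

if __name__ == "__main__":
    user_posts = asyncio.run(scrape_user_posts("scraping_intelligence", sort="new", max_pages=2))
    print(json.dumps(user_posts[:2], indent=2))   # preview the first two scraped posts as JSON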
We can scrape profile comments in a similar way to the profile posts we extracted above; we only need to change the starting URL and the parsing logic:
def parse_user_comments(response: Response) -> Dict:
    """parse user comments from a user profile page"""
    selector = Selector(response.text)
    data = []
    for box in selector.xpath("//div[@id='siteTable']/div[contains(@class, 'thing')]"):
        author = box.xpath("./@data-author").get()
        link = box.xpath("./@data-permalink").get()
        parent_link = box.xpath(".//p[@class='parent']/a[@class='title']/@href").get()
        dislikes = box.xpath(".//span[contains(@class, 'dislikes')]/@title").get()
        upvotes = box.xpath(".//span[contains(@class, 'likes')]/@title").get()
        downvotes = box.xpath(".//span[contains(@class, 'unvoted')]/@title").get()
        data.append({
            "authorId": box.xpath("./@data-author-fullname").get(),
            "author": author,
            "authorProfile": "https://www.reddit.com/user/" + author if author else None,
            "commentId": box.xpath("./@data-fullname").get(),
            "commentLink": "https://www.reddit.com" + link if link else None,
            "commentBody": "".join(box.xpath(".//div[contains(@class, 'usertext-body')]/div/p/text()").getall()).replace("\n", ""),
            "attachedCommentLinks": box.xpath(".//div[contains(@class, 'usertext-body')]/div/p/a/@href").getall(),
            "publishingDate": box.xpath(".//time/@datetime").get(),
            "dislikes": int(dislikes) if dislikes else None,
            "upvotes": int(upvotes) if upvotes else None,
            "downvotes": int(downvotes) if downvotes else None,
            "replyTo": {
                "postTitle": box.xpath(".//p[@class='parent']/a[@class='title']/text()").get(),
                "postLink": "https://www.reddit.com" + parent_link if parent_link else None,
                "postAuthor": box.xpath(".//p[@class='parent']/a[contains(@class, 'author')]/text()").get(),
                "postSubreddit": box.xpath("./@data-subreddit-prefixed").get(),
            }
        })
    next_page_url = selector.xpath("//span[@class='next-button']/a/@href").get()
    return {"data": data, "url": next_page_url}
async def scrape_user_comments(username: str, sort: Literal["new", "top", "controversial"], max_pages: Optional[int] = None) -> List[Dict]:
    """scrape a user's comments, paginating the same way as scrape_user_posts"""
    comment_data, next_page_url = [], f"https://old.reddit.com/user/{username}/comments/?sort={sort}"
    while next_page_url and (max_pages is None or max_pages > 0):
        data = parse_user_comments(await client.get(next_page_url))
        comment_data.extend(data["data"])
        next_page_url = data["url"]
        max_pages = max_pages - 1 if max_pages is not None else None
    return comment_data
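As before, the async function can be driven with asyncio.run; a quick usage sketch with the same example username:

user_comments = asyncio.run(scrape_user_comments("scraping_intelligence", sort="new", max_pages=1))
print(json.dumps(user_comments[:2], indent=2))   # preview the first two scraped comments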
Scraping Reddit posts, subreddits, and profiles can help businesses in many ways: content writers can find ideas for effective and engaging content, analysts can run sentiment analysis to learn how people feel about products, services, and brands, and market researchers can study competitors’ data. In this blog, we looked at how data can be scraped from Reddit.
We highlighted ways to collect Reddit data, including the official API, the Python Reddit API Wrapper, web scraping, and third-party data providers, and walked through extracting Reddit posts, subreddits, and profiles step by step. There are other ways to extract Reddit data as well; choose the approach that best fits your needs to scrape Reddit data and gain valuable insights.