Reddit is a forum-style social media platform used for content discovery, making friends, career advice, news aggregation, and more. The platform holds a trove of valuable data, from niche interests to industry-specific insights, which makes it very useful for researchers, analysts, and marketers: they can extract Reddit data to gain competitive insights and make their businesses more sustainable. In this comprehensive blog post, you will learn how to extract data from Reddit.
Reddit hosts a diverse range of interest-based communities covering almost every subject. Scraped Reddit data can be used for the following purposes:
Reddit is packed with ideas and inspiration for creating content. It is a hidden gem that content writers can leverage by scraping data: they can surface trending topics and discussions and turn them into concise, engaging content.
Reddit is also a place where people share their opinions and emotions about countless topics. You can scrape subreddits and run sentiment analysis to learn whether the feelings about your services, products, or brand are positive, negative, or neutral.
Scraping Reddit data provides comprehensive insights into customer preferences and needs, giving a clear picture of the current market. It can also support competitive analysis by extracting competitors’ data.
You can extract numerous types of data from Reddit, including posts, comments, subreddits, and user profiles, and there are several ways to collect it:
Reddit offers a robust API that lets developers pull the data they want programmatically. Accessed with OAuth credentials, Reddit’s API is one of the most reliable and safe ways to use its data.
The Python Reddit API Wrapper (PRAW) is a Python library that provides easy access to Reddit’s API, making it simple to scrape data and handle a large volume of queries.
Web scraping suits those who need customized data collection. With web scraping, extracting comments, subreddits, posts, and profiles becomes straightforward. However, it is good practice to follow Reddit’s terms and conditions.
Organizations that don’t wish to build their own infrastructure from scratch can use professional data providers, which offer ready-to-use Reddit data scrapers.
There are many ways to extract Reddit posts, subreddits, and profiles, but we will do it with Python, a widely used programming language with excellent libraries for web scraping. We will use PRAW to scrape Reddit data.
First of all, install PRAW by running the command below at the command prompt:
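pip install praw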
Next, create a Reddit app from your Reddit account’s app preferences page. Take note of the app’s client_id, client_secret, and user_agent values; we will use them with Python and PRAW to connect to Reddit.
Now it’s time to create a PRAW instance and connect it to Reddit. There are two types of PRAW instances: 1) a read-only instance, which can only fetch publicly available data, and 2) an authorized instance, which can also act on behalf of your account (for example, posting or voting).
Write the following code in a Python editor:
import praw

# Read-only instance: enough for fetching public data
reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # a descriptive user agent

# Authorized instance: also allows actions on behalf of the account
reddit_authorized = praw.Reddit(client_id="",      # your client id
                                client_secret="",  # your client secret
                                user_agent="",     # a descriptive user agent
                                username="",       # your Reddit username
                                password="")       # your Reddit password
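To confirm which mode an instance is in, PRAW exposes a read_only flag; a quick optional check looks like this:

print(reddit_read_only.read_only)   # True when no username/password is supplied
print(reddit_authorized.read_only)  # False once valid account credentials are set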
Great! We have now created PRAW instances. The next step is to utilize Reddit’s API to extract data. In this blog, we will use the read-only instance.
There are many ways to extract Reddit subreddits; the one that we are using will categorize subreddit posts as new, top, controversial, hot, and more.
Write the following code:
import praw
import pandas as pd

reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # a descriptive user agent

subreddit = reddit_read_only.subreddit("redditdev")

print("Display Name:", subreddit.display_name)
print("Title:", subreddit.title)
print("Description:", subreddit.description)
Now, we will extract 3 posts from the subreddit:
subreddit = reddit_read_only.subreddit("Python")

for post in subreddit.hot(limit=3):
    print(post.title)
    print()
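The same read-only instance supports the other listings mentioned earlier; for example, a minimal sketch for the new and controversial feeds (these are standard PRAW listing methods):

for post in subreddit.new(limit=3):            # newest submissions
    print(post.title)

for post in subreddit.controversial(time_filter="week", limit=3):  # most controversial this week
    print(post.title)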
Once that works, we will save the posts in a pandas DataFrame:
posts = subreddit.top(time_filter="month")

posts_dict = {"Title": [], "Post Text": [],
              "ID": [], "Score": [],
              "Total Comments": [], "Post URL": []
              }

for post in posts:
    posts_dict["Title"].append(post.title)
    posts_dict["Post Text"].append(post.selftext)
    posts_dict["ID"].append(post.id)
    posts_dict["Score"].append(post.score)
    posts_dict["Total Comments"].append(post.num_comments)
    posts_dict["Post URL"].append(post.url)

top_posts = pd.DataFrame(posts_dict)
top_posts
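From here the DataFrame can be inspected or exported like any other; for example, writing it to a CSV file (the filename is only an illustration):

top_posts.to_csv("top_posts.csv", index=False)  # export the scraped posts for later analysis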
To extract comments from a Reddit post, you will need the post URL. After getting the URL, create a submission object as shown in the following code snippet.
import praw
import pandas as pd

reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # a descriptive user agent

url = "https://www.reddit.com/r/IAmA/comments/m8n4vt/im_bill_gates_cochair_of_the_bill_and_melinda/"
submission = reddit_read_only.submission(url=url)
Here we will extract the comments from the chosen post using the MoreComments object available in the PRAW module. We iterate over the submission’s comments with a for loop and append each comment body to the post_comments list. Some items in a comment tree are not comments at all but MoreComments objects, which stand in for the "load more comments" links on the page; the if statement inside the loop skips them so that only actual comments are collected. Finally, we convert the extracted list into a pandas DataFrame.
from praw.models import MoreComments

post_comments = []

for comment in submission.comments:
    if isinstance(comment, MoreComments):
        continue
    post_comments.append(comment.body)

comments_df = pd.DataFrame(post_comments, columns=['comment'])
comments_df
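Note that skipping MoreComments objects drops the replies hidden behind "load more comments" links. If you need the full comment tree, PRAW’s replace_more method can expand those stubs first; a minimal sketch (this may trigger many extra API requests on large threads):

submission.comments.replace_more(limit=None)                   # expand every "load more comments" stub
all_comments = [c.body for c in submission.comments.list()]    # flattened list of all comment bodies
print(len(all_comments))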
In this section, we will see how Reddit profile pages can be scraped. For extracting posts from profile pages, we will use old.reddit.com, as it has a simpler HTML structure and fewer ads, which makes the profile scraping task easier.
We will use the following URL as an example:
https://old.reddit.com/user/scraping_intelligence/submitted?count=25&after=t3_191n6zm
The count parameter in the URL above controls how many results are rendered on the HTML page, while the after parameter is the pagination cursor: the ID (fullname) of the post to start after.
Let’s integrate our logic into Python code:
import json
import asyncio
from typing import List, Dict, Literal, Optional
from datetime import datetime

from httpx import AsyncClient, Response
from loguru import logger as log
from parsel import Selector

# shared async HTTP client; a browser-like User-Agent helps avoid basic blocking
client = AsyncClient(
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
    follow_redirects=True,
)
def parse_user_posts(response: Response) -> Dict:
    """parse user posts from a user profile page"""
    selector = Selector(response.text)
    data = []
    for box in selector.xpath("//div[@id='siteTable']/div[contains(@class, 'thing')]"):
        author = box.xpath("./@data-author").get()
        link = box.xpath("./@data-permalink").get()
        publishing_date = box.xpath("./@data-timestamp").get()
        publishing_date = datetime.fromtimestamp(int(publishing_date) / 1000.0).strftime('%Y-%m-%dT%H:%M:%S.%f%z') if publishing_date else None
        comment_count = box.xpath("./@data-comments-count").get()
        post_score = box.xpath("./@data-score").get()
        data.append({
            "authorId": box.xpath("./@data-author-fullname").get(),
            "author": author,
            "authorProfile": "https://www.reddit.com/user/" + author if author else None,
            "postId": box.xpath("./@data-fullname").get(),
            "postLink": "https://www.reddit.com" + link if link else None,
            "postTitle": box.xpath(".//p[@class='title']/a/text()").get(),
            "postSubreddit": box.xpath("./@data-subreddit-prefixed").get(),
            "publishingDate": publishing_date,
            "commentCount": int(comment_count) if comment_count else None,
            "postScore": int(post_score) if post_score else None,
            "attachmentType": box.xpath("./@data-type").get(),
            "attachmentLink": box.xpath("./@data-url").get(),
        })
    next_page_url = selector.xpath("//span[@class='next-button']/a/@href").get()
    return {"data": data, "url": next_page_url}
async def scrape_user_posts(username: str, sort: Literal["new", "top", "controversial"], max_pages: Optional[int] = None) -> List[Dict]:
    """scrape a user's submitted posts, following pagination"""
    url = f"https://old.reddit.com/user/{username}/submitted/?sort={sort}"
    response = await client.get(url)
    data = parse_user_posts(response)
    post_data, next_page_url = data["data"], data["url"]
    while next_page_url and (max_pages is None or max_pages > 0):
        response = await client.get(next_page_url)
        data = parse_user_posts(response)
        next_page_url = data["url"]
        post_data.extend(data["data"])
        if max_pages is not None:
            max_pages -= 1
    log.success(f"scraped {len(post_data)} posts from the {username} reddit profile")
    return post_data
Here, we define two functions for extracting Reddit profile posts: 1) parse_user_posts() and 2) scrape_user_posts().
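To run the scraper, drive the async function with asyncio; a minimal sketch (the username comes from the example URL above, and the page limit is arbitrary):

if __name__ == "__main__":
    user_posts = asyncio.run(scrape_user_posts("scraping_intelligence", sort="new", max_pages=2))
    print(json.dumps(user_posts[:2], indent=2))   # preview the first two scraped posts as JSON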
We can scrape profile comments in a similar way to the profile posts we extracted above; we only need to change the starting URL and the parsing logic:
def parse_user_comments(response: Response) -> Dict:
    """parse user comments from a user profile page"""
    selector = Selector(response.text)
    data = []
    for box in selector.xpath("//div[@id='siteTable']/div[contains(@class, 'thing')]"):
        author = box.xpath("./@data-author").get()
        link = box.xpath("./@data-permalink").get()
        parent_link = box.xpath(".//p[@class='parent']/a[@class='title']/@href").get()
        dislikes = box.xpath(".//span[contains(@class, 'dislikes')]/@title").get()
        upvotes = box.xpath(".//span[contains(@class, 'likes')]/@title").get()
        downvotes = box.xpath(".//span[contains(@class, 'unvoted')]/@title").get()
        data.append({
            "authorId": box.xpath("./@data-author-fullname").get(),
            "author": author,
            "authorProfile": "https://www.reddit.com/user/" + author if author else None,
            "commentId": box.xpath("./@data-fullname").get(),
            "commentLink": "https://www.reddit.com" + link if link else None,
            "commentBody": "".join(box.xpath(".//div[contains(@class, 'usertext-body')]/div/p/text()").getall()).replace("\n", ""),
            "attachedCommentLinks": box.xpath(".//div[contains(@class, 'usertext-body')]/div/p/a/@href").getall(),
            "publishingDate": box.xpath(".//time/@datetime").get(),
            "dislikes": int(dislikes) if dislikes else None,
            "upvotes": int(upvotes) if upvotes else None,
            "downvotes": int(downvotes) if downvotes else None,
            "replyTo": {
                "postTitle": box.xpath(".//p[@class='parent']/a[@class='title']/text()").get(),
                "postLink": "https://www.reddit.com" + parent_link if parent_link else None,
                "postAuthor": box.xpath(".//p[@class='parent']/a[contains(@class, 'author')]/text()").get(),
                "postSubreddit": box.xpath("./@data-subreddit-prefixed").get(),
            }
        })
    next_page_url = selector.xpath("//span[@class='next-button']/a/@href").get()
    return {"data": data, "url": next_page_url}
async def scrape_user_comments(username: str, sort: Literal["new", "top", "controversial"], max_pages: Optional[int] = None) -> List[Dict]:
    """scrape a user's comments, paginating the same way as scrape_user_posts"""
    comment_data, next_page_url = [], f"https://old.reddit.com/user/{username}/comments/?sort={sort}"
    while next_page_url and (max_pages is None or max_pages > 0):
        data = parse_user_comments(await client.get(next_page_url))
        comment_data.extend(data["data"])
        next_page_url = data["url"]
        max_pages = max_pages - 1 if max_pages is not None else None
    return comment_data
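As before, the async function can be driven with asyncio.run; a quick usage sketch with the same example username:

user_comments = asyncio.run(scrape_user_comments("scraping_intelligence", sort="new", max_pages=1))
print(json.dumps(user_comments[:2], indent=2))   # preview the first two scraped comments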
Scraping Reddit posts, subreddits, and profiles can help businesses in many ways: content writers can find ideas for effective and engaging content, analysts can run sentiment analysis to learn how people feel about products, services, and brands, and market researchers can study competitors’ data. In this blog, we looked at how data can be scraped from Reddit.
We highlighted ways to collect Reddit data, including the official API, the Python Reddit API Wrapper, web scraping, and third-party data providers, and walked through extracting Reddit posts, subreddits, and profiles step by step. There are other ways to extract Reddit data as well; choose the approach that best fits your needs to scrape Reddit data and gain valuable insights.