Product research on Amazon used to mean hours of manual browsing. Now, a well-written Python script can pull hundreds of ranked product records in minutes. The Amazon Best Sellers list is one of the most data-rich pages on the internet for e-commerce research. Rankings shift every hour based on real sales, not estimates, which gives the data a freshness that most paid market tools cannot match.
This guide is written for developers who want working code and a clear understanding of what is involved. It covers library selection, two complete scraper implementations, pagination, anti-detection, and data storage. Businesses that prefer managed delivery without maintaining scrapers can use Scraping Intelligence to access structured Amazon product data on a set schedule.
Every product category on Amazon carries a Best Sellers page. The products listed there are ranked by recent purchase volume, and the rankings refresh every hour. That frequency is what makes the data genuinely useful. A product jumping from rank 80 to rank 12 inside a single day tells you something a weekly sales report never could.
Each product record on a Best Sellers page carries several extractable fields: best-seller rank, product title, price, star rating, review count, ASIN, product URL, image URL, brand name, and category path. The field-mapping table later in this guide pins each one to its location in the page HTML.
Tool selection depends on scale and frequency. A single-category job that runs once a week needs something different from a daily pipeline covering 30 categories across multiple marketplaces. The table below covers the standard options and their roles.
| Library or Tool | Role | When to Use It |
|---|---|---|
| requests | HTTP client | Lightweight scraping of static page content |
| BeautifulSoup4 | HTML parser | Navigating product cards and extracting field values |
| Selenium or Playwright | Browser automation | Pages where price or review data loads via JavaScript |
| Scrapy | Full crawl framework | Multi-category pipelines that run on a schedule |
| Rotating proxy service | IP management | Distributing requests to avoid IP-level blocks |
| pandas | Data processing | Cleaning, deduplicating, and exporting scraped records |
| fake-useragent | Header rotation | Cycling browser fingerprints across requests |
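Everything in the table except the proxy service installs from PyPI. Assuming a standard Python 3 environment, one command covers the stack (lxml is included because the parser code below uses it):

```bash
pip install requests beautifulsoup4 lxml pandas fake-useragent scrapy
```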
The scraper below is built on requests and BeautifulSoup. It rotates user agents, retries on failure with increasing wait times, spaces requests to stay under detection thresholds, and exports results to CSV. This is the starting point the Scraping Intelligence team uses for single-category Amazon Best Sellers extraction before scaling up.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
from fake_useragent import UserAgent


class AmazonBestSellerScraper:
    # Scraping Intelligence | scrapingintelligence.com

    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()
        self.products = []

    def get_headers(self):
        # Fresh user agent on every request; the rest mimics a real browser
        return {
            'User-Agent': self.ua.random,
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Connection': 'keep-alive',
            'Referer': 'https://www.amazon.com',
            'DNT': '1'
        }

    def fetch_page(self, url, retries=3):
        for attempt in range(retries):
            try:
                resp = self.session.get(url, headers=self.get_headers(), timeout=15)
                if resp.status_code == 200:
                    return resp.text
                elif resp.status_code == 503:
                    # Scale the wait by attempt number so backoff is progressive
                    print(f'Rate limited. Waiting before attempt {attempt + 1}...')
                    time.sleep(random.uniform(5, 15) * (attempt + 1))
            except requests.RequestException as e:
                print(f'Request error on attempt {attempt + 1}: {e}')
                time.sleep(random.uniform(3, 8) * (attempt + 1))
        return None

    def parse_products(self, html):
        soup = BeautifulSoup(html, 'lxml')
        cards = soup.select('.zg-grid-general-faceout')
        for card in cards:
            try:
                rank_el = card.select_one('.zg-bdg-text')
                title_el = card.select_one('._cDEzb_p13n-sc-css-line-clamp-3_g3dy1')
                if not title_el:
                    # Fallback: some categories use an older title class
                    title_el = card.select_one('.p13n-sc-truncate-desktop-type2')
                price_el = card.select_one('.p13n-sc-price')
                rating_el = card.select_one('.a-icon-alt')
                reviews_el = card.select_one('.a-size-small')
                link_el = card.select_one('a.a-link-normal')
                asin = ''
                if link_el and '/dp/' in (link_el.get('href') or ''):
                    asin = link_el['href'].split('/dp/')[1].split('/')[0].split('?')[0]
                self.products.append({
                    'rank': rank_el.text.strip() if rank_el else 'N/A',
                    'title': title_el.text.strip() if title_el else 'N/A',
                    'price': price_el.text.strip() if price_el else 'N/A',
                    'rating': rating_el.text.strip() if rating_el else 'N/A',
                    'reviews': reviews_el.text.strip() if reviews_el else 'N/A',
                    'asin': asin,
                    'url': f'https://www.amazon.com/dp/{asin}' if asin else 'N/A'
                })
            except Exception as e:
                print(f'Skipped card: {e}')

    def scrape_category(self, category_url, max_pages=2):
        for page in range(1, max_pages + 1):
            if page == 1:
                url = category_url
            else:
                url = f'{category_url}ref=zg_bs_pg_{page}?_encoding=UTF8&pg={page}'
            print(f'Page {page}: {url}')
            html = self.fetch_page(url)
            if html:
                self.parse_products(html)
            wait = random.uniform(2, 5)
            print(f'  Sleeping {wait:.1f}s...')
            time.sleep(wait)

    def save_to_csv(self, filename='best_sellers.csv'):
        df = pd.DataFrame(self.products)
        df.drop_duplicates(subset='asin', inplace=True)
        df.to_csv(filename, index=False, encoding='utf-8-sig')
        print(f'Saved {len(df)} products to {filename}')


if __name__ == '__main__':
    scraper = AmazonBestSellerScraper()
    scraper.scrape_category(
        'https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/',
        max_pages=2
    )
    scraper.save_to_csv('electronics_best_sellers.csv')
```
Once a project grows past a handful of categories or needs to run on a daily schedule, Scrapy is the right move. The request queue manages concurrency, the pipeline system handles export, and middleware lets you plug in proxy rotation without touching the core spider logic. Scraping Intelligence uses Scrapy for any Amazon data extraction job covering four or more categories regularly.
```python
import scrapy
import re


class BestSellersSpider(scrapy.Spider):
    name = 'best_sellers'
    custom_settings = {
        'DOWNLOAD_DELAY': 3,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
        },
        'FEEDS': {
            'best_sellers.csv': {'format': 'csv', 'overwrite': True}
        }
    }
    start_urls = [
        'https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/',
        'https://www.amazon.com/Best-Sellers-Books/zgbs/books/',
    ]

    def parse(self, response):
        for card in response.css('.zg-grid-general-faceout'):
            title = card.css('._cDEzb_p13n-sc-css-line-clamp-3_g3dy1::text').get()
            if not title:
                title = card.css('.p13n-sc-truncate-desktop-type2::text').get()
            link = card.css('a.a-link-normal::attr(href)').get()
            asin_match = re.search(r'/dp/([A-Z0-9]{10})', link or '')
            yield {
                'rank': card.css('.zg-bdg-text::text').get('N/A').strip(),
                'title': title.strip() if title else 'N/A',
                'price': card.css('.p13n-sc-price::text').get('N/A').strip(),
                'rating': card.css('.a-icon-alt::text').get('N/A').strip(),
                'asin': asin_match.group(1) if asin_match else '',
                'source': response.url
            }
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# Run from terminal:
# scrapy runspider best_sellers_spider.py -o output.csv
```
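The middleware hook mentioned above is where proxy rotation plugs in. A minimal sketch, assuming a hypothetical `middlewares.py` next to the spider and placeholder proxy endpoints from your provider:

```python
# middlewares.py -- proxy-rotation sketch; endpoints below are placeholders
import random


class ProxyRotationMiddleware:
    PROXIES = [
        'http://user:pass@proxy1.example.com:8000',
        'http://user:pass@proxy2.example.com:8000',
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy']
        request.meta['proxy'] = random.choice(self.PROXIES)
```

Enable it by adding `'DOWNLOADER_MIDDLEWARES': {'middlewares.ProxyRotationMiddleware': 350}` to `custom_settings`; the spider logic itself stays untouched.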
Amazon's detection system looks at multiple signals in parallel: request rate, header patterns, IP history, and behavioral consistency across sessions. Fixing one layer without the others gives inconsistent results. What works reliably in production:

- Rotate user agents on every request; fake-useragent handles this with one call.
- Randomize delays between requests instead of firing on a fixed interval.
- Route traffic through a rotating proxy pool so no single IP absorbs the full request volume (see the sketch below).
- Reuse a session within each crawl so cookie behavior stays consistent.
- Retry failures with progressive backoff rather than immediately re-requesting.
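The proxy layer is the one piece the requests-based scraper above does not include. A minimal sketch of how it would thread into `fetch_page`, assuming a rotating-proxy gateway from your provider (the endpoint below is a placeholder, not a real service):

```python
import requests

# Placeholder gateway -- substitute the endpoint your proxy provider issues
PROXY_ENDPOINT = 'http://username:password@gateway.example.com:8000'


def fetch_via_proxy(session, url, headers):
    # A rotating gateway assigns a fresh exit IP per request on its side,
    # so the same endpoint is passed for both schemes
    proxies = {'http': PROXY_ENDPOINT, 'https': PROXY_ENDPOINT}
    resp = session.get(url, headers=headers, proxies=proxies, timeout=15)
    return resp.text if resp.status_code == 200 else None
```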
The table below maps extractable fields to their location in the page HTML and notes the primary business use case for each one. CSS selectors occasionally change when Amazon updates its frontend, so validating them against the live page before any major run is a good habit; the snippet after the table shows one quick way to do it.
| Data Field | CSS Selector or Source | Primary Use |
|---|---|---|
| Best Seller Rank | .zg-bdg-text | Track rank movement over time |
| Product Title | ._cDEzb_p13n-sc-css-line-clamp-3_g3dy1 | Keyword research and competitive analysis |
| ASIN | Parsed from /dp/ in href | Unique product ID for downstream API calls |
| Price | .p13n-sc-price | Price monitoring and repricing decisions |
| Star Rating | .a-icon-alt | Quality filter across product categories |
| Review Count | .a-size-small | Demand signal and social proof measure |
| Product Image URL | src attribute of .s-image | Visual catalog building and ML training sets |
| Brand Name | .a-size-small.a-color-base | Brand share and market concentration analysis |
| Category Path | Breadcrumb elements | Taxonomy mapping and segment classification |
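One quick validation: fetch a category page, count matches for each selector in the table, and flag anything that comes back empty. A sketch reusing the selectors above:

```python
from bs4 import BeautifulSoup


def validate_selectors(html):
    # Flag selectors that no longer match anything -- the usual sign
    # that Amazon has shipped a frontend change
    soup = BeautifulSoup(html, 'lxml')
    checks = {
        'product card': '.zg-grid-general-faceout',
        'rank badge': '.zg-bdg-text',
        'price': '.p13n-sc-price',
        'rating': '.a-icon-alt',
    }
    for name, selector in checks.items():
        count = len(soup.select(selector))
        flag = '' if count else '  <-- possibly stale'
        print(f'{name}: {count} matches{flag}')
```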
Amazon displays 50 products per page and provides one additional page of 50, giving 100 products per category total. Subcategories carry their own independent Best Sellers lists with the same two-page structure. The pagination URL pattern is predictable and easy to generate programmatically.
```python
def scrape_all_pages(self, base_url):
    # parse_products() appends into self.products, so track where this
    # call started and return only the records it added
    start = len(self.products)
    for page_num in range(1, 3):  # Best Sellers lists span exactly two pages
        if page_num == 1:
            url = base_url
        else:
            url = f"{base_url}ref=zg_bs_pg_{page_num}?_encoding=UTF8&pg={page_num}"
        print(f'Fetching page {page_num}: {url}')
        html = self.fetch_page(url)
        if html:
            self.parse_products(html)
        time.sleep(random.uniform(2.0, 4.5))
    return self.products[start:]
```
Storage format is dictated by what happens to the data after extraction. Scraping Intelligence delivers CSV for clients doing spreadsheet analysis and JSON or database format for teams feeding the data into internal tools or APIs. Both options are covered below.
```python
def save_to_csv(products, filename='best_sellers.csv'):
    df = pd.DataFrame(products)
    df.drop_duplicates(subset='asin', inplace=True)
    df['scraped_at'] = pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    print(f'Exported {len(df)} records to {filename}')
```

```python
import sqlite3


def save_to_sqlite(products, db_name='amazon_data.db'):
    df = pd.DataFrame(products)
    df['scraped_at'] = pd.Timestamp.now()
    conn = sqlite3.connect(db_name)
    df.to_sql('best_sellers', conn, if_exists='append', index=False)
    conn.close()
    print(f'Saved {len(df)} records to {db_name}')
```
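Because the SQLite path appends a snapshot on every run, rank movement over time (the primary use case from the field table) falls out of a simple query. A sketch, assuming the schema created above:

```python
import sqlite3
import pandas as pd


def rank_history(asin, db_name='amazon_data.db'):
    # One row per scrape run for this ASIN, ordered chronologically
    conn = sqlite3.connect(db_name)
    df = pd.read_sql_query(
        'SELECT scraped_at, "rank", price FROM best_sellers '
        'WHERE asin = ? ORDER BY scraped_at',
        conn, params=(asin,)
    )
    conn.close()
    return df
```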
These are the errors that show up repeatedly across Amazon scraping projects. Each one has a direct fix.
| Error | Root Cause | Fix |
|---|---|---|
| 503 Service Unavailable | Request rate too high | Add random delays; rotate proxy IPs |
| Empty product list | CSS selectors are outdated | Inspect live page HTML and update selectors |
| CAPTCHA page returned | Bot pattern detected by Amazon | Slow requests; integrate CAPTCHA solver API |
| Price field missing | Price loads via JavaScript | Switch to Playwright or Selenium for rendering |
| Duplicate records in output | Pagination pages overlap on shared listings | Deduplicate on the ASIN field using pandas |
| Connection timeout | Network or proxy failure | Add retry logic with progressive backoff |
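Several of these failures (503, CAPTCHA, empty product list) can surface as a page that returns 200 but contains no product cards. A small heuristic check before parsing saves debugging time; the marker strings below are ones commonly seen on Amazon's interstitial pages, so treat them as assumptions to verify against what you actually receive:

```python
def looks_blocked(html):
    # Heuristic only: markers typically present on Amazon's
    # CAPTCHA / robot-check interstitials
    markers = ['captcha', 'robot check', 'api-services-support@amazon.com']
    lowered = html.lower()
    return any(m in lowered for m in markers)
```

If `looks_blocked(html)` returns True, back off and rotate identity before retrying rather than parsing the page.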
Amazon runs separate storefronts in more than ten countries. The Best Sellers URL structure is identical across all of them, so scraping international Amazon product data is mostly a matter of swapping the base domain. All the selector and pagination logic carries over without modification.
```python
import time
import random

MARKETPLACES = {
    'US': 'https://www.amazon.com/Best-Sellers/zgbs/',
    'UK': 'https://www.amazon.co.uk/Best-Sellers/zgbs/',
    'DE': 'https://www.amazon.de/Best-Sellers/zgbs/',
    'IN': 'https://www.amazon.in/gp/bestsellers/',
    'JP': 'https://www.amazon.co.jp/Best-Sellers/zgbs/',
    'CA': 'https://www.amazon.ca/Best-Sellers/zgbs/',
    'FR': 'https://www.amazon.fr/Best-Sellers/zgbs/',
    'IT': 'https://www.amazon.it/Best-Sellers/zgbs/',
    'ES': 'https://www.amazon.es/Best-Sellers/zgbs/',
    'AU': 'https://www.amazon.com.au/Best-Sellers/zgbs/'
}

for market, base_url in MARKETPLACES.items():
    url = f'{base_url}electronics/'
    print(f'Scraping {market}: {url}')
    # Hand off to the scraper built earlier, e.g.:
    # scraper.scrape_category(url)
    time.sleep(random.uniform(3, 7))
```
The techniques covered here give you everything needed to extract Amazon Best Sellers data reliably: a working single-category scraper, a Scrapy implementation for larger jobs, pagination logic, detection countermeasures, and storage options. The BeautifulSoup approach is the right starting point. Scrapy takes over once the scope or frequency grows beyond what a simple script can sustain.
Teams that need Amazon product data on a consistent schedule but do not own the extraction infrastructure can contact Scraping Intelligence. The service covers Amazon Best Sellers scraping across all major categories and marketplaces, with hourly delivery, custom schemas, and selector maintenance included.