So, you're looking for a job and would like to work smarter instead of harder to find something new and interesting? Why not create a web scraper that will gather and analyze job posting data for you?
Analyzing the URL and Page structure
First, we will need to look at a sample search results page from Indeed.
There are a few things to note regarding the URL structure:
The URL structure will be helpful as we develop a scraper that visits and collects information from a succession of pages, so keep it in mind for future reference.
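To make the pattern concrete, here is a minimal sketch of how the query, location, and result-offset parameters combine into the page URLs we will request later. The parameter names (`q`, `l`, `start`) are taken from the example URL used below; `build_url` is a helper name introduced here for illustration.

```python
from urllib.parse import urlencode

def build_url(query, city, start):
    # q = search query, l = location, start = result offset (10 results per page)
    params = {"q": query, "l": city, "start": start}
    return "https://www.indeed.com/jobs?" + urlencode(params)

# Reproduces the example URL used later in this article
print(build_url("data scientist $20,000", "New York", 10))
# → https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10
```

Stepping `start` by 10 (0, 10, 20, …) walks through successive result pages, which is exactly what the final scraper's inner loop will do.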
Each page displays 15 job posts. Five of them are "sponsored" jobs, which Indeed highlights outside the normal order of results; the other 10 are the organic results for the page being viewed.
All the information on this page is coded with HTML tags. HTML (HyperText Markup Language) is the markup that tells your browser how to display the content of a page when you visit it, and it defines the general structure and organization of the document. HTML elements also include attributes, which help keep track of where particular pieces of information are located within a page's structure.
Chrome users can inspect the HTML structure of the page by right-clicking on it and selecting "Inspect" from the menu that appears. A panel will emerge on the right-hand side of the window, with a long list of nested HTML tags containing the data currently visible in your browser.
There's a small box with an arrow icon in the upper-left corner of this panel; it turns blue when you click it. You can then move your mouse over the page's elements to see both the tags associated with each item and where that item sits in the page's HTML.
Now we'll use Python to retrieve the HTML from the page and start working on our scraper.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
Let's start by extracting a single page and figuring out the code to get each piece of data we need:
URL = "https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
# conducting a request of the stated URL above:
page = requests.get(URL)
# specifying a desired format of "page" using the html parser - this allows
# python to read the various components of the page, rather than treating it
# as one long string.
soup = BeautifulSoup(page.text, "html.parser")
# printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())
Prettify makes it much easier to read a page's HTML, and you'll get something like this:
Turning to the task at hand, we will search for and scrape five key pieces of information from every job posting: Job Title, Company Name, Location, Salary, and Job Summary.
As noted above, each job posting sits inside a div tag with the attribute "class": "row". From there, the job titles are listed under a tags with the attribute "title": (title). We can look up the value of a tag's attribute with tag["attribute"], so we can use it to fetch the job title for every job posting.
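As a tiny illustration of the tag["attribute"] lookup, consider this simplified stand-in for one result row (the markup below is a made-up miniature, not Indeed's exact HTML):

```python
from bs4 import BeautifulSoup

# A simplified stand-in for one Indeed result row
html = '<div class="row"><a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a></div>'
soup_demo = BeautifulSoup(html, "html.parser")

# Find the anchor tag, then read its "title" attribute with tag["attribute"]
a = soup_demo.find("a", attrs={"data-tn-element": "jobTitle"})
print(a["title"])  # → Data Scientist
```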
Extracting the job title data takes three steps:
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

extract_job_title_from_result(soup)
This will display an output like:
Company names are a bit more complicated: most appear in span tags with "class": "company", but some are instead found in span tags with "class": "result-link-source".
def extract_company_from_result(soup):
    companies = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        company = div.find_all(name="span", attrs={"class": "company"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            sec_try = div.find_all(name="span", attrs={"class": "result-link-source"})
            for span in sec_try:
                companies.append(span.text.strip())
    return companies

extract_company_from_result(soup)
Company names are displayed with a lot of white space around them, so appending .strip() removes it while fetching the data.
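For example, a raw company string pulled from the page might look like the padded, hypothetical value below, and .strip() trims the surrounding whitespace:

```python
# Hypothetical raw text as it comes off the page, padded with newlines and spaces
raw_company = "\n    Acme Analytics    \n"

# .strip() removes leading and trailing whitespace, including newlines
print(repr(raw_company.strip()))  # → 'Acme Analytics'
```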
Locations come under span tags. The span tags are often nested, such that the location text is sometimes within "class": "location" attributes and sometimes within "itemprop": "addressLocality". However, a simple for loop can check all the span tags for text wherever it might be and fetch the relevant data.
def extract_location_from_result(soup):
    locations = []
    spans = soup.findAll("span", attrs={"class": "location"})
    for span in spans:
        locations.append(span.text)
    return locations

extract_location_from_result(soup)
Salary is the most complicated piece of information to scrape from job postings. Many postings do not include any salary information at all, and among those that do, it can appear in one of two different places. We therefore need a script that can look in multiple places for the information and record a placeholder "Nothing_found" value for postings with no salary.
Some salaries come under nobr tags, while others come under div tags with "class": "sjcl" and need to be separated out via nested div tags with no attributes.
def extract_salary_from_result(soup):
    salaries = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        try:
            salaries.append(div.find("nobr").text)
        except:
            try:
                div_two = div.find(name="div", attrs={"class": "sjcl"})
                div_three = div_two.find("div")
                salaries.append(div_three.text.strip())
            except:
                salaries.append("Nothing_found")
    return salaries

extract_salary_from_result(soup)
Last but not least, the job summaries. Unfortunately, the full job summaries are not contained in the HTML of an Indeed search results page; nevertheless, we can gather some information about each job from what is provided. Selenium is a set of tools a web scraper could use to browse through the individual links on the site and extract data from the full job advertisements.
Summaries are found under span tags with "class": "summary". A simple for loop can go through all the span tags and extract the text we need.
def extract_summary_from_result(soup):
    summaries = []
    spans = soup.findAll("span", attrs={"class": "summary"})
    for span in spans:
        summaries.append(span.text.strip())
    return summaries

extract_summary_from_result(soup)
We now have all the pieces of the scraper. Next, we need to combine them into a final scraper that extracts the necessary information from every job post, keeps each post separate from the others, and merges all the job information into a single data frame.
We can set the initial conditions by specifying a few pieces of information:
max_results_per_city = 100
city_set = ["New+York", "Chicago", "San+Francisco", "Austin", "Seattle",
            "Los+Angeles", "Philadelphia", "Atlanta", "Dallas", "Pittsburgh",
            "Portland", "Phoenix", "Denver", "Houston", "Miami",
            "Washington+DC", "Boulder"]
columns = ["city", "job_title", "company_name", "location", "summary", "salary"]
sample_df = pd.DataFrame(columns=columns)
It goes without saying that the more results you seek and the more cities you look at, the longer the scraping process takes. This isn't a big deal if you start your scraper before going out or going to bed, but it's something to think about.
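As a rough sketch, the one-second pause alone puts a floor on the runtime (the numbers come from the settings above; real network requests add more time on top):

```python
# Lower bound on runtime from the sleep delay alone
cities = 17                  # len(city_set) above
pages_per_city = 100 // 10   # max_results_per_city, at 10 organic results per page
min_seconds = cities * pages_per_city * 1  # 1-second pause per page grab

print(min_seconds)  # → 170
```

So even before any request or parsing time, this configuration sleeps for about three minutes in total.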
The actual scraper is put together using the patterns we noticed in the URL structure above. We can use this knowledge to design a loop that visits every page in a precise order to retrieve data because we know how the URLs will be patterned for each page.
# scraping code:
for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get("http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l="
                            + str(city) + "&start=" + str(start))
        time.sleep(1)  # ensuring at least 1 second between page grabs
        soup = BeautifulSoup(page.text, "lxml", from_encoding="utf-8")
        for div in soup.find_all(name="div", attrs={"class": "row"}):
            # specifying row num for index of job posting in dataframe
            num = (len(sample_df) + 1)
            # creating an empty list to hold the data for each posting
            job_post = []
            # append city name
            job_post.append(city)
            # grabbing job title
            for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
                job_post.append(a["title"])
            # grabbing company name
            company = div.find_all(name="span", attrs={"class": "company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class": "result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            # grabbing location name
            c = div.findAll("span", attrs={"class": "location"})
            for span in c:
                job_post.append(span.text)
            # grabbing summary text
            d = div.findAll("span", attrs={"class": "summary"})
            for span in d:
                job_post.append(span.text.strip())
            # grabbing salary
            try:
                job_post.append(div.find("nobr").text)
            except:
                try:
                    div_two = div.find(name="div", attrs={"class": "sjcl"})
                    div_three = div_two.find("div")
                    job_post.append(div_three.text.strip())
                except:
                    job_post.append("Nothing_found")
            # appending list of job post info to dataframe at index num
            sample_df.loc[num] = job_post

# saving sample_df as a local csv file - define your own local path to save contents
sample_df.to_csv("[filepath].csv", encoding="utf-8")
After a short while, you will have your own data frame of scraped job postings.
The output will look something like this:
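Once the data frame exists, simple pandas operations give a first look at the results. The column names below match the sample_df defined earlier; the two rows are made-up stand-ins so the snippet runs on its own.

```python
import pandas as pd

# Made-up rows standing in for real scraped postings, using the same columns
sample_df = pd.DataFrame(
    [
        ["New+York", "Data Scientist", "Acme Corp", "New York, NY", "Build models.", "$120,000 a year"],
        ["Chicago", "Data Analyst", "Beta LLC", "Chicago, IL", "Analyze data.", "Nothing_found"],
    ],
    columns=["city", "job_title", "company_name", "location", "summary", "salary"],
)

# Postings per city, and how many postings actually listed a salary
postings_per_city = sample_df["city"].value_counts()
with_salary = (sample_df["salary"] != "Nothing_found").sum()

print(postings_per_city)
print(with_salary)  # → 1
```

From here you can filter by salary, group by city, or feed the summaries into whatever analysis suits your job search.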
Zoltan Bettenbuk is the CTO of ScraperAPI - helping thousands of companies get access to the data they need. He's a well-known expert in data processing and web scraping. With more than 15 years of experience in software development, product management, and leadership, Zoltan frequently publishes his insights on our blog as well as on Twitter and LinkedIn.