How to Scrape Flight Data Using Python?

September 2, 2021

If you are planning a weekend trip and looking for a flight, you can use Kayak. After entering our search criteria and adding a few additional filters such as "Nonstop", we can see that the URL in our browser changes accordingly.


This URL can be broken down into several parts: origin, destination, start date, end date, and a suffix that tells Kayak to search only for nonstop connections and to sort the results.

origin = "ZRH"destination = "MXP"
startdate = "2019-09-06"
enddate = "2019-09-09" 
url = "https://www.kayak.com/flights/" + origin + "-" + destination + "/" + startdate + "/" + enddate + "?sort=bestflight_a&fs=stops=0"

The overall idea is to extract the flight data we need (for example, price, departure and arrival times) from the website's underlying HTML code. We rely on two packages to accomplish this. The first is Selenium, which controls your browser and opens the page automatically. The second is BeautifulSoup, which helps us turn the jumbled HTML into something more structured and readable. From this "soup" we can then easily pick out the pieces we are after.

Let's get started. First, we need to set up Selenium. To do so, we download a browser driver such as ChromeDriver (make sure it matches the version of Chrome you have installed) and place it in the same folder as our Python code. Then we load a couple of packages and tell Selenium to use ChromeDriver to open the URL we defined above.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

chrome_options = webdriver.ChromeOptions()              # options object (used again in the full script below)
driver = webdriver.Chrome("chromedriver.exe")            # chromedriver.exe sits next to this script
driver.implicitly_wait(20)                               # wait up to 20s for elements to appear
driver.get(url)                                          # open the Kayak search URL

Once the page has loaded, we need to figure out how to get at the information that matters to us. Take the departure time, for example. Using the browser's inspect feature, we can see that the 8:55pm departure time is wrapped in a span with the class "depart-time base-time".
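To make the idea concrete, here is a minimal sketch with a made-up HTML fragment of the same shape (Kayak's real markup is more deeply nested and changes over time); it simply shows how BeautifulSoup finds such a span by its full class string:

from bs4 import BeautifulSoup

# Made-up fragment mimicking the structure we see in the inspector
sample_html = """
<div class="section times">
  <span class="depart-time base-time">8:55 </span>
  <span class="time-meridiem meridiem">pm</span>
</div>
"""

soup = BeautifulSoup(sample_html, 'lxml')
span = soup.find('span', attrs={'class': 'depart-time base-time'})
print(span.getText())   # '8:55 ' (note the trailing character we strip later with [:-1])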


We can now search precisely for the classes we are interested in by passing the website's HTML code to BeautifulSoup. A basic loop can then be used to collect the results. We also have to reshape the results into logical departure-arrival time pairs, because every search result contains two departure times: one for the outbound flight and one for the return flight.

soup = BeautifulSoup(driver.page_source, 'lxml')

# collect the raw departure/arrival times and their am/pm markers
deptimes = soup.find_all('span', attrs={'class': 'depart-time base-time'})
arrtimes = soup.find_all('span', attrs={'class': 'arrival-time base-time'})
meridies = soup.find_all('span', attrs={'class': 'time-meridiem meridiem'})

deptime = []
for div in deptimes:
    deptime.append(div.getText()[:-1])

arrtime = []
for div in arrtimes:
    arrtime.append(div.getText()[:-1])

meridiem = []
for div in meridies:
    meridiem.append(div.getText())

# one row per search result: [outbound, return]
deptime = np.asarray(deptime)
deptime = deptime.reshape(int(len(deptime)/2), 2)

arrtime = np.asarray(arrtime)
arrtime = arrtime.reshape(int(len(arrtime)/2), 2)

# four am/pm markers per result: outbound dep/arr, return dep/arr
meridiem = np.asarray(meridiem)
meridiem = meridiem.reshape(int(len(meridiem)/4), 4)
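To see what the reshaping does, here is a quick illustration with made-up values: for two search results the scraped list alternates outbound and return departure times, and the reshape pairs them up, one row per result.

import numpy as np

# Hypothetical flat list: [result1 outbound, result1 return, result2 outbound, result2 return]
times = np.asarray(["8:55", "10:30", "7:20", "9:45"])
print(times.reshape(int(len(times)/2), 2))
# [['8:55' '10:30']
#  ['7:20' '9:45']]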

For the price, we use a similar approach. When inspecting the pricing element, however, we can see that Kayak uses varying classes for its price data. To catch all cases, we therefore use a regular expression. The price is also wrapped up a little more deeply, which is why we have to take a few extra steps to get to it.


regex = re.compile('Common-Booking-MultiBookProvider (.*)multi-row Theme-featured-large(.*)')
price_list = soup.find_all('div', attrs={'class': regex})

price = []
for div in price_list:
    price.append(int(div.getText().split('\n')[3][1:-1]))
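To see why the split and slice work, here is a hedged illustration with a made-up string of the shape we assume div.getText() returns for such a price element: the amount sits on the fourth line, prefixed by a currency symbol and followed by a trailing character.

# Made-up example of what div.getText() might look like for a price element
sample_text = "\nView Deal\nEconomy\n$128 \nkayak.com"
line = sample_text.split('\n')[3]   # '$128 '
print(int(line[1:-1]))              # 128 (strip the leading '$' and the trailing space)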

Now we’re going to put everything into a neat dataframe and see what we’ve got.

df = pd.DataFrame({"origin": origin,
                   "destination": destination,
                   "startdate": startdate,
                   "enddate": enddate,
                   "price": price,
                   "currency": "USD",
                   "deptime_o": [m+str(n) for m, n in zip(deptime[:, 0], meridiem[:, 0])],
                   "arrtime_d": [m+str(n) for m, n in zip(arrtime[:, 0], meridiem[:, 1])],
                   "deptime_d": [m+str(n) for m, n in zip(deptime[:, 1], meridiem[:, 2])],
                   "arrtime_o": [m+str(n) for m, n in zip(arrtime[:, 1], meridiem[:, 3])]})

That’s all there is to it. All of the information that was tangled up in the page's HTML code has been scraped and reorganized. The heavy lifting is done.

To make things a little easier, we wrap the code from above into a function and call that function for our three-day trips using different combinations of destinations and start dates. When sending several requests, Kayak may suspect we're a bot (and who can blame them?). The simplest way to avoid this is to rotate the browser's user agent regularly and to wait a few seconds between requests. Our entire code then looks like this:

# -*- using Python 3.7 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time

def scrape(origin, destination, startdate, days, requests):
    global results
    enddate = datetime.strptime(startdate, '%Y-%m-%d').date() + timedelta(days)
    enddate = enddate.strftime('%Y-%m-%d')
    url = "https://www.kayak.com/flights/" + origin + "-" + destination + "/" + startdate + "/" + enddate + "?sort=bestflight_a&fs=stops=0"
    print("\n" + url)
    #Rotate the user agent between requests
    chrome_options = webdriver.ChromeOptions()
    agents = ["Firefox/66.0.3", "Chrome/73.0.3683.68", "Edge/16.16299"]
    print("User agent: " + agents[(requests % len(agents))])
    chrome_options.add_argument('--user-agent="' + agents[(requests % len(agents))] + '"')
    chrome_options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome("chromedriver.exe", options=chrome_options, desired_capabilities=chrome_options.to_capabilities())
    driver.implicitly_wait(20)
    driver.get(url)

    #Check if Kayak thinks that we're a bot
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    if soup.find_all('p')[0].getText() == "Please confirm that you are a real KAYAK user.":
        print("Kayak thinks I'm a bot, which I am ... so let's wait a bit and try again")
        driver.close()
        time.sleep(20)
        return "failure"

    time.sleep(20)  #wait 20sec for the page to load
    soup = BeautifulSoup(driver.page_source, 'lxml')

    #Get the arrival and departure times
    deptimes = soup.find_all('span', attrs={'class': 'depart-time base-time'})
    arrtimes = soup.find_all('span', attrs={'class': 'arrival-time base-time'})
    meridies = soup.find_all('span', attrs={'class': 'time-meridiem meridiem'})
    deptime = []
    for div in deptimes:
        deptime.append(div.getText()[:-1])
    arrtime = []
    for div in arrtimes:
        arrtime.append(div.getText()[:-1])
    meridiem = []
    for div in meridies:
        meridiem.append(div.getText())
    deptime = np.asarray(deptime)
    deptime = deptime.reshape(int(len(deptime)/2), 2)
    arrtime = np.asarray(arrtime)
    arrtime = arrtime.reshape(int(len(arrtime)/2), 2)
    meridiem = np.asarray(meridiem)
    meridiem = meridiem.reshape(int(len(meridiem)/4), 4)

    #Get the price
    regex = re.compile('Common-Booking-MultiBookProvider (.*)multi-row Theme-featured-large(.*)')
    price_list = soup.find_all('div', attrs={'class': regex})
    price = []
    for div in price_list:
        price.append(int(div.getText().split('\n')[3][1:-1]))

    df = pd.DataFrame({"origin": origin,
                       "destination": destination,
                       "startdate": startdate,
                       "enddate": enddate,
                       "price": price,
                       "currency": "USD",
                       "deptime_o": [m+str(n) for m, n in zip(deptime[:, 0], meridiem[:, 0])],
                       "arrtime_d": [m+str(n) for m, n in zip(arrtime[:, 0], meridiem[:, 1])],
                       "deptime_d": [m+str(n) for m, n in zip(deptime[:, 1], meridiem[:, 2])],
                       "arrtime_o": [m+str(n) for m, n in zip(arrtime[:, 1], meridiem[:, 3])]})
    results = pd.concat([results, df], sort=False)

    driver.close()  #close the browser
    time.sleep(15)  #wait 15sec until the next request
    return "success"

#Create an empty dataframe
results = pd.DataFrame(columns=['origin', 'destination', 'startdate', 'enddate', 'deptime_o',
                                'arrtime_d', 'deptime_d', 'arrtime_o', 'currency', 'price'])
requests = 0
destinations = ['MXP', 'MAD']
startdates = ['2019-09-06', '2019-09-20', '2019-09-27']
for destination in destinations:
    for startdate in startdates:
        requests = requests + 1
        while scrape('ZRH', destination, startdate, 3, requests) != "success":
            requests = requests + 1

#Find the minimum price for each destination-startdate combination
results_agg = results.groupby(['destination', 'startdate'])['price'].min().reset_index().rename(columns={'min': 'price'})

Once we've looped over all combinations and grabbed the relevant data, we can easily visualize our results using a heatmap from seaborn.

import seaborn as sns
import matplotlib.pyplot as plt

# one row per destination, one column per start date, cell = minimum price
heatmap_results = pd.pivot_table(results_agg, values='price',
                                 index=['destination'],
                                 columns='startdate')

sns.set(font_scale=1.5)
plt.figure(figsize=(18, 6))
sns.heatmap(heatmap_results, annot=True, fmt='.0f')
plt.show()

Contact Scraping Intelligence for any queries today!
