How to Extract LinkedIn Company Data Using Python?

March 1, 2021

Introduction

LinkedIn is the leading professional social network, and it is one of the best sources of job-related details: companies, people, posts, and much more can be extracted from it. LinkedIn has around 706 million members. By using web scraping, you can collect this data for analysis. If you want to learn how to scrape LinkedIn using Python, Scraping Intelligence is happy to help, and we will not let you down. This blog shows how to extract the details of a LinkedIn company page.

Here are the basic steps to extract LinkedIn data:

  1. Download and install an up-to-date Python version
  2. Copy and run the provided code

Below are the data fields that we extract for a LinkedIn company profile:

  • Name of the Company
  • LinkedIn Website
  • LinkedIn Description
  • Founded Date
  • Address – City, Street, Country, Zip.
  • LinkedIn Specialties
  • Total Number of Followers
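The fields above end up as keys of one dictionary per company. As a hypothetical sketch (the key names are assumptions that mirror the script later in this post), an empty record might look like this, with None wherever LinkedIn does not expose a value:

```python
# Hypothetical shape of one extracted record; keys are assumed to mirror
# the scraper script shown later in this post.
record = {
    'company_name': None,
    'website': None,
    'description': None,
    'founded': None,
    'city': None,
    'street': None,
    'country': None,
    'zip': None,
    'specialities': None,
    'follower_count': None,
}
print(sorted(record))
```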

Why Extract LinkedIn?

  1. Job Search Automation – If you shortlist companies to apply to, the list will not stay small; it quickly grows into an enormous database. You would want a tool like Google Finance that lets you filter firms based on your criteria. By scraping LinkedIn posts and company pages into an organized format, you can build a remarkable analysis tool of your own.
  2. Curiosity – You may simply be curious about the organizations on LinkedIn and want a clean, well-structured set of information to satisfy that curiosity.

In this blog, we will walk through simple steps to extract data from LinkedIn organization pages, like the LinkedIn company profile scraping offered by Scraping Intelligence.

Wish to extract LinkedIn data?

Request a Quote!

Fundamentals of LinkedIn Extraction:

For this blog, exactly as we did for the Amazon extractor, we stick to plain Python with two packages – LXML and requests. We do not use the more complex Scrapy framework here.

You will have to install these:

  • Python 3, available at https://www.python.org/downloads/
  • Python Requests, with install instructions at http://docs.python-requests.org/en/master/user/install/. You will also need Python pip to install it, available at https://pip.pypa.io/en/stable/installing/
  • Python LXML (find out how to install it at http://lxml.de/installation.html)
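Once requests and lxml are installed, the core workflow of the scraper is: fetch a page, parse it with lxml, and pull out the JSON that LinkedIn embeds in `<code>` elements via an XPath query. Here is a minimal offline sketch of that workflow, using an inline HTML string as a stand-in for a fetched page (the element id matches the one the script queries; the JSON content is invented for illustration):

```python
from lxml import html
import json

# Stand-in for a fetched LinkedIn company page: the company data sits as
# JSON inside a <code> element, which the scraper locates by id.
sample_page = """
<html><body>
  <code id="stream-promo-top-bar-embed-id-content">
    {"companyName": "Example Co", "size": "11-50 employees"}
  </code>
</body></html>
"""

# Parse the HTML, select the embedded text with XPath, then decode it.
doc = html.fromstring(sample_page)
texts = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
data = json.loads(texts[0])
print(data["companyName"])
```

The real script applies these same calls to the response body returned by requests.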

LinkedIn Extractor Using Python

The code to build your own Python LinkedIn extractor is given below. In case you are unable to read the Python code for LinkedIn company profile scraping embedded in this post, you can download it from GIST.

from lxml import html
from time import sleep
import json

import requests


def linkedin_companies_parser(url):
    for attempt in range(5):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
            }
            print("Fetching:", url)
            response = requests.get(url, headers=headers, verify=False)
            # LinkedIn embeds the company data as JSON inside HTML comments,
            # so strip the comment markers before parsing.
            formatted_response = response.content.decode('utf-8', errors='ignore') \
                                                 .replace('<!--', '').replace('-->', '')
            doc = html.fromstring(formatted_response)
            datafrom_xpath = doc.xpath(
                '//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
            if datafrom_xpath:
                try:
                    json_formatted_data = json.loads(datafrom_xpath[0])
                    headquarters = json_formatted_data.get('headquarters') or {}
                    street1 = headquarters.get('street1')
                    street2 = headquarters.get('street2')
                    street = ', '.join(p for p in (street1, street2) if p) or None
                    return {
                        'company_name': json_formatted_data.get('companyName'),
                        'size': json_formatted_data.get('size'),
                        'industry': json_formatted_data.get('industry'),
                        'description': json_formatted_data.get('description'),
                        'follower_count': json_formatted_data.get('followerCount'),
                        'founded': json_formatted_data.get('yearFounded'),
                        'website': json_formatted_data.get('website'),
                        'type': json_formatted_data.get('companyType'),
                        'specialities': json_formatted_data.get('specialties'),
                        'city': headquarters.get('city'),
                        'country': headquarters.get('country'),
                        'state': headquarters.get('state'),
                        'street': street,
                        'zip': headquarters.get('zip'),
                        'url': url,
                    }
                except ValueError:
                    print("cant parse page", url)
            # Retry in case of captcha or login page redirection
            if len(response.content) < 2000 or "trk=login_reg_redirect" in url:
                if response.status_code == 404:
                    print("linkedin page not found")
                else:
                    raise ValueError('redirecting to login page or captcha found')
        except Exception:
            print("retrying:", url)
            sleep(5)


def readurls():
    companyurls = ['https://www.linkedin.com/company/tata-consultancy-services']
    extracted_data = []
    for url in companyurls:
        extracted_data.append(linkedin_companies_parser(url))
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)


if __name__ == "__main__":
    readurls()

You must modify the URL in the line that defines companyurls, or add extra URLs to the list, separated by commas.
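For example, a companyurls list covering several company pages might look like this (the extra URLs are hypothetical additions; each one is handed to the parser in turn by readurls()):

```python
# Hypothetical example: multiple LinkedIn company page URLs in one list.
companyurls = [
    'https://www.linkedin.com/company/tata-consultancy-services',
    'https://www.linkedin.com/company/walmart',
    'https://www.linkedin.com/company/cisco',
]
print(len(companyurls))
```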

Save the file and run it with python filename.py.

The result will be saved in data.json in the same directory, and will look like this:

{
        "website": "http://www.websitescraper.com", 
        "description": "Scraping Intelligence is company based in USA offering affordable price. web scraping, data extraction, Products Website Scraper, Coupon Data Extractor, Amazon Product Scraping, Data Mining Service etc. ", 
        "founded": 2009, 
        "street": null, 
        "specialties": [
            "Web Scraping Service Provider", 
            "Data extraction Service",
            "Web scraping API", 
            "Web crawling", 
            "Data Mining Services", 
            "Python", 
            "DaaS"
        ], 
        "size": "51-200 employees", 
        "city": "Houston",
        "zip": null, 
        "url": "https://www.linkedin.com/company/scraping-intelligence/", 
        "country": null, 
        "industry": "Information & Technology Services", 
        "state": "Texas", 
        "company_name": "Scraping Intelligence", 
        "follower_count": 41, 
        "type": "Privately Held"
    }
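Since data.json holds a plain list with one dictionary per company URL, the standard json module loads it directly. Here is a small sketch of inspecting the output; a stand-in record is written first so the snippet runs on its own, mirroring how readurls() saves its results:

```python
import json

# Write a stand-in record the same way readurls() does, then read it back.
sample = [{"company_name": "Scraping Intelligence", "founded": 2009}]
with open('data.json', 'w') as f:
    json.dump(sample, f, indent=4)

with open('data.json') as f:
    records = json.load(f)
print(records[0]["company_name"])
```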

Or, if the code is run for the Walmart page:

companyurls = ['https://www.linkedin.com/company/walmart']

The result will look like this:

{
        "website": "www.Walmart.com",
        "description": "Fifty years ago, Sam Walton started a single mom-and-pop shop and transformed it into the world’s biggest retailer. Since those founding days, one thing has remained consistent: our commitment to helping our customers save money so they can live better. Today, we’re reinventing the shopping experience and our associates are at the heart of it. When you join our Walmart family of brands (Sam's Club, Jet.com, Hayneedle, Modcloth, Moosejaw and many more!), you’ll play a crucial role in shaping the future of retail, improving millions of lives around the world.", 
        "founded": 1962, 
        "street": ", ", 
        "specialties": [
            "Retail", 
            "Technology", 
            "Transportation", 
            "Logistics", 
            "Marketing", 
            "Merchandising", 
            "Health & Wellness", 
        ], 
        "size": "10,001+ employees", 
        "city": "Bentonville", 
        "zip": "", 
        "url": "https://www.linkedin.com/company/walmart/", 
        "country": "United States", 
        "industry": "Retail", 
        "state": "Arkansas", 
        "company_name": "Walmart", 
        "follower_count": 3019194,
        "type": "Public Company"
    }

If you need professionals who can help you scrape difficult websites, contact Scraping Intelligence with any queries!