LinkedIn is the leading professional social network and one of the best sources of data about jobs, people, posts, and companies. It has around 706 million members. By using web scraping, you can collect this data for your own analysis. If you want to learn how to scrape LinkedIn using Python, our services will not let you down. This blog shows how to extract the details of a LinkedIn company page.
In this blog, we walk through simple steps for extracting data from LinkedIn company pages, the same approach used in LinkedIn Company Profile Scraping by Scraping Intelligence.
As with our Amazon Extractor, we stick to plain Python and two lightweight packages – lxml and requests. We do not use the more complex Scrapy framework in this blog.
You will have to install these Python packages first: lxml and requests (for example, with pip install lxml requests).
The code below builds your own Python LinkedIn extractor. If you cannot read the Python code for LinkedIn Company Profile Scraping Using Python inline below, you can download it from GIST.
from lxml import html
import json
import requests
from time import sleep


def linkedin_companies_parser(url):
    # Try up to five times; LinkedIn may redirect to a login page or captcha
    for i in range(5):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
            }
            print("Fetching:", url)
            response = requests.get(url, headers=headers, verify=False)
            doc = html.fromstring(response.content)

            # The company details are embedded as JSON inside a <code> element
            datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
            # Fallback sections that sometimes hold the company JSON (not used further here)
            content_about = doc.xpath('//code[@id="stream-about-section-embed-id-content"]')
            if not content_about:
                content_about = doc.xpath('//code[@id="stream-footer-embed-id-content"]')

            if datafrom_xpath:
                try:
                    json_formatted_data = json.loads(datafrom_xpath[0])
                    headquarters = json_formatted_data.get('headquarters', {})
                    street1 = headquarters.get('street1')
                    street2 = headquarters.get('street2')
                    street = ', '.join(part for part in (street1, street2) if part) or None

                    data = {
                        'company_name': json_formatted_data.get('companyName'),
                        'size': json_formatted_data.get('size'),
                        'industry': json_formatted_data.get('industry'),
                        'description': json_formatted_data.get('description'),
                        'follower_count': json_formatted_data.get('followerCount'),
                        'founded': json_formatted_data.get('yearFounded'),
                        'website': json_formatted_data.get('website'),
                        'type': json_formatted_data.get('companyType'),
                        'specialties': json_formatted_data.get('specialties'),
                        'city': headquarters.get('city'),
                        'country': headquarters.get('country'),
                        'state': headquarters.get('state'),
                        'street': street,
                        'zip': headquarters.get('zip'),
                        'url': url,
                    }
                    return data
                except Exception:
                    print("cant parse page", url)

            # Retry in case of captcha or login page redirection
            if len(response.content) < 2000 or "trk=login_reg_redirect" in url:
                if response.status_code == 404:
                    print("linkedin page not found")
                else:
                    raise ValueError('redirecting to login page or captcha found')
        except Exception:
            print("retrying:", url)
            sleep(1)  # brief pause before the next attempt


def readurls():
    companyurls = ['https://www.linkedin.com/company/tata-consultancy-services']
    extracted_data = []
    for url in companyurls:
        extracted_data.append(linkedin_companies_parser(url))
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)


if __name__ == "__main__":
    readurls()
To scrape a different company, modify the URL in the companyurls list inside readurls(), or add extra URLs to the list, separated by commas, for example:
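The sketch below extracts several companies in one run; the URLs are just the ones used elsewhere in this post, so substitute your own:

companyurls = [
    'https://www.linkedin.com/company/tata-consultancy-services',
    'https://www.linkedin.com/company/scraping-intelligence',
    'https://www.linkedin.com/company/walmart'
]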
Save the file and run it with Python: python filename.py
The result will be saved in data.json in the same directory and will look like this:
{ "website": "http://www.websitescraper.com", "description": "Scraping Intelligence is company based in USA offering affordable price. web scraping, data extraction, Products Website Scraper, Coupon Data Extractor, Amazon Product Scraping, Data Mining Service etc. ", "founded": 2009, "street": null, "specialties": [ "Web Scraping Service Provider", "Data extraction Service", "Web scraping API", "Web crawling", "Data Mining Services", "Python", "DaaS" ], "size": "51-200 employees", "city": Houston, "zip": null, "url": "https://www.linkedin.com/company/scraping-intelligence/", "country": null, "industry": "Information & Technology Services", "state": Texas, "company_name": "Scraping Intelligence", "follower_count": 41, "type": "Privately Held" }
Or, if the code is run for Walmart's LinkedIn page:
companyurls = ['https://www.linkedin.com/company/walmart']
The result will look like this:
{ "website": "www.Walmart.com", "description": "Fifty years ago, Sam Walton started a single mom-and-pop shop and transformed it into the world’s biggest retailer. Since those founding days, one thing has remained consistent: our commitment to helping our customers save money so they can live better. Today, we’re reinventing the shopping experience and our associates are at the heart of it. When you join our Walmart family of brands (Sam's Club, Jet.com, Hayneedle, Modcloth, Moosejaw and many more!), you’ll play a crucial role in shaping the future of retail, improving millions of lives around the world.", "founded": 1962, "street": ", ", "specialties": [ "Retail", "Technology", "Transportation", "Logistics", "Marketing", "Merchandising", "Health & Wellness", ], "size": "10,001+ employees", "city": "Bentonville", "zip": "", "url": "https://www.linkedin.com/company/walmart/", "country": "United States", "industry": "Retail", "state": "Arkansas", "company_name": "Walmart", "follower_count": 3,019,194, "type": "Public Company" }
If you need professionals who can help you scrape difficult websites, contact Scraping Intelligence with any queries!