How to Web Scrape Corporate Profiles with Python

Editor’s Note: this post presumes that the reader has at least a passing knowledge of the Python programming language and the HTML markup language. If that does not apply to you, consider first looking at the crash courses on Python and HTML at Python For Beginners and Beginners’ Guide to HTML.

This post will explain how to use Python to scrape employee profiles from corporate websites. This is the first project in this website’s series of posts on web scraping and provides a good primer for the subject matter. Subsequent posts explain increasingly advanced projects.

Web scraping employee profiles on company websites is basically an advanced version of copy and paste. Below is a very simple Python script written for scraping employee profiles, and it can be applied to many corporate websites with only minor edits to the tags and classes it looks for.

To use this script, you must provide the URL for your webpage and identify the HTML elements that contain the employee profile information that you want to scrape. Run the script and it will produce a CSV file with the scraped information.

What is an HTML element?

I believe this is best answered by Wikipedia:

This means that any piece of text in a website is part of an HTML element and is encapsulated in an HTML tag. If you right click on a webpage and choose to view the HTML, you will see a simple bit of HTML code before and after each segment of text. See the diagram below showing different parts of an HTML element for a line of text that simply reads “This is a paragraph.”

In order to scrape the content “This is a paragraph.” using Python, you will identify the content based on its HTML tag. So instead of telling your script to scrape the text that reads “This is a paragraph.”, you tell it to scrape the text with the “p” tag.

However, there could be other elements in the page with the same tag, in which case you would need to include more information about the element (such as its class) to specify the one you wanted. Alternatively, if there were 1,000 elements with the same tag and you wanted to scrape all of them (imagine a list of 1,000 relevant email addresses), you can just tell your script to scrape the text with the “p” tag and it will get all 1,000 of them.
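
As a minimal sketch of the difference between scraping one element, scraping every element with a tag, and narrowing the match with extra information, the snippet below parses a small made-up HTML string with BeautifulSoup, the same library used in the script later in this post. The sample markup, the “email” class, and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
sample_html = """
<html><body>
  <h1>Contact list</h1>
  <p>This is a paragraph.</p>
  <p class="email">alice@example.com</p>
  <p class="email">bob@example.com</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first element with the "p" tag
first_paragraph = soup.find("p").text

# find_all() returns every element with the "p" tag
all_paragraphs = [p.text for p in soup.find_all("p")]

# Adding more information (here, a class) narrows the match
emails = [p.text for p in soup.find_all("p", class_="email")]

print(first_paragraph)
print(all_paragraphs)
print(emails)
```

The same three calls work whether the page has three matching elements or a thousand.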

Scraping Profiles

1 – The first step is to find the webpage where the employee profiles reside (for the sake of simplicity, we will only use one webpage in our example today). Copy and paste this URL into the script as the value assigned to the “url” variable. For my test case, I will use the webpage for the ExxonMobil Management Committee (https://corporate.exxonmobil.com/Company/Who-we-are/Management-Committee). I chose a page with only a few profiles for the sake of simplicity, but you can use any profile page regardless of how many profiles are listed on it.

2 – The next step is to choose what information to scrape. For my example I will choose only the employees’ names and positions.

3 – To scrape a specific kind of data (such as job titles) repeatedly, you need to find out how to identify that information in the HTML. This is essentially how you will tell your script which info to scrape.

If you go to my example webpage, you will see that the job title on each employee profile looks similar (text font and size, formatting, etc.). That is, in part, because there are common characteristics in the HTML elements associated with the job titles on the webpage. So you identify those common characteristics and write in the Python web scraping script “get every HTML element with this specific characteristic and label them all as ‘Job Titles’.”

4 – Here is the tricky part: you need to identify this information based on its location in the HTML framework of the website. To do this, you find where the relevant information is located on the webpage. In my example (see the photo below), I want to scrape the employees’ position titles. The first profile on the page is Mr. Neil Chapman, Senior Vice President. So I need to figure out how to identify the location of the words “Senior Vice President” in the website’s HTML code. To do this, I right-click on the words “Senior Vice President” and choose “inspect.” Every browser has its own version of this, but the option should include the term “inspect.” This will open up a window in my browser that shows the HTML and highlights the item I clicked on (“Senior Vice President”) in the HTML code. The photo below shows that clicking on that text in the website reveals that the same text is located within the HTML framework between the “<h4>” tags.
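
Once the inspector has shown you the tag, you can sanity-check it against a small piece of HTML before touching the live page. The sketch below uses a simplified stand-in for the page's markup, built from the tags and classes the script in this post looks for; the second profile (“Jane Doe”) is made up, and the real page contains far more markup than this.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page's HTML; "Jane Doe" is a made-up profile
snippet = """
<div class="article--wrapper">
  <section class="articleMedia articleMedia-half">
    <h2 class="table-of-contents">Neil Chapman</h2>
    <h4>Senior Vice President</h4>
  </section>
  <section class="articleMedia articleMedia-half">
    <h2 class="table-of-contents">Jane Doe</h2>
    <h4>Vice President</h4>
  </section>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Every h4 in the snippet: the common characteristic shared by the job titles
titles = [h4.get_text(strip=True) for h4 in soup.find_all("h4")]
print(titles)

# Searching inside one profile section keeps each title next to its name
first_section = soup.find("section")
pair = (
    first_section.find("h2", class_="table-of-contents").get_text(strip=True),
    first_section.find("h4").get_text(strip=True),
)
print(pair)
```

Searching section by section, rather than collecting all names and all titles separately, is what keeps a name and its title on the same row even if one profile is missing a field.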

In our script below you will see that on line number 12, we identify that the position title is located in the text of the h4 tag, and on line number 11 that the matching employee name is located in the text of the h2 tag with class = “table-of-contents”.

Then, for a test case, you run the script below:

import requests
from bs4 import BeautifulSoup
import csv
rows = []
url = 'https://corporate.exxonmobil.com/Company/Who-we-are/Management-Committee'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
content = soup.find("div", class_="article--wrapper")
sections = content.find_all("section", class_="articleMedia articleMedia-half")
for section in sections:
    name = section.find("h2", class_="table-of-contents").text
    position = section.find("h4", class_=[]).text
    row = {"name": name,
           "position": position
           }
    rows.append(row)
# newline="" keeps the csv module from writing blank rows on Windows
with open("management_board.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["name", "position"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

The result is a CSV file, saved in the folder you ran the script from (usually the same folder where the script is saved), that will have this information:

Okay so that obviously was a bit of work for something that you could have just copied and pasted. Why should you still be interested?

For starters, now you can identify any information in the profiles and add it to the script. Identify the location of the information, add it as a new variable inside the “for section in sections:” loop alongside “name” and “position”, include it in the “row” dictionary, and add whatever title you want for it to “fieldnames”.
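
As a rough sketch of that change, the snippet below adds a third field to the loop. The “bio” paragraph, its class, and the sample profile are hypothetical and exist only for illustration; on a real page you would use whatever tag and class the inspector shows you. An in-memory buffer stands in for the CSV file so the example writes nothing to disk.

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up markup; the "bio" paragraph and its class are hypothetical
snippet = """
<section class="articleMedia articleMedia-half">
  <h2 class="table-of-contents">Jane Doe</h2>
  <h4>Vice President</h4>
  <p class="bio">Joined the company in 2001.</p>
</section>
"""

soup = BeautifulSoup(snippet, "html.parser")
rows = []
for section in soup.find_all("section"):
    name = section.find("h2", class_="table-of-contents").text
    position = section.find("h4").text
    # New variable for the extra piece of information
    bio_tag = section.find("p", class_="bio")
    bio = bio_tag.text if bio_tag else ""  # tolerate profiles without a bio
    rows.append({"name": name, "position": position, "bio": bio})

# The new key must also appear in fieldnames, or DictWriter raises ValueError
fieldnames = ["name", "position", "bio"]
output = io.StringIO()  # stands in for the CSV file
writer = csv.DictWriter(output, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

The check for a missing tag matters more as you add fields, because find() returns None when nothing matches and calling .text on None raises an error.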

Furthermore, this same script, without any alterations, would also work if there were one thousand profiles on that page.

Finally, this is a very effective method for webpages that you cannot copy and paste at all. For example, imagine you run into the kind of Employee Profiles page that, according to Bluleadz, is increasingly popular. Notice how the page requires that you run your cursor over a team member to see their information?

The aforementioned method of web scraping can scrape all of the unseen profile information in one quick go and present it in a friendly format, as long as that information is present in the page’s HTML and not loaded separately only when you hover.

As webpage design continues to develop, these kinds of techniques will prove invaluable.
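
To illustrate why hidden profile details are still reachable, here is a minimal sketch using a made-up team page fragment in which the details sit inside an overlay that is hidden until hover. The markup, class names, and profile are all assumptions for illustration, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Made-up team page: the details are in the HTML but hidden until hover
snippet = """
<div class="team-member">
  <img src="jane.jpg" alt="Jane Doe">
  <div class="overlay" style="display: none;">
    <span class="name">Jane Doe</span>
    <span class="role">Head of Design</span>
  </div>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

profiles = []
for member in soup.find_all("div", class_="team-member"):
    # The parser ignores styling, so hidden elements are found like any other
    overlay = member.find("div", class_="overlay")
    profiles.append({
        "name": overlay.find("span", class_="name").get_text(strip=True),
        "role": overlay.find("span", class_="role").get_text(strip=True),
    })
print(profiles)
```

BeautifulSoup reads the markup rather than the rendered page, so whether an element is visible on screen makes no difference to what it can find.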
