Web-scraping with Google Sheets and XPath

One of the more advanced features available to Google users, is the user’s ability to scrape data – from the web – directly into a Google spreadsheet through utilizing the ‘=IMPORTXML‘ function, combined with XPath.

The script to follow outlines this functionality by scraping a simple data table from a public website and loading the results into a spreadsheet .

In practice, I utilize this functionality quite extensively in searching for tenders and sales opportunities from public tender websites. This saves me from spending countless hours sifting through multiple pages to find opportunities. With this function I’m able to ‘aggregate’ information from the various pages into one sheet, making keyword searches a breeze.

The application of this functionality is limited only by the access policies of the sites you wish to scrape and the “cleanliness” of the code used to develop the site.

(In order to scrape data with IMPORTXML and XPath, you need a basic understanding of HTML and CSS. You can stay tuned for an upcoming video tutorial that will explain this script, XPath, HTML and CSS in more detail.)

You can read more about the “IMPORTXML” syntax here.

In the meantime, feel free to copy the code below into your own spreadsheet and have some fun exploring the wonderful world of web-scraping 🙂

We will be scraping data from the following location:

We will extract the table below, from the page mentioned above:

In order to obtain the HTML parent and child elements/tags for the table above (which is required in order to complete the XPath query), navigate to the website, press F12 to open the developer tab (in Chrome), right click on the table, and the following developer information should be highlighted in the developer tab:

From the image above, we can see that the information (<tr>) we want is wrapped in <table> tags (that form a table element) with a class of “wikitable”. The information itself is wrapped table row <tr> tags. The <table> and <tr> elements as well as the class will be utilized in completing our XPath query.

Before we get started, just a little side note…

The ‘IMPORTXML’ function can be entered in one of two way:

  1. Directly into an empty cell in a sheet; or
  2. Via the assistance of a Google Script (as in our example below).

The reason for utilizing a Google Script to add the formula, is because we want the script to add the table multiple times without us having to “tell” Google Sheets where to add the table. The script below will add the table once and then “detect” the next open row and add the table again, and then once again “detect” the next open row and add the table again.

This automates the importing of data to a certain extent by making a “smart sheet”.

In the code to follow, I’ve taken the “long” route . There are shorter (and smarter ways) to achieve the same results, but more on that in later tutorials.

Let’s get started…

Step 1: Create a new Google Workbook (I’m old-school Excel, so I still refer to workbooks).

Step 2: Give one of the sheets in the workbook the name of ‘Get_Data’.

Step 3: Open the ‘Script editor’ located under ‘Tools’ in the ribbon.

Step 4: Clear the code in the script file and copy the code below:

//@OnlyCurrentDoc

/**************************************************************** Script Setup ****************************************************************/

// The following code will add a quick-access menu item to the ribbon
function onOpen() {
  // The code below can be written in a single line
  SpreadsheetApp.getUi()
      .createMenu('Scraping Tools')
      .addItem('Load Data', 'loadData')
      .addToUi();
}

// Initializes the 'onOpen' function
function onInstall(){
  onOpen ()
}

/**************************************************************** Step 1: Spreadsheet Setup ****************************************************************/

// First we have to select the Spreadsheet where the data is to be added.
// In this case the sheet we want to use is called 'Get_Data'
// If you want to run this script using another sheet within your workbook, make sure the name of the sheet you want to use has been entered below.
// The custom function that we will execute is called 'loadhData'

function loadData() {
  // 'Select' the sheet where where the data will be added
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Get_Data');
  // Clear all existing data on the sheet, but maintain formatting such as font color, shading, etc.
  sheet.clearContents();
  
/**************************************************************** Step 2: Add the data table ****************************************************************/
  
  // Create a heading for the first import of the table
  // Cell 'A1' is set as the first active cell in the 'Get_Data' sheet
  sheet.getRange("A1").setValue('Wiki Employment Table Data - Import 1');
  
  // Add the data from the Wiki Page
  // Add the 'IMPORTXML' formula in cell 'A2'
  sheet.getRange("A2").setFormula('=IMPORTXML("https://en.wikipedia.org/wiki/Employment","//table[@class=\'wikitable\']//tr")');
  
/**************************************************************** Step 3: Add the data table again ****************************************************************/  
  
  // Determine the last row in the 'Get_Data'sheet that contains data
  var lastRow = sheet.getLastRow();
  // Determine the first empty/open row
  var openRow = lastRow + 1;
  // Find the first open cell in the open row
  var openCell = ("A" + openRow);
  // Select the open cell
  sheet.getRange(openCell).activate();
  // Add a heading in the first open cell
  sheet.getActiveCell().setValue('Wiki Employment Table Data - Import 2');
  
  // Determine the last row in the 'Get_Data'sheet that contains data
  var lastRow = sheet.getLastRow();
  // Determine the first empty/open row
  var openRow = lastRow + 1;
  // Find the first open cell in the open row
  var openCell = ("A" + openRow);
  // Select the open cell
  sheet.getRange(openCell).activate();
  // Enter the 'IMPORTXML' formula in the first open cell.
  sheet.getActiveCell().setFormula('=IMPORTXML("https://en.wikipedia.org/wiki/Employment","//table[@class=\'wikitable\']//tr")');
  

/**************************************************************** Step 4: Add the data table again ****************************************************************/
// The code below is th same as the code above, just without comments  

  var lastRow = sheet.getLastRow();
  var openRow = lastRow + 1;
  var openCell = ("A" + openRow);
  
  sheet.getRange(openCell).activate();
  sheet.getActiveCell().setValue('Wiki Employment Table Data - Import 3');
  
  var lastRow = sheet.getLastRow();
  var openRow = lastRow + 1;
  var openCell = ("A" + openRow);
  
  sheet.getRange(openCell).activate();
  sheet.getActiveCell().setFormula('=IMPORTXML("https://en.wikipedia.org/wiki/Employment","//table[@class=\'wikitable\']//tr")');
}


Step 5: Save all changes and refresh the spreadsheet.

Step 6: Once the sheet has been refreshed, you should see a ‘Scraping Tools’ button appear on the ribbon. Click on ‘Scraping Tools’ -> ‘Load Data’ and the code should scrape information from the website used in this example, into your spreadsheet.

You might be prompted with the following screens the first time you execute the script:

  1. Click ‘Continue

2. Click on your user account details

3. Click ‘Allow’

4. Once the script has been executed correctly, your results should look something like this:

In the image above you can see that that table has been added three times with the custom headings that we specified in our script.

Something to note:

If you enter the IMPORTXML formula as part of a script, we enter it as follows:

sheet.getRange("A2").setFormula('=IMPORTXML("https://en.wikipedia.org/wiki/Employment","//table[@class=\'wikitable\']//tr")');


If we however enter it directly into a cell in our spreadsheet, we enter it as follows:

=IMPORTXML("https://en.wikipedia.org/wiki/Employment","//table[@class='wikitable']//tr")


Take note of the exclusion of the forward slashes ( \ ) in the second piece of code. We use the forward slashes in the first example to ‘exit out’ special characters like ‘ ‘. If we don’t add the forward slash symbol in the first piece of code, before the inverted commas ( \’), the formula will not execute/parse correctly.

That brings us to the end of this tutorial.

If you have any comments or suggestions, please feel free to leave a comment in the comment section below.

Until next time…

(The views and opinions expressed in this article are those of the authors and do not necessarily reflect the official policy or position of the author’s current employer. Examples of any analysis performed within this article or related media, are to be used for illustrative purposes only.)

Leave a Reply

Your email address will not be published. Required fields are marked *