Sunday, October 1, 2017

Web Scraping

Update: This describes very basic web scraping. An updated post, Web Scraping 2, describes some intermediate functionality.

Something I've wanted to learn for some time is how to automate a web browser to scrape data from websites. By way of a use case, consider a website that holds a collection of data you want, but getting at each item requires a couple of mouse clicks and a wait for the page to load. Then you have to copy the data you want and paste it into some other medium that holds the entire set (a text file or spreadsheet, perhaps). Doing this by hand is tedious and horribly inefficient. If, however, we can automate the browser, we can have it do the work for us and come back when it's all done.

As an example, I used the Costilla County, CO Assessor's website. On this site, given a set of parcel numbers, you can obtain a good deal of information about a piece of property. First, I'll describe the environment, then we'll look at the code.

THE ENVIRONMENT


I used a Linux PC for this exercise, and Mozilla Firefox (v. 55.0.2). In addition, I installed the following packages:
  • pip - the python package installer, required to install Selenium.
  • Selenium - the library that allows python to interact with the browser.
  • geckodriver - the interface between Mozilla Firefox and Selenium.
  • Chromium - used to run the Chrome extension that generates XPath references for hard-to-locate data items (more on this below).

Pip is installed with yum/rpm or apt, depending upon your distribution. Selenium is installed with pip. geckodriver is installed by simply unpacking the geckodriver compressed file and copying it to a location in your environment's path (I just dropped it in /usr/local/bin). Obviously, you'll need python; version 2.7 came pre-installed with my distribution, so I just used that. Finally, I created a directory to keep all my files together: my script (scrape.py), my source data (a list of parcel numbers), and geckodriver's log (created automatically by geckodriver in the directory where the script is executed).
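The exact installation commands vary by distribution, but on my system the setup looked roughly like this (the geckodriver version is a placeholder; grab whatever release is current):

$ sudo yum install python-pip        # or: sudo apt-get install python-pip
$ sudo pip install selenium
$ tar -xzf geckodriver-vX.Y.Z-linux64.tar.gz
$ sudo cp geckodriver /usr/local/bin/

We execute the script as you might expect: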
$ python scrape.py

THE SCRIPT


# 1. Imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re

# 2. Do some prep work to set up the environment
driver = webdriver.Firefox()

# 3. Setup Input and Output files
with open('numbers.list') as infile:
    parcellist = infile.read().splitlines()
outfile = open('datafile.csv','w')
outfile.write('"Parcel No.","Size","Unit","Assessed","Actual","Property Type","Legal Summary"\n')

# 4. Load the County page and get past the guest login
driver.get("http://69.160.37.111/assessor/taxweb/search.jsp")
driver.find_element_by_name('submit').click()

# 5. Process each parcel number in the list in a loop
for parcelNumber in parcellist:

    # 6. drop a short delay before checking for the parcel #
    time.sleep(3) 

    # 7. First action, find the field named ParcelNumberID
    parcelbox = driver.find_element_by_name('ParcelNumberID')

    # 8. paste the current number from our list
    parcelbox.send_keys(parcelNumber)

    # 9. Bring up the data page
    parcelbox.submit()
    time.sleep(1)
    driver.find_element_by_class_name("clickable").click()
   
    # 10. Capture data items
    actual = driver.find_element_by_xpath('//*[@id="middle"]/table/tbody/tr[2]/td[3]/table[2]/tbody/tr[2]/td[2]')
    newactual = re.sub('[$,]', '', actual.text)
    legal = driver.find_element_by_xpath('//*[@class="accountSummary"]/tbody/tr[2]/td/table/tbody/tr[4]/td')
    legalSummary = re.sub('Legal Summary ', '', legal.text)
    units = driver.find_element_by_xpath('//*[@class="accountSummary"]/tbody/tr[2]/td[3]/table[2]/tbody/tr[2]/td[4]')
    propertyType = driver.find_element_by_xpath('//*[@class="accountSummary"]/tbody/tr[2]/td[3]/table[2]/tbody/tr[2]/td[1]')
    assessed = driver.find_element_by_xpath('//*[@class="accountSummary"]/tbody/tr[2]/td[3]/table[2]/tbody/tr[2]/td[3]')
    assessedValue = re.sub('[$,]', '', assessed.text)
    unitOfMeasure = driver.find_element_by_xpath('//*[@id="middle"]/table/tbody/tr[2]/td[3]/table[2]/tbody/tr/th[4]')
   
    # 11. Output data
    stringData = '"' + parcelNumber + '",' + units.text + ',"' + unitOfMeasure.text + '",' + assessedValue + ',' + newactual + ',"' + propertyType.text + '","' + legalSummary + '"\n'
    outfile.write(stringData)

    # 12. Back to main search page
    driver.find_element_by_link_text('Account Search').click()

# 13. Cleanup
outfile.close()


CODE WALK-THROUGH

1. We need to import modules for all of the work we're doing.

2. We create a WebDriver object with which we interact. This is the object we use to send commands to the browser through Selenium, and to get information back from it.
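As an aside, if you'd rather not watch the browser window while the script runs, newer Firefox builds can run headless. A minimal sketch, assuming Selenium 3.x and Firefox 55 or later (not something the script above does):

# Hypothetical variation: run Firefox with no visible window.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')              # Firefox's headless flag
driver = webdriver.Firefox(firefox_options=options)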

3. This section reads the set of parcel numbers from our file and creates a list object. We'll iterate through that object further down to get the data. We also create a file into which we'll write the data, and start it off with a header row naming the data items we wish to capture.

4. Now we start the real work. This section tells the browser to navigate to the starting webpage. On that page, you have the option to log in as a guest, or as a user with an account (presumably fee-based). We're going to log in as a guest, so we locate the login submit button by name and tell the browser to click it.

Locating HTML Objects

This is probably a good time to talk about how we tell the browser to do something with one item in the page versus another. I used Mozilla's Inspector tool (accessible from the menu). This tool allows you to view the page source, and it highlights the item in the page corresponding to the line your cursor is sitting on in the source. By looking at the HTML tags around the highlighted line at the bottom of the screen, you may be able to pick out a 'name' or 'id' attribute.

Using the Inspector tool to locate the attributes describing an item on the page.


If so, that makes it easy to pass into Selenium. In our test, the Login button has the name "submit", so that's the name we used to issue a click() action against.
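For reference, these are the locator helpers the script relies on; each returns the first matching element, or raises a NoSuchElementException if nothing matches:

# The element locators used throughout this script, shown with
# the actual names, ids, and classes from the assessor's pages.
driver.find_element_by_name('submit')               # name attribute
driver.find_element_by_id('ParcelNumberID')         # id attribute
driver.find_element_by_class_name('clickable')      # class attribute
driver.find_element_by_link_text('Account Search')  # visible link text
driver.find_element_by_xpath('//*[@id="middle"]')   # XPath reference (abbreviated; more below)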

5. Now we iterate through the list, copying each parcel number from our input file into the variable parcelNumber.

6. We drop in a sleep command to pause the script for just a moment. Without this, we get an error that the item we're looking for doesn't exist on the page. Essentially, we're checking for the parcel number box before the page has fully loaded, so we slow things down just a bit.
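A fixed sleep works, but it always waits the full three seconds even when the page loads faster. The WebDriverWait and expected_conditions imports at the top of the script exist for exactly this situation; here is a sketch of how steps 6 and 7 could be combined into an explicit wait (the 10 is a timeout ceiling in seconds, not a fixed delay):

# Wait up to 10 seconds for the parcel number box to appear,
# then continue immediately once it does.
parcelbox = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, 'ParcelNumberID'))
)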

7. We create an object (parcelbox) that represents the ParcelNumberID field in the webpage. Again, we used the Inspector tool to find the name. We could just as easily have used the driver.find_element_by_id function here, as the page designer set both the field's name and id attributes to 'ParcelNumberID'.

8. In this step, we insert the current parcel number (parcelNumber variable) into the text field.

9. Now, we submit the search form. That brings up a page listing all of the properties matching the search criteria (searching on things other than the parcel number key could return multiple items). Each item in this list is tagged as being in the 'clickable' class. Since we're searching by parcel number, we only ever get one result back, so we don't need to worry about multiple 'clickable' class objects; we just click the only 'clickable' element on the page, which is a link to the parcel's individual page.
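If you were searching on something that could return several matches, the plural find_elements form would be the way to handle it; a hypothetical sketch:

# Hypothetical: collect every 'clickable' result instead of just
# the first, then pick one (or loop over all of them).
results = driver.find_elements_by_class_name('clickable')
print('found %d matching properties' % len(results))
results[0].click()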

10. Now we have a page full of data items, some of which we want to capture. Looking at the Inspector tool, these items are deeply embedded in multiple levels of tags, and their name and id attributes are not set. This presents a problem, so we're going to leverage another locating mechanism Selenium supports: XPath.

Obtaining XPath References

XPath is a standard for referencing individual items in a web page, which makes it the perfect tool for locating a specific cell of data in a large page with no uniquely identifying attributes. Unfortunately, XPath syntax is fairly complicated, so we use a Chrome extension, Xpath Generator 3.0.0, to build the references for us.

Once it's installed, navigate to the county assessor page, search for a property, and bring up its data page. Click the Xpath icon in the browser, then click on the item you want to locate, and Xpath Generator will give you one or more XPath references to that item. Click on each reference to find the one that best locates the data you're after: some will highlight just the data itself, while others will highlight the table or row it sits in.

There may be multiple XPaths to the same data item; choose the one that seems best. By 'best' I mean one that can be used over and over in our script. If the reference matches on the text currently in the field, that's a bad choice: that text will likely change on the next iteration, and no hit would be found. A reference that uses row, column, or cell numbers is a better choice, as the data will likely always be presented in the same place on the page for each property we search for.
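To make the difference concrete, here are two references of the kind Xpath Generator might offer for the same cell (the dollar figure is made up); the first breaks as soon as the value changes, while the second keeps working from parcel to parcel:

# Bad: keyed to the cell's current text, so the next parcel won't match.
actual = driver.find_element_by_xpath('//td[text()="$1,250"]')
# Good: keyed to position in the table, which is the same on every page.
actual = driver.find_element_by_xpath('//*[@id="middle"]/table/tbody/tr[2]/td[3]/table[2]/tbody/tr[2]/td[2]')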

Back to step 10. We create variables and assign them the contents of each data item in the page that we are interested in. For some data items, we do a bit of cleanup: currency amounts are captured, then any dollar signs or commas are stripped out, which makes the amounts much easier to work with.
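The cleanup itself is a one-line regular expression substitution; for example (the amount is made up):

>>> re.sub('[$,]', '', '$12,345')
'12345'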

11. Once we have all of our data items, we assemble them into a string (inserting commas and quotes so the file ends up being a .csv file), and write the string to our output file under the header row.
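As an aside, python's standard csv module can handle the commas and quoting for us; a minimal sketch of the same output step, assuming the variables from the script above:

import csv

writer = csv.writer(outfile)
writer.writerow([parcelNumber, units.text, unitOfMeasure.text,
                 assessedValue, newactual, propertyType.text, legalSummary])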

12. This is the last step in the loop. We click the Account Search link to go back to the page we were on when the for loop started, and the entire process repeats until there are no more parcel numbers to search for.

13. Last step: always perform cleanup by closing out files. Yes, the python interpreter will detect any open files and close them on exit, but it's a good habit to explicitly close any file you open. (We did not close the input file ourselves, because the way we opened it (with <variable> as file:) closes it automatically when the block ends, so an explicit close is unnecessary.)
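The same pattern could cover the output file too, which would make step 13 unnecessary; a sketch of that restructuring:

# Hypothetical: open the output file in a with block so it closes
# itself when the block ends, even if the script dies mid-run.
with open('datafile.csv', 'w') as outfile:
    outfile.write('"Parcel No.","Size","Unit","Assessed","Actual","Property Type","Legal Summary"\n')
    # ... the entire scraping loop would live inside this block ...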

So that's it. We now have a csv file full of data items that we can do something else with.

References

Xpath Tutorial
Page on how to locate data items with Selenium