IMPROVEMENTS AND CHANGING BUSINESS REQUIREMENTS
In my previous post on web scraping, I described a very basic script to cull data from the Internet. That was a pretty good first attempt: the web application was simple, and the business requirements didn't call for a great deal of coding. The follow-up, however, required some additional work, as the previous script could not handle many of the new issues that arose. In this script, we are working with data from the Apache County, Arizona Assessor's Office. While the web application is the same, there are some changes that require additional work. Specifically, we address the following changes:
- We need a better way to wait for elements to become available on the webpage. Adding time.sleep() statements is fine, but we can't always be sure that the number of seconds we specify will be sufficient, and the more time we specify, the longer the script takes to run, as time.sleep() always waits the full amount. We need something that waits only as long as necessary: if the page is ready in 0.75 seconds, then processing should continue after 0.75 seconds. If it takes too long, a timeout should interrupt the script.
- We need to add some logic to identify records that we do not want. If there are certain records that we are not interested in, we should break out of the processing for that record immediately, and continue to the next. This reduces the file size, but more importantly, it reduces the run time of the script and shortens the code path (leaving fewer places where an error can arise and cause an unexpected failure).
- We need a better XPath tool. After a recent update, the XPath generator tool we used in the last article stopped working.
- We need a way to handle the possibility of multiple items in a list that might be returned from a single search.
- The data we need might be split among multiple screens. We should be able to move between screens to capture all of the data we want.
- Some of the data we need may be in a frame. We need a way to be able to select the frame which contains the data we are looking for.
- We need a way to access objects which may not be visible on the screen.
- Sometimes we hit a parcel ID that is not (no longer?) in the system. This produces an error that will stop the scraper and throw an exception. We need to gracefully handle 'parcel not found' errors.
- Along with the previous item, it would be useful to log what takes place with each record from the original file. Adding a log file would allow us to record successes and failures, and the reason a given parcel could not be retrieved.
UPDATED CODE
The following code addresses each of the issues highlighted above. As with the previous post, we'll make notes in code, then explain in more detail below.
# 1. Imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import re
import csv
# 2. Create a webdriver object.
driver = webdriver.Firefox()
# 3. Setup Input and Output files
# Input file format is: AcctNo,ParcelNo.
with open('allkeys.csv', 'rb') as f:
    reader = csv.reader(f)
    parcelList = list(reader)
totItems = len(parcelList) # get the count of total items for status.
# 4. Change the mode of the output file from write to append.
outfile = open('datafile.csv','a')
outfile.write('"Account No.","Parcel No.","Actual Parcel No.","Tax Area","Legal Class","Unit of Meas.","Parcel Sz.","Short Owner Name","Address 1","City","State","Zip Code"\n')
logfile = open('ApacheCoScraper.log','a')
# 5. A counter is used to tell us how far along we are. It's used below.
iCntr = 0
# 6. Read a row from the input file - this contains 2 fields now.
for row in parcelList:
    # 7. Load the County page and get past the splash page
    driver.get("http://www.co.apache.az.us/eagleassessor/")
    time.sleep(3)
    driver.switch_to_frame(driver.find_element_by_tag_name("iframe"))
    # 8. Scroll the submit button into view so it can be clicked.
    submitElement = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, 'submit')))
    driver.execute_script("arguments[0].scrollIntoView(false);", submitElement)
    submitElement.click()
    # 9. Wait up to 15 seconds for elements to appear (applies to remainder of script)
    driver.implicitly_wait(15) # seconds
    # 10. Search for a parcel number, and bring up the Account Summary page
    accountNumber,parcelNumber = row
    parcelbox = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.NAME, 'ParcelNumberID')))
    parcelbox.send_keys(parcelNumber)
    parcelbox.submit()
    # 11. Try to find a warning message (indicates 'parcel not found').
    try:
        warningMsg = driver.find_element_by_class_name('warning')
    except NoSuchElementException:
        pass
    else:
        iCntr += 1
        print "%s (%d/%d) not found." % (parcelNumber, iCntr, totItems)
        logfile.write(parcelNumber + '(' + str(iCntr) + '/' + str(totItems) + ') not found.\n')
        continue
    # 12. Multiple entries may appear, select the one we need.
    try:
        # The lookup sits inside the try so that a missing link also triggers the fallback.
        validRowLink = driver.find_element_by_link_text(accountNumber)
        validRowLink.click()
    except Exception:
        driver.find_element_by_class_name("clickable").click()
    # 13. Assume we only want records with a type of "02.R" (land only). Ignore anything that is not 02.R.
    raw_Legal_Class = driver.find_element_by_xpath("//*[@id='middle']/table/tbody/tr[2]/td[3]/table[3]/tbody/tr[2]/td[1]")
    legalClass = raw_Legal_Class.text
    if legalClass != '02.R':
        iCntr += 1
        print "%s (%d/%d) skipped." % (parcelNumber, iCntr, totItems)
        logfile.write(parcelNumber + '(' + str(iCntr) + '/' + str(totItems) + ') skipped.\n')
        continue
    # 14. Capture Account Summary page data (the first page of data)
    raw_Account_Number = driver.find_element_by_xpath("//*[@id='middle']/h1[1]")
    raw_Parcel_Number = driver.find_element_by_xpath("//*[@id='middle']/table/tbody/tr[2]/td[1]/table/tbody/tr[1]/td[1]")
    raw_Tax_Area = driver.find_element_by_xpath("//*[@id='middle']/table/tbody/tr[2]/td[1]/table/tbody/tr[2]/td[1]")
    # 15. Capture the data items as python variables before moving to the next page.
    accountNumber = re.sub('Account:', '', raw_Account_Number.text).strip()
    actualParcelNumber = re.sub('Parcel Number', '', raw_Parcel_Number.text).strip() # actualParcelNumber has '-' marks in it.
    taxArea = re.sub('Tax Area', '', raw_Tax_Area.text).strip()
    # 16. Jump to the Parcel Detail tab
    PDPageLink = driver.find_element_by_link_text('Parcel Detail')
    PDPageLink.click()
    # 17. Obtain page data as webdriver objects
    raw_Unit_of_Measure = driver.find_element_by_xpath("//*[@id='middle']/div/span[6]")
    raw_Parcel_Size = driver.find_element_by_xpath("//*[@id='middle']/div/span[8]")
    # 18. Extract data from the webdriver objects, and place in python variables.
    unitOfMeasure = raw_Unit_of_Measure.text.strip()
    parcelSize = raw_Parcel_Size.text.strip()
    # 19. Jump to Owner Information tab
    OIPageLink = driver.find_element_by_link_text('Owner Information')
    OIPageLink.click()
    # 20. Capture Owner Information page data as webdriver objects
    raw_Owner_Short_Name = driver.find_element_by_xpath("//*[@id='middle']/div/span[2]")
    raw_Address1 = driver.find_element_by_xpath("//*[@id='middle']/div/span[6]/table/tbody/tr[1]/td/span[2]")
    raw_City = driver.find_element_by_xpath("//*[@id='middle']/div/span[6]/table/tbody/tr[3]/td[1]/span[2]")
    raw_State = driver.find_element_by_xpath("//*[@id='middle']/div/span[6]/table/tbody/tr[3]/td[2]/span[2]")
    raw_Zip = driver.find_element_by_xpath("//*[@id='middle']/div/span[6]/table/tbody/tr[3]/td[3]/span[2]")
    # 21. Extract data from webdriver objects, and place into python variables.
    ownerShortName = raw_Owner_Short_Name.text.strip()
    address1 = raw_Address1.text.strip()
    city = raw_City.text.strip()
    state = raw_State.text.strip()
    zipCode = raw_Zip.text.strip()
    # 22. Print the data items to our output file.
    stringData = '"' + accountNumber + '","' + parcelNumber + '","' + actualParcelNumber + '","' + taxArea + '","' + legalClass + '","' + unitOfMeasure + '",' + parcelSize + ',"' + ownerShortName + '","' + address1 + '","' + city + '","' + state + '","' + zipCode + '"\n'
    # print stringData
    outfile.write(stringData)
    # 23. Print a status message to the user.
    iCntr += 1
    print "%s (%d/%d) captured." % (parcelNumber, iCntr, totItems)
    logfile.write(parcelNumber + '(' + str(iCntr) + '/' + str(totItems) + ') captured.\n')
    # 24. Back to main search page
    driver.find_element_by_link_text('Account Search').click()
# 25. Cleanup
outfile.close()
logfile.close()
CODE WALK-THROUGH
1. Imports
These are the same as in the previous post, with the exception of importing the NoSuchElementException class. This class is used in step 11 to detect whether the 'parcel not found' warning is present after a search.
2. Create the webdriver object
Again, this is the same as the previous post. We interact with the webdriver object to do things, and capture data.
3. Setup Input and Output files
This is also the same as the previous post. We need to specify our input and output files (read parcel numbers from input, write web data to output).
4. Change the mode of the output file from write to append.
This is also the same with one exception. Instead of opening our file as "write" (w), we open as "append". This way, if we run into an error, we can restart the script, and it will append to the existing data file (the "write" mode overwrites all data in the file each time the file is opened - not what we want).
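One side effect of append mode, as the script stands, is that the header line is written again each time the script is restarted. A minimal sketch of one way around that follows, in Python 3. The helper name open_for_append and the shortened header are hypothetical, chosen only for illustration.

```python
import os

# Shortened, made-up header used only for this illustration.
HEADER = '"Account No.","Parcel No."\n'

def open_for_append(path, header=HEADER):
    """Open path in append mode, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    handle = open(path, 'a')
    if needs_header:
        handle.write(header)
    return handle
```

Calling open_for_append('datafile.csv') on a restart would then add new rows beneath the existing ones without repeating the header.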
5. We create a simple counter variable that increments with each record that we process.
6. As with the previous post, we loop through each row in the input file to do some set of tasks.
7. As before, we load the county page. We set a time.sleep() here to ensure the page loads, but this is the last time we will use a time.sleep().
8. Make an element visible
This was a new problem that popped up with the Apache County page. The content on the page pushed the submit button down below the bottom of the window, such that it was not visible. If an element is not visible, selenium can't work with it. Imagine trying to click a button that is not visible on the page: you can't do it. Your only option is to scroll down to the button, then click it. We do the same here. The 'false' parameter to the scrollIntoView() function tells the browser to align the bottom of the element with the bottom of the visible area, so the page scrolls only far enough to bring the whole button into view (as opposed to placing the submit button at the top of the window).
9. Wait for the page to load
The driver.implicitly_wait() function solves a very big problem for us: how do we avoid reading data items before they have loaded, without waiting indefinitely? Strictly speaking, driver.implicitly_wait(x) does not wait for the whole page to load. It tells the driver to keep retrying each element lookup for up to 'x' seconds. If the element appears after 0.75 seconds, the script continues after 0.75 seconds. If x seconds have passed and the element still cannot be found, the lookup gives up, and the script throws a NoSuchElementException. This setting applies to the remainder of the script (every element lookup on every new page gets up to 'x' seconds), so we no longer require time.sleep() function calls.
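The idea behind this kind of wait (carry on as soon as the thing we need is ready, give up after a limit) comes down to a polling loop. What follows is a simplified stand-in written in plain Python 3, not Selenium's own implementation. The names wait_until and WaitTimeout are hypothetical.

```python
import time

class WaitTimeout(Exception):
    """Raised when the condition is still false after the timeout."""

def wait_until(condition, timeout=10.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # ready: continue immediately, however early that is
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.2f seconds" % timeout)
        time.sleep(poll)  # short pause, then check again

# Example: a condition that becomes true about 0.2 seconds from now.
start = time.monotonic()
wait_until(lambda: time.monotonic() - start >= 0.2, timeout=5.0)
```

Unlike time.sleep(5), the call above returns after roughly 0.2 seconds rather than the full five.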
10. Search for a parcel number, and bring up the Account Summary page
Something here has changed since the previous blog post. Our input file no longer contains *only* parcel IDs. It now contains Account Numbers and Parcel Numbers. We use Python's tuple unpacking to grab both numbers from the current record in a single assignment. We'll see how the two are used further down. We tell selenium to wait for the parcel box to be present on the page, then we fill it with the parcelNumber we obtained from the input file, and call submit() on the box to send the query to the web server.
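Unpacking a two-field row can be tried in isolation. The sketch below (Python 3) reads from an in-memory string instead of allkeys.csv. The account and parcel values are made up, and read_pairs is a hypothetical helper that also skips lines without exactly two fields, which the script itself does not do.

```python
import csv
import io

def read_pairs(handle):
    """Return a list of (account_number, parcel_number) tuples from a CSV handle."""
    pairs = []
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip blank or malformed lines instead of failing on unpacking
        account_number, parcel_number = row  # one assignment, two variables
        pairs.append((account_number.strip(), parcel_number.strip()))
    return pairs

# Made-up sample data in the "AcctNo,ParcelNo" layout, with one blank line.
sample = io.StringIO("R0012345,10512345\n\nR0067890,10567890\n")
print(read_pairs(sample))
```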
11. Check for 'parcel not found' error
In testing, we found that a string of text of class 'warning' is displayed in the search results if the parcel ID being sought was not found. By searching for that element, we can detect the problem, write a message to the screen and the new log file, then continue on to the next line in the input data. If the NoSuchElementException is raised (the warning was *not* found), then we simply pass to exit the try clause, and continue with trying to capture the data.
12. Handle multiple results
For this dataset, there may be multiple rows with the same parcelNumber, but each will have a different account number. This is where we leverage the account number (the first field) from the input file. In the results, we search for one containing the account number we were provided. If we get a hit, we click that row. If that fails (an exception would be thrown), we just grab the first row of class type 'clickable'.
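The selection rule itself (prefer the row with our account number, otherwise take the first row) is easy to check away from the browser. In this sketch the search results are modelled as a list of dictionaries, which is an assumption made purely for illustration. In the script they are clickable page elements, and pick_result is a hypothetical name.

```python
def pick_result(rows, account_number):
    """Return the row whose account matches, else the first row, else None."""
    for row in rows:
        if row['account'] == account_number:
            return row
    return rows[0] if rows else None

# Made-up search results: two accounts sharing one parcel number.
results = [
    {'account': 'R0012345', 'parcel': '10512345'},
    {'account': 'R0067890', 'parcel': '10512345'},
]
print(pick_result(results, 'R0067890'))  # the matching row
print(pick_result(results, 'R9999999'))  # no match, so the first row
```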
13. Filter certain records
At this point, we have enough information visible on the screen to determine whether or not we want this record. The business rule I was given states: 'capture properties that consist of only land, and these have a type of 02.R'. So we can drop in a simple if statement to check the value of the property type. If it's not 02.R, we print a message to the user (and log the same to the logfile), skip all remaining instructions, and continue on with the next row in the input file. Otherwise, we continue with the script.
14./17./20. Capture data as webdriver objects
In these three steps, we scrape the page looking for data, and grab it using a webdriver object. The caveat here is that if we try to move on to the next page, the web elements we just captured will disappear.
15./18./21. Capture the data items as python variables
In these steps, we do some simple processing on the data (removing field labels such as 'Account:' and trimming excess whitespace), then store the result in a python variable. That way, once we move to the next screen, if the webdriver objects disappear (and they will), we have the data we need captured with python for writing to the data file.
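The clean-up can be tried on plain strings. This is a small sketch under the assumption that the page text looks like the made-up samples below. The helper name clean_field is hypothetical, and re.escape is added so that a label is always matched literally.

```python
import re

def clean_field(raw_text, label):
    """Remove the given label from raw_text and trim surrounding whitespace."""
    return re.sub(re.escape(label), '', raw_text).strip()

# Made-up examples of text as it might come back from a page element.
print(clean_field('Account: R0012345', 'Account:'))
print(clean_field('Parcel Number\n105-12-345', 'Parcel Number'))
```

Since the result is an ordinary Python string, it remains usable after the browser has moved on to another page.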
16./19. Move to the next page
In these steps, we move between pages of data. Since the links are simple textual links, we locate them by the text shown on the page, then click the link to jump to that page. I would point out that the wait we set in step 9 applies to these page jumps, also. If an element we look for has not appeared within the 'x' seconds specified in step 9, the script will throw an exception.
22. Output the data
Here, we create a single string (by hand) of each data item we captured from the various pages. Then, we send that string to the output file.
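Building the line by hand works as long as no value contains a double quote. As a hedged alternative, Python's csv module can do the quoting for us. The sketch below (Python 3) writes to an in-memory buffer with made-up values. write_record is a hypothetical helper, and QUOTE_ALL quotes every field, including numeric ones such as the parcel size, which the hand-built string leaves unquoted.

```python
import csv
import io

def write_record(handle, fields):
    """Write one fully quoted CSV line, escaping embedded quotes."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerow(fields)

# Made-up values, including an embedded quote and an embedded comma.
buffer = io.StringIO()
write_record(buffer, ['R0012345', 'SMITH "BUD" J', 'PO BOX 1, SUITE 2'])
print(buffer.getvalue())
```

The embedded quotes come out doubled, which is the convention most spreadsheet and database importers expect.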
23. We now increment the counter variable, and print a message to the user that the data for the specific parcel id has been captured. We also write a line of the same to the log file.
24. By clicking the "Account Search" link, we go back to the initial search page, thus setting us up for the next data item in the input file.
25. As with the code in the previous blog post, we clean up our data files prior to exit.
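The log lines written in steps 11, 13 and 23 could also be produced with Python's standard logging module, which adds a timestamp to each entry. This is a minimal sketch, not the approach the script takes. The logger name and the helpers make_logger and log_outcome are hypothetical.

```python
import logging

def make_logger(path):
    """Create a logger that appends timestamped lines to the file at path."""
    logger = logging.getLogger('scraper_sketch')
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, mode='a')
    handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
    logger.addHandler(handler)
    return logger, handler

def log_outcome(logger, parcel_number, position, total, outcome):
    """Record what happened to one parcel, e.g. 'captured', 'skipped' or 'not found'."""
    logger.info('%s (%d/%d) %s', parcel_number, position, total, outcome)
```

A call such as log_outcome(logger, parcelNumber, iCntr, totItems, 'skipped') would then replace each hand-built logfile.write() line, and handler.close() at the end of the run would take the place of logfile.close().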
UPDATED XPATH IDENTIFICATION
The previous post leveraged a tool that stopped working after an update to Firefox. I went out and located another tool, FirePath. One of the advantages of FirePath is that it integrates directly into FireBug (which, if you followed the previous post, you would already be using); it is simply another tab in FireBug. When inspecting elements in a web page, simply highlight the element you are interested in, and click the FirePath tab to get the XPath reference.
WRAP-UP
And that's it. We now have a much more robust and flexible script to scrape data from the web. Of course, the script is not finished. There is still a wide array of errors that could occur, each of which would require an exception handler. For example, I did run across a roughly one-in-4,000 instance in which the script broke, presumably due to a page load timeout (re-running the script, starting with the parcel ID that had caused the error, succeeded).
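Since simply re-running cleared the failure, one possible next step is to retry a record a few times before giving up. The following is a sketch only, in plain Python 3. with_retries is a hypothetical helper, and in the real script the retry_on tuple would probably name Selenium's TimeoutException rather than the ValueError used here for demonstration.

```python
import time

def with_retries(action, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call action(); on a listed exception, wait and try again, up to attempts times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next try
    raise last_error  # every attempt failed: surface the final error

# Demonstration with a made-up action that fails twice, then succeeds.
state = {'calls': 0}

def flaky_lookup():
    state['calls'] += 1
    if state['calls'] < 3:
        raise ValueError('simulated transient failure')
    return 'ok'

print(with_retries(flaky_lookup, attempts=3, retry_on=(ValueError,)))
```

Wrapping the per-parcel work in a function and passing it to a helper like this would let a transient failure cost one retry instead of a manual restart.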
REFERENCES
Selenium Python API Guide