TDM 20200: Project 3 — 2023
Motivation: Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from zillow.com.
Context: In the previous project, we got our first taste at actually scraping data from a website, and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that let’s you interact with a browser, selenium.
Scope: python, web scraping, selenium
Questions
Question 1
Pop open a browser and visit zillow.com. Many websites have a similar interface — a bold and centered search bar for a user to interact with.
First, in your browser, type in 32607
into the search bar and press enter/return. There are two possible outcomes of this search, depending on the computer you are using and whether or not you’ve been browsing zillow. The first is your search results. The second a page where the user is asked to select which type of listing they would like to see.
This second option may or may not consistently pop up. For this reason, we’ve included the relevant HTML below, for your convenience.
<div class="yui3-lightbox interstitial">
<div class="yui3-lightbox-content lightbox_responsive">
<a class="zsg-icon-x-thin lightbox-close" title="Close" tabindex="-1"></a>
<div class="lightbox-body zsg-tooltip-viewport">
<div>
<img alt="" src="https://www.zillowstatic.com/s3/homepage/static/interstitial_graphic.png" class="sc-14dvu6m-0 iYqEdo " width="262px" height="100px">
<h3 id="interstitial-title" class="sc-14dvu6m-1 kvYidp ">What type of listings would you like to see?</h3>
<ul class="sc-14dvu6m-2 gfkDFS listing-interstitial-buttons">
<li>
<button class="StyledButton-c11n-8-69-2__sc-wpcbcc-0 lebuee">For sale</button>
</li>
<li>
<button class="StyledButton-c11n-8-69-2__sc-wpcbcc-0 lebuee">For rent</button>
</li>
<li>
<button class="StyledTextButton-c11n-8-69-2__sc-n1gfmh-0 iHnAnn">Skip this question</button>
</li>
</ul>
</div>
</div>
</div>
</div>
Remember that the value of an element is the text that is displayed between the tags. For example, the following element has "happy" as its value.
You can use XPath expressions to find elements by their value. For example, the following XPath expression will find all //div[text()='happy'] |
Use selenium
, and write Python code that first finds the search bar input
element of the zillow.com home page. Then, use selenium
to emulate typing the zip code 32607
into the search bar followed by a press of the enter/return button.
Confirm your code works by printing the current URL of the page after the search has been performed. What happens?
To print the URL of the current page, use the following code.
|
Well, it is likely that the URL is unchanged. Remember, the "For sale", "For rent", "Skip this question" page may pop up, and this page has the same URL. To confirm this, instead of printing the URL, instead print the entirety of the HTML provided above. To do so, first use xpath expressions to isolate the outermost div
element, then print the the entire element, including all of the nested elements.
To print the HTML of an element using
If you don’t know what HTML to expect, the
Of course, please only print the isolated element — we don’t want to print it all — that would be a lot! |
You can use the class 'yui3-lightbox interstitial'. |
Remember, in the background, |
One downside to selenium is it has some more boilerplate code than,
Please feel free to "reset" your driver (for example, if you’ve lost track of "where" it is (what webpage, for example) or you aren’t getting results you expected) by running the following code, followed by the code shown above.
|
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Okay, let’s go forward with the assumption that we will always see the "For sale", "For rent", and "Skip this question" page. We need our code to handle this situation and click the "Skip this question" button so we can get our search results!
Write Python code that uses selenium
to find the "Skip this question" button and click it. Confirm your code works by printing the current URL of the page after the button has been clicked. Is the URL what you expected?
Don’t forget, it may be best to put a |
Uh oh! If you did this correctly, it is likely that the URL is not quite right — something like: www.zillow.com/homes/rb/
, or maybe a captcha page. By default, websites employ a variety of anti-scraping techniques. On the bright side, we _did notice (when doing this search manually) that the URL should look like: www.zillow.com/homes/32607_rb/
— we can just insert our zip code directly in the URL and that should work without any fuss, plus we save some page loads and clicks. Great! Alternatively, you could also narrow down the search to homes "For Sale" by using www.zillow.com/homes/for_sale/32607_rb/
.
If you are paying close attention — you will find that this is an inconsistency between using a browser manually and using |
Test out (using selenium
) that simply inserting the zip code in the URL works as intended. Finding the title
element and printing the contents should verify quickly that it works as intended. Bake this functionality into a function called print_title
that takes a search term, search_term
, and returns the contents of the title
element.
element = driver.find_element("xpath", "//title")
print(element.get_attribute("outerHTML"))
# make sure this works
print_title("32607")
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Okay great! Take your time to open a browser to www.zillow.com/homes/for_sale/32607_rb/
and use the Inspector to figure out how the web page is structured. For now, let’s not worry about any of the filters. The main useful content is within the cards shown on the page. Price, number of beds, number of baths, square feet, address, etc., is all listed within each of the cards.
What non li
element contains the cards in their entirety? Use selenium
and XPath expressions to extract those elements from the web page. Print the value of the id
attributes for all of the cards. How many cards was there? (this could vary depending on when the data was scraped — that is ok).
You can use the Some examples of how to use //div[starts-with(@id, 'card_')] # all divs with an id attribute that starts with 'card_' //div[starts-with(text(), 'okay_')] # all divs with content that starts with 'okay_' |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
How many cards were there? For me, there were just about 8. Let’s verify things. Open a browser to www.zillow.com/homes/for_sale/32607_rb/
and scroll down to the bottom of the page. How many cards are there?
For me, there were significantly more than 8. There were 40. Something is going on here. What is going on is lazy loading. What this means is the web page is only loading the first 8 cards, and then loading the rest of the cards as you scroll down the page. This is a common technique to reduce the initial load time of a web page. This is also the perfect scenario for us to really utilize the power of selenium
. If we just use a package like requests
, we are unable to scroll down the page and load the rest of the cards.
Check out the function below called load_all_cards
that accepts the driver
as an argument, and scrolls down the page until all of the cards have been loaded. Examine the function and explain (in a markdown cell) what it is doing. In addition, use the function in combination with your code from the previous question to print the id
attribute for all of the cards. How many cards were there this time?
def load_all_cards(driver):
cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
while True:
try:
num_cards = len(cards)
driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
time.sleep(2)
cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
if num_cards == len(cards):
break
num_cards = len(cards)
except StaleElementReferenceException:
# every once in a while we will get a StaleElementReferenceException
# because we are trying to access or scroll to an element that has changed.
# this probably means we can skip it because the data has already loaded.
continue
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |