Web Scraping with Selenium WebDriver

Since Selenium WebDriver was created for browser automation, it can easily be used to scrape data from the web. Selenium can select and navigate the components of a website that are not static and need to be clicked or chosen from drop-down menus.

If any content on the page is rendered by JavaScript, Selenium WebDriver waits for the entire page to load before scraping, whereas libraries such as BeautifulSoup, Scrapy, and Requests work only on static pages.

Any browser action can be performed with Selenium WebDriver, for example when content on the page is displayed only after a button click, scrolling, or page navigation.
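As a minimal sketch of such an action, the helper below scrolls the page repeatedly, e.g. to trigger content that is lazy-loaded on scroll. It only calls `execute_script`, a standard WebDriver method; the function name and the `times`/`pause` parameters are our own choices, not Selenium names.

```python
import time


def scroll_to_bottom(driver, times=3, pause=1.0):
    """Scroll to the bottom of the page `times` times, pausing between scrolls.

    Works with any object exposing the standard WebDriver
    `execute_script` method, e.g. webdriver.Firefox().
    """
    for _ in range(times):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give dynamically loaded content time to appear

# In a real session:
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://www.bing.com/")
#   scroll_to_bottom(driver)
```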

Pros of using WebDriver

  • WebDriver can simulate a real user working with a browser
  • WebDriver can scrape a web site using a specific browser
  • WebDriver can scrape complicated web pages with dynamic content
  • WebDriver is able to take screenshots of the webpage

Cons of using WebDriver

  • The program becomes quite large
  • The scraping process is slower
  • The browser generates a bigger network traffic
  • The scraping can be detected by such simple means as Google Analytics

Web Scraping Bing with Selenium Firefox driver

Let’s now load the main Bing search page and make a query for “Yanfei Kang”. You need to install the selenium module for Python. You also need geckodriver, placed in a directory on your $PATH; you can download it from https://github.com/mozilla/geckodriver/releases .

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
In [4]:
driver.get("https://www.bing.com/")
In [7]:
# Locate the search box by its id and type the query
driver.find_element(By.ID, "sb_form_q").send_keys("Yanfei Kang")
In [8]:
# Click the search button to submit the query
driver.find_element(By.ID, "sb_form_go").click()
In [23]:
#driver.close()
#driver.quit()
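Once the query is submitted, the result links can be collected. Below is a minimal sketch; the default selector `li.b_algo h2 a` is an assumption about Bing's current result markup and may need adjusting, while `find_elements` and `get_attribute` are standard WebDriver calls (`"css selector"` is the locator-strategy string behind Selenium's `By.CSS_SELECTOR` constant).

```python
def collect_links(driver, css_selector="li.b_algo h2 a"):
    """Return (title, url) pairs for all links matching css_selector.

    The default selector is a guess at Bing's organic-result markup.
    """
    elements = driver.find_elements("css selector", css_selector)
    return [(e.text, e.get_attribute("href")) for e in elements]

# e.g., after the search above:
#   for title, url in collect_links(driver):
#       print(title, url)
```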

Web Scraping Baidu

In [28]:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
In [29]:
driver.get("https://www.baidu.com/")
# Baidu's search box has id "kw"
driver.find_element(By.ID, "kw").send_keys("康雁飞")
In [30]:
# "su" is the id of Baidu's search button
driver.find_element(By.ID, "su").click()
# Result titles are links inside <h3> elements
results = driver.find_elements(By.XPATH, '//h3/a')
In [31]:
for result in results:
    print(result.text)
In [32]:
driver.close()
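The Bing and Baidu examples follow the same pattern: load the page, type the query into a box, click the submit button. As a sketch, that pattern can be factored into one reusable function with the element ids seen above (`sb_form_q`/`sb_form_go` for Bing, `kw`/`su` for Baidu); `"id"` is the locator-strategy string behind `By.ID`, and the function name here is our own.

```python
def run_search(driver, url, box_id, button_id, query):
    """Load url, type query into the element box_id, and click button_id."""
    driver.get(url)
    driver.find_element("id", box_id).send_keys(query)
    driver.find_element("id", button_id).click()

# e.g.:
#   run_search(driver, "https://www.bing.com/", "sb_form_q", "sb_form_go", "Yanfei Kang")
#   run_search(driver, "https://www.baidu.com/", "kw", "su", "康雁飞")
```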

Lab

Use Selenium to implement the case we studied with BeautifulSoup in L2.