Web scraping with Selenium WebDriver¶
Selenium WebDriver was built for browser automation, which also makes it a convenient tool for scraping data from the web. It is used to select and navigate the parts of a website that are not static and must be clicked or chosen from drop-down menus.
If any content on the page is rendered by JavaScript, Selenium WebDriver loads the page in a real browser and waits for it to finish loading before crawling, whereas libraries such as BeautifulSoup, Scrapy, and Requests only work with the static HTML returned by the server.
Any browser action can be performed with Selenium WebDriver, which helps when content only appears after a button click, scrolling, or page navigation.
I suggest you run this ipynb locally.
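As a rough illustration of the scrolling case mentioned above, the sketch below keeps scrolling to the bottom of a page until its height stops growing. The helper name scroll_until_stable and its pause and max_rounds parameters are my own assumptions, not part of Selenium; the only WebDriver call it relies on is execute_script.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any Selenium WebDriver instance. Returns the number of
    scrolls performed. The name and defaults are illustrative assumptions.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give JavaScript time to append new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds
```

You would call scroll_until_stable(driver) after driver.get(...) and before collecting elements. A fixed pause is the simplest choice here; it trades speed for reliability on slow pages.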
Pros of using WebDriver¶
- WebDriver can simulate a real user working with a browser
- WebDriver can scrape a web site using a specific browser
- WebDriver can scrape complicated web pages with dynamic content
- WebDriver is able to take screenshots of the webpage
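To make the screenshot point concrete, here is a small sketch. The helper capture and the output file name bing.png are hypothetical; save_screenshot and FirefoxOptions are part of Selenium's Python bindings. The demo function assumes Firefox and geckodriver are installed, so it is defined but not called here.

```python
def capture(driver, url, path):
    """Open `url` and save a PNG screenshot to `path`.

    Returns whatever save_screenshot returns (True on success).
    """
    driver.get(url)
    return driver.save_screenshot(path)


def demo():
    # Imported here so the helper above can be defined without a browser.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        print(capture(driver, "https://www.bing.com/", "bing.png"))
    finally:
        driver.quit()  # always shut the browser down
```

Call demo() in a local session to try it; the screenshot is written to the current working directory.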
Cons of using WebDriver¶
- The program becomes quite large
- The scraping process is slower
- The browser generates more network traffic
- The scraping can be detected by means as simple as Google Analytics
Web Scraping Bing with Selenium Firefox driver¶
- Let's now load the main Bing search page and make a query for "Yanfei Kang".
- You need to install the selenium module for Python.
- You also need geckodriver, placed in a directory listed in $PATH. You can download it from https://github.com/mozilla/geckodriver/releases.
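Before starting the browser, you may want to confirm that geckodriver is actually on your $PATH. The check below is a minimal sketch using only the standard library; the function name locate_driver is an assumption of mine.

```python
import shutil


def locate_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("geckodriver not found on PATH")
else:
    print("geckodriver found at", path)
```

If the first message is printed, move the downloaded executable into a directory on $PATH (or add its directory to $PATH) and run the check again.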
In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
In [2]:
driver.get("https://www.bing.com/")
In [3]:
driver.find_element(By.ID, "sb_form_q").send_keys("Yanfei Kang")
In [4]:
driver.find_element(By.ID, "search_icon").click()
In [5]:
driver.close()
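The cells above can also be gathered into one function, so that the browser is always shut down even if a step fails. This is only a sketch: run_search and demo are hypothetical names, while the element IDs sb_form_q and search_icon are the ones used in the cells above and may change whenever Bing updates its page.

```python
def run_search(driver, url, box_locator, button_locator, query):
    """Load `url`, type `query` into the search box, then click the button.

    Locators are (by, value) tuples such as (By.ID, "sb_form_q").
    """
    driver.get(url)
    driver.find_element(*box_locator).send_keys(query)
    driver.find_element(*button_locator).click()


def demo():
    # Imported here so run_search can be defined without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        run_search(
            driver,
            "https://www.bing.com/",
            (By.ID, "sb_form_q"),
            (By.ID, "search_icon"),
            "Yanfei Kang",
        )
    finally:
        driver.quit()  # closes every window and ends the geckodriver process
```

Passing the locators in as arguments keeps the function independent of any one site, so the same helper could be reused with other search pages.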
Web Scraping Baidu¶
In [18]:
from selenium import webdriver
from selenium.webdriver.common.by import By  # needed for By.ID and By.XPATH below
driver = webdriver.Firefox()
In [22]:
driver.get("https://www.baidu.com/")
driver.find_element(By.ID, "kw").send_keys("北航康雁飞")
In [23]:
driver.find_element(By.ID, "su").click()
In [24]:
results = driver.find_elements(By.XPATH, '//h3/a')
for result in results:
    print(result.text)
康雁飞-北航经济管理学院
康雁飞 - 百度百科
北京航空航天大学主页平台系统
康雁飞--中文主页--研究领域
经管人物|康雁飞:从百度高级工程师到北航副教授-北航经济...
康雁飞 – Yanfei Kang, Ph.D.
康雁飞 - 北京航空航天大学 - 经济管理学院
康雁飞简介_北京航空航天大学经济管理学院博士生导师:康...
北京航空航天大学经济管理学院导师教师师资介绍简介-康雁...
统计与数学学院召开2017年北京预测研讨会-中央财经大学统...
北航经管学院研究生 教学评估卡-08114302科学写作与报告--...
driver.close()
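Because the result links are filled in after the click, reading them immediately can occasionally return an empty list. One way to guard against that is an explicit wait, sketched below with Selenium's WebDriverWait and expected_conditions. The helper visible_texts, the function demo, and the 10-second timeout are my own assumptions; the IDs and the XPath are the ones used in the cells above.

```python
def visible_texts(elements):
    """Return the stripped, non-empty .text of each element, in order."""
    texts = (element.text.strip() for element in elements)
    return [text for text in texts if text]


def demo(timeout=10):
    # Imported here so visible_texts can be defined without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get("https://www.baidu.com/")
        driver.find_element(By.ID, "kw").send_keys("北航康雁飞")
        driver.find_element(By.ID, "su").click()
        # Wait until at least one result link exists; raises TimeoutException otherwise.
        links = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, "//h3/a"))
        )
        for title in visible_texts(links):
            print(title)
    finally:
        driver.quit()
```

Calling demo() locally should print result titles similar to those listed above, although the exact results depend on what Baidu returns at the time.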