Big Data Essentials

L1: Introduction to Big Data





Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

How do we generate data nowadays?

  • Google has 8 services with over 1 billion users each!
  • Amazon has 600+ million products!
  • The Netflix Prize dataset had 100 million ratings of 17,770 movies from 480,189 customers!
  • ...

What data are we talking about?

  • Web data.
  • Text data.
  • Time and location data.
  • Smart grid and sensor data.
  • Social network data.
  • ...

So what is big data?

Data is dirty!

  • Irregular, missing, incomplete
  • Unstructured
  • Incorrect (human error)
  • Spread out across different databases and files

     $\Downarrow$

Difficulties of collecting data will only increase!

Difficulties of analyzing data will only increase!
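
A small illustration of the "dirty data" points above, as a minimal Python sketch using only the standard library. The records, the column names (user_id, age, city) and the helper clean_rows are hypothetical, invented for this example; real cleaning rules depend on the dataset.

```python
import csv
import io

# Hypothetical raw records, invented for illustration only.
# Row 2 has an empty age, row 3 has a non-numeric age, row 4 lacks a city.
RAW = """user_id,age,city
1,34,Beijing
2,,Shanghai
3,abc,Beijing
4,29
"""


def clean_rows(text):
    """Split rows into usable records and rejected ones with a reason."""
    good, bad = [], []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # csv.DictReader fills missing trailing fields with None.
        if any(value is None or value.strip() == "" for value in row.values()):
            bad.append((row["user_id"], "missing field"))
            continue
        try:
            age = int(row["age"])
        except ValueError:
            # Incorrect entry, e.g. a typing error.
            bad.append((row["user_id"], "age is not a number"))
            continue
        good.append({"user_id": int(row["user_id"]), "age": age, "city": row["city"]})
    return good, bad


good, bad = clean_rows(RAW)
print("usable:", good)
print("rejected:", bad)
```

Even in this toy case only one of the four rows is usable as-is, which is why cleaning tends to take a large share of the effort in any analysis.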



So we need proper tools!

What tools will we learn in this course?

  • Linux
  • Python
  • Hadoop

Further readings