School of Economics and Management
Beihang University
http://yanfei.site

How do we generate data nowadays?

  • Google has 8 services with over 1 billion users!
  • Amazon has 600+ million products!
  • The Netflix Prize dataset had 100 million ratings of 17,770 movies from 480,189 customers!

What data are we talking about?

  • Web data.
  • Text data.
  • Time and location data.
  • Smart grid and sensor data.
  • Social network data.

So what is big data?

Data is dirty!

  • Irregular, missing, incomplete
  • Unstructured
  • Incorrect (human error)
  • Spread out across different databases and files

     \(\Downarrow\)

Difficulties of collecting data will only increase!

Difficulties of analyzing data will only increase!

So we need proper tools!
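
To make the "dirty data" point concrete, here is a minimal sketch in Python (one of the course tools) of handling irregular, missing, and incorrect values. The records, field names, and the `clean_record` helper are all hypothetical and made up for illustration; real cleaning rules depend on the dataset.

```python
# Hypothetical raw records; names and values are invented for illustration only.
raw_records = [
    {"name": "Alice", "age": "34", "city": "Beijing"},
    {"name": "bob ", "age": "", "city": "beijing"},      # missing age, irregular case and whitespace
    {"name": "Carol", "age": "-5", "city": "Shanghai"},  # incorrect value (human error)
    {"name": "Dave", "age": "forty", "city": None},      # unparseable age, missing city
]


def clean_record(record):
    """Return a cleaned copy of record, using None for missing or invalid fields."""
    # Normalize text fields: trim whitespace, unify capitalization, map empty to None.
    name = (record.get("name") or "").strip().title() or None
    city = (record.get("city") or "").strip().title() or None
    # Accept the age only if it is a plain non-negative integer string.
    age_text = (record.get("age") or "").strip()
    age = int(age_text) if age_text.isdecimal() else None
    return {"name": name, "age": age, "city": city}


cleaned = [clean_record(r) for r in raw_records]

# Count how many records still lack a usable age after cleaning.
n_missing_age = sum(1 for r in cleaned if r["age"] is None)

for row in cleaned:
    print(row)
print("records without a valid age:", n_missing_age)
```

Even in this toy case, three of the four records end up without a usable age, which hints at how much effort cleaning takes once the data is large and spread across many sources.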

What tools will we learn in this course?

  • Linux
  • Python
  • Hadoop