Big Data Essentials

L1: Introduction to Big Data

Yanfei Kang
School of Economics and Management
Beihang University

How do we generate data nowadays?

  • Google has 8 services with over 1 billion users!
  • Amazon has 600+ million products!
  • The Netflix Prize dataset had 100 million ratings of 17,770 movies from 480,189 customers!
  • ...

What data are we talking about?

  • Web data.
  • Text data.
  • Time and location data.
  • Smart grid and sensor data.
  • Social network data.
  • ...

So what is big data?

Data is dirty!

  • Irregular, missing, incomplete
  • Unstructured
  • Incorrect (human error)
  • Spread out across different databases and files
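A minimal sketch of what "dirty" can look like in practice, using only Python's standard library. The two CSV snippets, the column names (id, city, temp_c), the -999 placeholder, and the plausible temperature range are all made up for illustration; they are assumptions, not data from this course.

```python
import csv
import io

# Hypothetical raw data, spread out across two "files".
RAW_A = """id,city,temp_c
1,Beijing,21.5
2,,19.0
3,Shanghai,
"""
RAW_B = """id,city,temp_c
4, shanghai ,23.1
5,Beijing,-999
"""


def parse_rows(text):
    """Read CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop incomplete or implausible rows and normalize the rest."""
    cleaned = []
    for row in rows:
        # Irregular formatting: stray spaces and inconsistent capitalization.
        city = row["city"].strip().title()
        temp_text = row["temp_c"].strip()
        # Missing or incomplete: skip rows lacking a city or a temperature.
        if not city or not temp_text:
            continue
        temp = float(temp_text)
        # Incorrect: values outside an assumed plausible range are dropped.
        if not -90.0 <= temp <= 60.0:
            continue
        cleaned.append({"id": int(row["id"]), "city": city, "temp_c": temp})
    return cleaned


# Combine the two sources, then clean them in one pass.
rows = parse_rows(RAW_A) + parse_rows(RAW_B)
records = clean(rows)
print(len(rows), "raw rows ->", len(records), "clean rows")
```

Dropping bad rows is only one possible policy. Depending on the analysis, one might instead fill in missing values or flag suspicious ones for review.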


Difficulties of collecting data will only increase!

Difficulties of analyzing data will only increase!

So we need proper tools!

What tools will we learn in this course?

  • Linux
  • Python
  • Hadoop

Further readings