Big Data Essentials¶

L1: Introduction to Big Data¶





Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

How do we generate data nowadays?¶

  • Google has over 200 products with over 4.3 billion users!

  • Amazon has more than 12 million products!

  • The Netflix Prize dataset had 100 million ratings of 17,770 movies from 480,189 customers!

  • ...

What data are we talking about?¶

  • Web data.
  • Text data.
  • Time and location data.
  • Smart grid and sensor data.
  • Social network data.
  • ...

So what is big data?¶

Data is dirty!¶

  • Irregular, missing, incomplete
  • Unstructured
  • Incorrect (human error)
  • Spread out across different databases and files

     $\Downarrow$

Difficulties of collecting data will only increase!

Difficulties of analyzing data will only increase!
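
To make the kinds of dirtiness listed above concrete, here is a minimal sketch using only the Python standard library. The two in-memory "files", their column names, and the helpers `parse_age` and `parse_amount` are all made up for illustration; they show stray whitespace, inconsistent capitalization, a missing value, an impossible value, and records spread across two sources.

```python
import csv
import io

# Two hypothetical sources describing the same customers (toy data, not real).
customers_csv = "id,name,age\n1, Alice ,34\n2,Bob,\n3,carol,-5\n"
orders_csv = "customer_id,amount\n1,20.5\n1,n/a\n3,12\n"


def parse_age(text):
    """Return a plausible age as an int, or None if missing or invalid."""
    text = text.strip()
    if not text.isdigit():  # rejects empty strings and negatives such as "-5"
        return None
    age = int(text)
    return age if 0 < age < 120 else None


def parse_amount(text):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


# Clean the first source: trim whitespace, normalize capitalization, validate age.
customers = {}
for row in csv.DictReader(io.StringIO(customers_csv)):
    customers[row["id"]] = {
        "name": row["name"].strip().title(),
        "age": parse_age(row["age"]),
        "total": 0.0,
    }

# Combine with the second source, skipping amounts that cannot be parsed.
for row in csv.DictReader(io.StringIO(orders_csv)):
    amount = parse_amount(row["amount"])
    if amount is not None and row["customer_id"] in customers:
        customers[row["customer_id"]]["total"] += amount

for customer_id, record in customers.items():
    print(customer_id, record)
```

Even in this tiny case, most of the code is cleaning rather than analysis, which is the point of the slide.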



So we need proper tools!¶

What tools will we learn in this course?¶

  • Linux
  • Git
  • Python (modelling)
  • Distributed systems and distributed computing

Remember Google's PageRank?¶
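
As a reminder of the idea, here is a minimal sketch of PageRank computed by power iteration in plain Python. The function name `pagerank`, the damping factor of 0.85, the convergence tolerance, and the four-page toy web are assumptions chosen for illustration, not something taken from the course material or from Google's production system.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes its rank on in equal shares to the pages it links to.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)

        converged = sum(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical toy web: A links to B and C, B to C, C to A, D to C.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

On a graph with billions of pages, the same iteration no longer fits on one machine, which is one motivation for the distributed computing part of the course.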

Further readings¶

  • Chapter 1 of the textbook.