How do we generate data nowadays?
Google has over 200 products with over 4.3 billion users!
Amazon has more than 12 million products!
The Netflix Prize dataset had 100 million ratings of 17,770 movies from 480,189 customers!
...
What data are we talking about?
- Web data.
- Text data.
- Time and location data.
- Smart grid and sensor data.
- Social network data.
- ...
So what is big data?
Data is dirty!
- Irregular, missing, incomplete
- Unstructured
- Incorrect (human error)
- Spread out across different databases and files
$\Downarrow$
Difficulties of collecting data will only increase!
Difficulties of analyzing data will only increase!
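To make the "dirty data" points above concrete, here is a minimal sketch using only the Python standard library. All records, field names, and helper functions (`parse_age`, `parse_amount`, `load_customers`, `total_per_customer`) are hypothetical and invented for illustration; the 0 to 120 age range is an assumed plausibility check, not a rule from the course.

```python
import csv
import io

# Hypothetical customer file: irregular formatting ("bob "),
# a missing age, and an incorrect age (-5, e.g. a typing error).
CUSTOMERS_CSV = """id,name,age
1,Alice,34
2,bob ,
3,Carol,-5
"""

# Hypothetical orders kept in a different source: one unparsable amount
# and one order that refers to a customer missing from the file above.
ORDERS = [
    {"customer_id": "1", "amount": "19.99"},
    {"customer_id": "2", "amount": "n/a"},
    {"customer_id": "4", "amount": "5.00"},
]


def parse_age(raw):
    """Return the age as an int, or None if it is missing or implausible."""
    raw = raw.strip()
    if not raw:
        return None
    try:
        age = int(raw)
    except ValueError:
        return None
    # Assumed plausibility range; anything outside is treated as an error.
    return age if 0 <= age <= 120 else None


def parse_amount(raw):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(raw)
    except ValueError:
        return None


def load_customers(text):
    """Read the CSV text and normalise names and ages."""
    customers = {}
    for row in csv.DictReader(io.StringIO(text)):
        customers[row["id"]] = {
            "name": row["name"].strip().title(),
            "age": parse_age(row["age"]),
        }
    return customers


def total_per_customer(customers, orders):
    """Sum valid order amounts per known customer; collect unmatched orders."""
    totals = {cid: 0.0 for cid in customers}
    unmatched = []
    for order in orders:
        cid = order["customer_id"]
        amount = parse_amount(order["amount"])
        if cid not in customers:
            unmatched.append(order)
        elif amount is not None:
            totals[cid] += amount
    return totals, unmatched


customers = load_customers(CUSTOMERS_CSV)
totals, unmatched = total_per_customer(customers, ORDERS)
print(customers)
print(totals)
print(unmatched)
```

Even with three customers and three orders, most of the code deals with cleaning rather than analysis, which is the point of the list above.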
What tools will we learn in this course?
- Linux
- Git
- Python (modelling)
- Distributed systems and distributed computing