Module 1 – Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing. The whole data science process illustrated with industrial case studies. A practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test, and predict or estimate in unsupervised and supervised settings using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex programming.
On the second day of Module 1, we mainly cover MLlib, the Spark Machine Learning (ML) module. We also present regression and classification as the fundamentals of ML, and show how to implement them using MLlib. In addition, we discuss GraphX, the graph processing component of Spark, and show how to process graphs with it.
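To give a feel for the vertex-programming idea behind graph processing, here is a minimal single-machine sketch in plain Python. It is not the GraphX API and does not use Spark; it only imitates the superstep and message-passing pattern of vertex-centric systems. The function name `connected_components`, the `max_supersteps` parameter, and the tiny example graph are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def connected_components(vertices, edges, max_supersteps=50):
    """Label every vertex with the smallest vertex id in its component.

    Illustrative sketch only: assumes every edge endpoint is listed in
    `vertices` and that vertex ids are comparable (e.g. integers).
    """
    # Build an undirected adjacency list.
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    # Initial vertex state: each vertex is labelled with its own id.
    labels = {v: v for v in vertices}

    # Superstep 0: every vertex sends its label to all of its neighbours.
    inbox = defaultdict(list)
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    for _ in range(max_supersteps):
        if not inbox:
            break  # no messages in flight, so every vertex has halted
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            # Vertex program: adopt a smaller label and tell the neighbours.
            if smallest < labels[v]:
                labels[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox
    return labels


# Hypothetical example graph: a path 1-2-3, an edge 4-5, and isolated vertex 6.
example = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(example)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

In a distributed setting the same per-vertex logic would run in parallel across partitions, with the messages exchanged over the network between supersteps; the loop above only simulates that schedule sequentially.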
Download all the content from: