Course Description
Advanced data science techniques for analyzing large-scale data in distributed environments. Students will develop scalable algorithms in frameworks such as Spark and Flink. The course is team-based, with several mini-projects during the semester and a competition as the final project.
Athena Title
Data Science Practicum
Prerequisite
CSCI 4360/6360 or CSCI 4380/6380 or permission of department
Semester Course Offered
Not offered on a regular basis.
Grading System
A - F (Traditional)
Course Objectives
The course aims to provide students with a hands-on practicum for studying scalable machine learning on distributed frameworks. Students will have the opportunity to implement algorithms in “traditional” MapReduce paradigms, as well as in next-generation large-scale compute frameworks using standard batch processing, streaming, and graph analytics. This course is ideal for students who anticipate working in quantitative fields such as biomedical imaging or natural language processing, or who plan to take a data science position in industry.
Topical Outline
• Overview of Data Science
• Review of basic machine learning, probability, and statistics
• Theory and practice of distributed computing
• Distributed analytics frameworks (Hadoop, Spark)
• Performing large-scale classification (Random Forests, Naïve Bayes)
• Clustering unstructured data (K-means, spectral clustering)
• Natural language processing (LDA)
• Distributed graph analytics (PageRank)
• Large-scale image analysis (Deep Learning)
• Dimensionality reduction at scale (PCA, Bloom filters, stochastic SVD)
• Scaling randomized algorithms
• Alternative distributed frameworks (GraphLab Create, Flink)