Course ID: | CSCI 8360. 4 hours. |
Course Title: | Data Science Practicum |
Course Description: | Advanced data science techniques for analyzing large-scale data
in distributed environments. Students will develop scalable
algorithms in frameworks such as Spark and Flink. This course is
team-based: students complete several mini-projects during the
semester, and the final project is a competition. |
Oasis Title: | Data Science Practicum |
Prerequisite: | CSCI 4360/6360 or CSCI 4380/6380 or permission of department |
Semester Course Offered: | Not offered on a regular basis. |
Grading System: | A-F (Traditional) |
Course Objectives: | The course aims to provide students with a hands-on practicum
for studying scalable machine learning on distributed
frameworks. Students will have the opportunity to implement
algorithms in “traditional” MapReduce paradigms, as well as
next-generation large-scale compute frameworks using standard
batch processing, streaming, and graph analytics. This course
is ideal for students who anticipate working in quantitative
fields such as biomedical imaging or natural language
processing, or plan to enter a data science position in
industry. |
Topical Outline: | • Overview of Data Science
• Review of basic machine learning, probability, and statistics
• Theory and practice of distributed computing
• Distributed analytics frameworks (Hadoop, Spark)
• Performing large-scale classification (Random Forests, Naïve
Bayes)
• Clustering unstructured data (K-means, spectral clustering)
• Natural language processing (LDA)
• Distributed graph analytics (PageRank)
• Large-scale image analysis (Deep Learning)
• Dimensionality reduction at scale (PCA, Bloom filters,
stochastic SVD)
• Scaling randomized algorithms
• Alternative distributed frameworks (GraphLab Create, Flink) |