Course Objectives: | 1. Use existing Python tools to read and preprocess raw data in various formats (text, images, binary).
2. Choose an appropriate statistical model for extracting knowledge from a particular dataset, weighing the model's advantages and disadvantages.
3. Implement at least one algorithm from the categories of regression, classification, clustering, and convex optimization.
4. Design and document analytical pipelines to be reproducible by others.
5. Use and interpret the results of dimensionality reduction on high-dimensional datasets.
6. Choose the most effective visualization to convey the knowledge learned from the data. |
Topical Outline: | 1. Scientific programming with Python
a) Understand basic Python programming language features such as indentation, modules, functions, boolean expressions, strings, control flow, zip, and basic object-oriented programming.
b) Be able to use various built-in Python data structures to store and access data in order to solve problems. These include lists, tuples, dictionaries, counters, and sets.
c) Make use of commonly used toolkits and modules, such as Jupyter Notebook, pandas, NumPy, scikit-learn, SciPy, Matplotlib, Keras, and TensorFlow, to solve appropriate data science problems.
2. Data visualization
a) Given a dataset, create an appropriate plot to visualize the trends, relationships, and outliers within the data.
b) Compare and contrast various plots, including bar charts, pie charts, histograms, line plots, and scatter plots, and choose the most effective plot(s) to communicate the data.
3. Convex optimization
a) Compute the gradient using partial derivatives when feasible and estimate the gradient using difference quotients.
b) Perform multiple steps of gradient descent towards finding the minimum when given a univariate differentiable objective function.
4. Dimensionality reduction
a) Use greedy forward selection and backward elimination algorithms to find a subset of features suitable for regression and classification models.
b) Become familiar with feature projection algorithms such as principal component analysis (PCA).
5. Supervised classification
a) Demonstrate understanding of the k-nearest neighbor model by computing distances between data points to identify the k nearest neighbors and determine the class label that should be assigned to a data point.
b) Demonstrate understanding of the naïve Bayes model by computing the appropriate conditional probabilities based on Bayes’ theorem in order to classify a data point.
c) Demonstrate understanding of the decision tree model by building a decision tree classifier using the ID3 algorithm when given an appropriate dataset and using the tree to classify data points.
6. Regression
a) Fit simple and multiple linear regression models to data. Be able to interpret the coefficients and compute the R-squared metric.
b) Apply regularization techniques such as ridge and lasso regression.
c) Build polynomial regression models with interaction features on a given dataset.
7. Unsupervised clustering
a) Demonstrate understanding of the K-means algorithm by repeatedly computing new cluster means and reassigning data points to clusters until the assignment of data points stabilizes.
b) Become familiar with the hierarchical clustering algorithm and apply it to a dataset.
8. Introduction to "big data" and "deep learning"
a) Understand the topology of feedforward neural networks and be able to carry out the feedforward computation when given the dataset, weights, biases, and activation functions.
b) Become familiar with the backpropagation algorithm using a simple neural network structure.
c) Become familiar with "big data," "deep learning," convolutional neural networks, and their applications in image recognition. |
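
Illustration for topic 3 (Convex optimization): the following is a minimal sketch, not course-provided material, of gradient descent on a univariate differentiable function whose derivative is estimated with a central difference quotient. The objective function, the function names, the learning rate, and the number of steps are all assumptions chosen for the example.

```python
def difference_quotient(f, x, h=1e-5):
    """Estimate f'(x) with a central difference quotient (h is an assumed step)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient_descent(f, x0, learning_rate=0.1, steps=100):
    """Take `steps` gradient descent steps on a univariate function f, starting at x0."""
    x = x0
    for _ in range(steps):
        # Move against the estimated slope to decrease f.
        x -= learning_rate * difference_quotient(f, x)
    return x


def objective(x):
    """Hypothetical convex objective with its minimum at x = 3."""
    return (x - 3) ** 2 + 1


x_min = gradient_descent(objective, x0=0.0)
print(f"Estimated minimizer: {x_min:.4f}")
```

For a convex objective such as this one, a sufficiently small learning rate moves the estimate toward the unique minimizer; a learning rate that is too large can overshoot and diverge.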