Course Description
This course explores supervised learning methods for classification, such as logistic regression, decision trees, and support vector machines, and unsupervised clustering methods, such as k-means, hierarchical, spectral, and density-based spatial clustering. Through Python-based projects and real-world case studies, students develop practical skills to analyze datasets, fit models, and interpret results.
Athena Title
Clustering and Classification
Non-Traditional Format
This course will be taught 95% or more online.
Prerequisite
STAT 6381E and STAT 6382E and STAT 6383E
Corequisite
STAT 6384E
Semester Course Offered
Offered summer semester every year.
Grading System
A - F (Traditional)
Student Learning Outcomes
- Students will differentiate between classification and clustering techniques by examining and comparing their key characteristics and assumptions through applications in data science.
- Students will preprocess and explore data effectively by handling missing values, scaling features, and applying dimension reduction techniques to improve model performance.
- Students will fit and evaluate classification models such as logistic regression, decision trees, and support vector machines using appropriate metrics and validation techniques.
- Students will apply various clustering algorithms (e.g., k-means, hierarchical clustering, DBSCAN, spectral clustering) to uncover patterns in unlabeled data.
- Students will optimize machine learning models through hyperparameter tuning and regularization to improve classification and clustering performance.
- Students will develop and execute end-to-end data science projects by integrating data preprocessing, modeling, and visualization techniques.
- Students will utilize Python libraries efficiently (e.g., scikit-learn, pandas, NumPy, Matplotlib, Seaborn) to implement classification and clustering workflows in practical applications.
Topical Outline
- Introduction to Classification and Clustering
• Overview of supervised and unsupervised learning.
• Key differences between classification and clustering.
• Common applications in data science.
- Data Preprocessing and Exploratory Data Analysis (EDA)
• Handling missing data, scaling, and normalization.
• Feature selection and feature engineering for classification and clustering.
• Exploratory visualization techniques: scatter plots, pair plots, dendrograms, biplots.
• Dimensionality reduction methods (e.g., PCA) for clustering and classification.
- Supervised Learning: Classification Techniques
a. Logistic Regression
• Binary and multiclass logistic regression.
• Assumptions, decision boundaries, and interpretability.
• Regularization techniques: L1 and L2 penalties.
b. Decision Trees
• Building and interpreting decision trees.
• Splitting criteria: Gini index, entropy, and information gain.
• Pruning and preventing overfitting.
c. Support Vector Machines (SVM)
• Understanding the SVM algorithm and kernel trick.
• Linear vs. non-linear classification.
• Hyperparameter tuning: C and kernel parameters.
d. Model Evaluation for Classification
• Metrics: accuracy, precision, recall, F1-score, and ROC curves.
• Cross-validation and performance comparison.
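As an illustration of the evaluation metrics listed above, the following minimal sketch computes accuracy, precision, recall, and F1-score for a binary classifier in plain Python. The function name binary_metrics and the sample labels are hypothetical and chosen only for demonstration; in coursework, the equivalent functions in scikit-learn's metrics module would typically be used instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    # Count the cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall answer different questions (how many predicted positives are correct versus how many actual positives are found), which is why F1, their harmonic mean, is often reported alongside accuracy for imbalanced data.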
- Unsupervised Learning: Clustering Techniques
a. K-Means Clustering
• Algorithm mechanics: initialization, iteration, and convergence.
• Choosing the optimal number of clusters.
• Applications and limitations.
b. Hierarchical Clustering
• Agglomerative vs. divisive approaches.
• Linkage criteria: single, complete, and average.
• Visualizing clusters with dendrograms.
c. Spectral Clustering
• Graph-based clustering concepts.
• Laplacian matrices and eigenvectors.
• Applications to non-linearly separable data.
d. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Core concepts: density reachability and connectivity.
• Parameters: epsilon (eps) and minimum points (minPts).
• Identifying noise and outliers.
e. Evaluating Clustering Performance
• Internal metrics: silhouette score, Davies-Bouldin Index.
• External metrics: purity, adjusted Rand index.
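To make the distinction between internal and external clustering metrics concrete, the sketch below implements one of each in plain Python: the silhouette score (internal, uses only the data and the cluster assignments) and purity (external, requires reference labels). The function names, the toy points, and the reference labels are hypothetical; the silhouette function assumes Euclidean distance, at least two clusters, and Python 3.8 or later for math.dist.

```python
from collections import Counter
from math import dist


def purity(cluster_labels, true_labels):
    """Fraction of points that share the majority true label of their cluster."""
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, []).append(t)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in by_cluster.values())
    return majority_total / len(cluster_labels)


def silhouette(points, cluster_labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, c in zip(points, cluster_labels):
        clusters.setdefault(c, []).append(p)

    scores = []
    for p, c in zip(points, cluster_labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated toy clusters (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
reference = ["a", "a", "b", "a"]  # hypothetical ground-truth classes

sil = silhouette(points, labels)
pur = purity(labels, reference)
print(sil, pur)
```

A silhouette value close to 1 indicates compact, well-separated clusters, while purity measures only agreement with the reference labels and tends to increase as the number of clusters grows.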
- Practical Applications and Case Studies
• Solving real-world problems with classification and clustering.
• Working with domain-specific datasets (e.g., healthcare, finance, marketing).
• Project design and execution: from data preprocessing to results interpretation.
- Python Tools and Implementation
• Introduction to key libraries: scikit-learn, pandas, NumPy, Matplotlib, and Seaborn.
• Writing efficient code for classification and clustering.
• Hyperparameter tuning with GridSearchCV and RandomizedSearchCV.
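A minimal sketch of hyperparameter tuning with GridSearchCV is shown below, assuming scikit-learn is installed. The choice of the built-in iris dataset, the support vector classifier, and the candidate values in the grid are illustrative assumptions, not course requirements.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small built-in dataset used purely for illustration.
X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so that each cross-validation fold
# fits the scaler on its own training split only.
pipeline = make_pipeline(StandardScaler(), SVC())

# Parameter names follow the "<step name>__<parameter>" convention;
# make_pipeline names the SVC step "svc". The values are arbitrary candidates.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

# Exhaustively evaluate every combination with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

RandomizedSearchCV follows the same pattern but takes param_distributions and an n_iter budget, sampling a fixed number of combinations instead of trying all of them, which can be preferable when the grid is large.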
- Hands-On Projects
• End-to-end projects integrating classification and clustering techniques.
• Building workflows for problem-solving.
• Presenting findings through visualizations and reports.
Institutional Competencies Learning Outcomes
Analytical Thinking
The ability to reason, interpret, analyze, and solve problems from a wide array of authentic contexts.
Creativity & Innovation
The capacity to combine or synthesize existing ideas, images, or expertise in original ways and the experience of thinking, reacting, and working in an imaginative way characterized by innovation, divergent thinking, and risk taking.
Leadership & Collaboration
The capacity to engage in the relational process of optimizing personal and collective strengths toward a common goal.