Course Learning Objectives
The course is designed for a diverse group of students with varying backgrounds, including those from Computer Science and Statistics. It serves as a gentle introduction to machine learning, yet its applied nature is valuable even for those already familiar with the field. The curriculum covers foundational concepts of machine learning and data science, with topics ranging from data preprocessing and supervised learning to clustering, recommendation systems, text data processing, a high-level introduction to neural networks, time series, and survival analysis. A key focus of the course is developing hands-on skills in model development, evaluation, and interpretation, together with attention to ethical considerations and clear communication.
By the end of this course, students should be able to:
Describe supervised learning and its suitability for various tasks.
Explain key machine learning concepts such as classification, regression, overfitting, and the trade-off between model complexity and generalization.
Identify appropriate data preprocessing techniques for specific scenarios, provide reasons for their selection, and integrate them into machine learning pipelines.
Develop an intuitive understanding of common supervised machine learning algorithms.
Build end-to-end supervised machine learning pipelines using Python and scikit-learn on different types of datasets.
Understand and differentiate between various evaluation metrics used for classification (e.g., accuracy, precision, recall, F1-score, AP score, AUC-ROC) and regression (e.g., mean absolute error, mean squared error, R-squared). Apply the appropriate evaluation metric based on the problem context and interpret the results to assess model performance.
Recognize the significance of feature engineering in improving model performance.
Compare and contrast various feature selection techniques such as model-based feature selection and recursive feature elimination.
Analyze and interpret feature importances to gain insights into the relevance of different features.
Understand the fundamental concepts behind ensemble methods, including averaging and stacking. Use popular ensemble models such as Random Forest, LightGBM, and CatBoost, and appreciate their advantages in improving predictive accuracy and mitigating overfitting.
Understand the principles and algorithms behind clustering methods such as K-means, hierarchical clustering, and DBSCAN. Apply clustering techniques to segment data into meaningful groups.
Intuitively understand the core concepts behind recommendation algorithms, including collaborative filtering and content-based filtering.
Explain word embeddings such as Word2Vec and GloVe and their significance in capturing semantic relationships in text data.
Develop an intuitive understanding of neural networks, their advantages and drawbacks in machine learning contexts, and why they are particularly effective for image data.
Become familiar with time series data and its appropriate use cases; handle the challenges of splitting time-ordered data, carry out feature engineering, forecast future time points, and explain core concepts such as trends and irregular time intervals.
Acquire an understanding of right-censored data, its implications, and the importance of specialized approaches like survival analysis; apply, interpret, and make predictions using tools such as the lifelines package, Kaplan-Meier curves, and Cox proportional hazards models in Python.
Develop an understanding of the ethical implications surrounding data collection, processing, modeling, and interpretation; critically evaluate potential biases, privacy concerns, and societal impacts.
Cultivate advanced communication skills tailored to diverse audiences, emphasizing reader-centric writing, contextual understanding, and critical evaluation of visualizations to ensure clarity, accuracy, and relevance in conveying ML insights and implications to stakeholders.
Describe the goals and challenges of model deployment.
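As a small illustration of the classification metrics named in the objectives above, the following is a minimal sketch in plain Python, not course material. The function names (`accuracy`, `precision_recall_f1`) and the toy labels are hypothetical and chosen only for this example. It shows why the evaluation metric should be chosen based on the problem context: on imbalanced data, two very different classifiers can reach the same accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Return 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical imbalanced toy data: 2 positive examples out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10                     # never predicts the positive class
eager = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]         # finds both positives, with 2 false alarms

for name, y_pred in [("always_negative", always_negative), ("eager", eager)]:
    p, r, f = precision_recall_f1(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f} "
          f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Both classifiers score 0.80 accuracy here, but only the second one recovers any positive examples, which is visible in precision, recall, and F1 and not in accuracy alone.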