Completing DSE220x Machine Learning Fundamentals

DSE220x Machine Learning Fundamentals is a course offered by UCSanDiegoX on edX. It is a 10-week course and towards the end, there is a proctored exam that tests various concepts covered in class.

This course provides an introduction to the most commonly used machine learning techniques. The first goal is to provide a mathematical foundation on topics such as linear algebra, probability theory and statistics. The second goal is to provide a basic intuitive understanding of machine learning algorithms such as linear regression, logistic regression, support vector machines, decision trees, boosting and random forests: what they are good for, how they work, how they relate to one another, and their strengths and weaknesses. The third goal is to provide a hands-on feel for these machine learning algorithms through lab exercises on numerous datasets, using Jupyter notebooks. The fourth goal is to understand machine learning methods at a deeper level by delving into their mathematical underpinnings. This is crucial to being able to adapt and modify existing methods and to creatively combine them, also known as ensembling.

Topics covered and my comments:

  • Taxonomy of prediction problems: The course starts with a host of prediction problems as a means of introducing learners to the world of machine learning. The precursors of today's machine learning algorithms were rule-based systems, and the lectures show how the need for flexibility drove the field to where it is today. Machine learning is probably one of the research areas that utilises multiple fields of mathematics, such as probability and statistics, information theory and optimization, to advance the state of the art.
  • Nearest neighbor methods and families of distance functions: This section gives learners intuition about nearest neighbor methods and the distance functions used in machine learning algorithms. It gives a general understanding of how the various distance functions are selected and how to go about improving the performance of nearest neighbor search. The lectures were easy to understand, and this is one of the fundamental building blocks of machine learning. It does not cover the K-Means clustering algorithm, which I think is a natural next step from these materials. (A minimal 1-NN sketch appears after this list.)
  • Generalization: what it means; overfitting; selecting parameters using cross-validation: In this section, we learn what happens when a trained machine learning model does not generalize well to new data, also known as overfitting the training data. The lectures illustrate how to deal with overfitting by introducing another split of the data, the validation set, on which hyper-parameter tuning is performed. K-fold cross-validation is typically used when the dataset is not sufficiently large, but it can be applied to large datasets as well. (A cross-validation sketch is included after this list.)
  • Generative modelling for classification, especially using the multivariate Gaussian: In generative modelling, the machine learning algorithm attempts to learn the probability distribution of the data and uses that information to predict the target variable or perform classification. Each class is modelled with its own distribution, and when the data has several features a natural choice is the multivariate Gaussian; the approach extends easily to more than two classes. There exist other probabilistic machine learning models such as Naive Bayes, but sadly the course did not cover them in much depth. As this is a fundamentals course, most of the content is introductory and aims to give learners a base to build upon for more advanced topics. (A Gaussian classifier sketch follows this list.)
  • Linear regression and its variants: The most basic machine learning algorithm for regression-type problems. The course covers multiple variants of linear regression, such as regularised linear regression, the intuition behind it and the kinds of issues it addresses. It is easy to breeze through this content, but I guess this is one of the fundamental building blocks of machine learning. (A closed-form ridge regression sketch follows this list.)
  • Logistic regression: Logistic regression is just a fancy name for using regression to perform classification. The key intuition is to model the probability of class membership and then apply a threshold so as to classify the data accordingly. As the name implies, it is a statistical model that uses the logistic function to model a binary dependent variable, True or False. It is easily extended to multi-class logistic regression, where the dependent variable has multiple states or classes. (A short scikit-learn sketch appears after this list.)
  • Optimization: deriving stochastic gradient descent algorithms and testing convexity: One common numerical optimization technique used in machine learning is gradient descent. It is commonly used to update the weights of a neural network and to determine the linear boundaries of machine learning models such as support vector machines. The math of gradient descent and convexity was covered pretty well in the lectures, but I do feel the material was too theoretical; more attention should be placed on the applications of gradient descent and convexity, and more examples would definitely help. (A stochastic gradient descent sketch is included after this list.)
  • Linear classification using the support vector machine: Support vector machines used to be very popular in the days when datasets were small, before the resurgence of neural networks spawned an entirely new area of machine learning known as deep learning. The key idea behind the SVM is to maximise the margin between the positive and negative examples, and not surprisingly, the support vectors, which are the points in the dataset closest to the decision boundary, are what define this margin and hence the classifier.
  • Nonlinear modeling using basis expansion and kernel methods: It is kind of interesting that you can use an SVM with non-linear kernels, which create non-linear boundaries for classification problems, and basis expansion, whereby a basis function transforms the inputs so that the machine learning algorithm can learn a non-linear decision boundary. The lectures provide an intuitive understanding of the math behind non-linear modelling using kernel methods and basis expansion. Very mathematical but well explained! (A kernel SVM sketch follows this list.)
  • Decision trees, boosting, and random forests: A decision tree is a supervised machine learning algorithm used for classification or regression. Starting from the root node, we work our way down the tree by performing splits based on different conditions on the features; decision trees can also be used to visually represent decisions and decision making. Boosting is an ensemble method that combines many weighted weak learners into a strong learner. There was a time when machine learning competitions were dominated by boosting algorithms, and as of today they are still hard to beat. A random forest builds on decision trees: it is an ensemble method that constructs a multitude of randomised, largely uncorrelated decision trees which, when combined, perform better than a single tree. (A tree-versus-ensemble comparison is sketched after this list.)
  • Methods for flat and hierarchical clustering: For flat clustering, the workhorse is the K-Means algorithm, where we attempt to cluster the data based on a distance or similarity measure such as Euclidean distance or cosine similarity. In hierarchical clustering, we look at the pairwise distances between points in two different clusters and choose the most effective measure of similarity for the context at hand; this is commonly used to cluster news articles. There is no hard and fast rule for selecting the number of clusters. As the K-Means objective is non-convex, multiple local minima exist and the resulting clusters can differ from run to run, so multiple restarts are necessary to obtain a good solution. (A K-Means sketch with restarts appears after this list.)
  • Principal component analysis: Dimensionality reduction is a classic topic that is still very relevant today given the curse of dimensionality in huge datasets. Principal component analysis is a common technique for feature extraction: essentially, we project our data onto a lower-dimensional space while preserving the important structure of the data, thereby reducing its dimension. An important concept is the proportion of variance explained by the principal components. By keeping enough components to explain a high proportion, say close to 99%, we keep the reconstruction error small and preserve the important features of the high-dimensional data. (A variance-explained sketch follows this list.)
  • Autoencoders, distributed representations, and deep learning: The lectures on deep learning were very brief and felt like a quick tour. The lecturer jumped straight into autoencoders, which I feel was too big a jump; the basics of a neural network, such as the perceptron, should come before it. I would also have been keen to see deep learning frameworks such as TensorFlow and PyTorch and how they are used to perform analysis on large datasets.
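
To give these topics a more hands-on flavour, in the spirit of the course's Jupyter labs, here are a few minimal code sketches. They are my own illustrative snippets rather than the course notebooks, so the datasets, parameter values and helper names are assumptions chosen purely for demonstration. First, nearest neighbor classification: a bare-bones 1-NN classifier using Euclidean distance.

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """Label a point x with the label of its nearest training point (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    return y_train[np.argmin(distances)]              # label of the closest one

# Toy data: two small clusters in 2-D.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(nn_classify(X_train, y_train, np.array([0.8, 0.9])))   # -> 1
```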
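
Next, cross-validation. A small sketch of tuning a hyper-parameter (here, the number of neighbors k) with 5-fold cross-validation in scikit-learn; the candidate values of k are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a few candidate values of k with 5-fold cross-validation
# and keep the one with the best mean accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [1, 3, 5, 7, 9]}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```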
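
A generative classifier with class-conditional multivariate Gaussians: fit a prior, a mean and a covariance per class, then pick the class with the largest prior times Gaussian density. This is a simplified sketch with no regularisation of the covariance matrices.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import load_iris

def fit_gaussian_classifier(X, y):
    """Fit a class prior, mean vector and covariance matrix for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),            # class prior
                     Xc.mean(axis=0),             # mean vector
                     np.cov(Xc, rowvar=False))    # covariance matrix
    return params

def predict(params, x):
    """Pick the class maximising prior * Gaussian density at x."""
    scores = {c: prior * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (prior, mu, cov) in params.items()}
    return max(scores, key=scores.get)

X, y = load_iris(return_X_y=True)
params = fit_gaussian_classifier(X, y)
print(predict(params, X[0]), y[0])   # the two labels should agree
```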
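
Regularised linear regression: ridge regression has the closed-form solution w = (X^T X + lambda*I)^(-1) X^T y, which the sketch below computes directly (no intercept term, and the regularisation strength is an arbitrary choice).

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression (no intercept for simplicity)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))   # roughly [2.0, -1.0, 0.5]
```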
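
Logistic regression: the model produces a probability via the logistic (sigmoid) function, and a threshold, typically 0.5, turns that probability into a class label. A short scikit-learn sketch on a synthetic binary dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X[:5])[:, 1]        # P(y = 1 | x), via the logistic function
preds = (probs >= 0.5).astype(int)            # threshold at 0.5 to get class labels
print(probs, preds, clf.predict(X[:5]))       # the last two should agree
```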
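
Stochastic gradient descent on a convex objective: the sketch minimises the least-squares loss using the gradient from one randomly chosen example per step. The learning rate and number of epochs are arbitrary choices.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent on the least-squares loss: each update
    uses the gradient of 0.5 * (x_i . w - y_i)^2 for one random example i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]   # per-example gradient
            w -= lr * grad                    # step against the gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.05 * rng.normal(size=200)
print(sgd_least_squares(X, y))   # roughly [1.5, -2.0]
```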
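
Kernels for non-linear boundaries: on the classic two-circles dataset no straight line separates the classes, so a linear SVM does poorly while an RBF-kernel SVM does well. Parameters are left at scikit-learn defaults.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))   # near chance
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))         # close to 1.0
print("support vectors per class:", rbf_svm.n_support_)
```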
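
Trees and ensembles: a single decision tree compared with a random forest and a gradient-boosted ensemble on the same train/test split; the dataset and default settings are just for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [("single tree", DecisionTreeClassifier(random_state=0)),
          ("random forest", RandomForestClassifier(random_state=0)),
          ("gradient boosting", GradientBoostingClassifier(random_state=0))]

for name, model in models:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # the ensembles usually beat the single tree
```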
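
K-Means with restarts: because the objective is non-convex, scikit-learn's n_init parameter reruns the algorithm from several random initialisations and keeps the lowest-inertia solution.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init=10 reruns K-Means from 10 random initialisations and keeps the lowest-inertia result.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("inertia (sum of squared distances to centres):", km.inertia_)
print("cluster centres:\n", km.cluster_centers_)
```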
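
Finally, PCA and the proportion of variance explained: fit PCA, look at the cumulative explained variance ratio, and keep the smallest number of components that explains roughly 99% of the variance.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional images of handwritten digits

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.99)) + 1   # smallest k explaining at least 99% of the variance
print(f"{k} of {X.shape[1]} components explain {cumulative[k - 1]:.2%} of the variance")

X_reduced = PCA(n_components=k).fit_transform(X)   # the lower-dimensional representation
print(X_reduced.shape)
```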

Check out the course on edX @ https://courses.edx.org/courses/course-v1:UCSanDiegoX+DSE220x+1T2020a/course/

My course certificate: https://courses.edx.org/certificates/51895d7acbfa4c41a66e55e9778496cb

That’s all! Thanks for reading!

I would like to take this opportunity to thank the course instructor, Sanjoy Dasgupta, for hosting this course on edX. Thank you!
