Data Science, Machine Learning, Deep Learning, and Artificial Intelligence are some of the most popular buzzwords in the analytics ecosystem. These technologies have existed for a long time, but the recent hype is due to the large volumes of structured and unstructured data now being generated and the massive computational capacity of modern computers.
So what is Machine Learning? Machine Learning is the study of training machines on historical data to build predictive models that generalize to unseen data. Many companies now adopt ML in their architecture to speed up workflows and automate tasks that previously needed repetitive human intervention. Several well-established algorithms are used to build such predictive models and solve classification, regression, or clustering problems.
Machine Learning algorithms can be broadly classified into:
- Supervised Machine Learning – The dataset used in Supervised Learning is labeled, which means that each row comes with a target variable. The model is trained on the labeled training set and then tested on unseen data. Linear Regression and Logistic Regression are examples of Supervised Machine Learning algorithms.
- Unsupervised Learning – Unlike in supervised learning, the dataset is unlabeled, and the algorithm has to find structure in it, for example by grouping data points based on their similarity. K-Means clustering groups data points into clusters, while Apriori mines association rules between items.
- Reinforcement Learning – A special type of Machine Learning where the model learns from its past actions: it is rewarded for every correct move and penalized for every wrong one. Google’s AlphaGo is an example of a Reinforcement Learning application.
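The supervised workflow described above (train on labeled rows, then test on data the model has not seen) can be sketched as follows. This is a minimal illustration, assuming the Iris dataset, a 25% hold-out split, and logistic regression as the model; none of these choices come from the article itself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled data: every row of X has a target value in y.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows to stand in for "unseen" data (illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train only on the training rows, then score on the held-out rows.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Accuracy on held-out data: {test_accuracy:.2f}")
```

Scoring on rows the model never saw during training gives a more honest estimate of how it will behave on new data than scoring on the training set.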
Now that we have covered the basics of machine learning, let's look at the essential Machine Learning algorithms in Python.
1. Linear Regression:
In a supervised learning problem, the target variable can be continuous or discrete. Linear regression is one of the first algorithms one should master; it models a linear relationship between the independent variables and a continuous dependent variable.
Univariate linear regression is represented by y = a + b*x, while multivariate linear regression is represented by y = a + b(1)x(1) + b(2)x(2) + ... + b(n)x(n), where a is the intercept and each b is a coefficient.
The goal in linear regression is to minimize a cost function, typically the mean squared error. One common technique is Gradient Descent, where the coefficients are updated after each iteration until they converge; because this cost function is convex, the minimum reached is the global one.
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1*x_0 + 2*x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
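scikit-learn's LinearRegression solves the least squares problem directly rather than iterating, so it does not show the gradient descent updates themselves. The sketch below is one way to write batch gradient descent on the mean squared error with NumPy. The function name `fit_linear_gd`, the learning rate, and the iteration count are assumptions chosen for this toy data, not values from the article.

```python
import numpy as np

# Toy data following y = 1*x_0 + 2*x_1 + 3
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0]) + 3.0


def fit_linear_gd(X, y, learning_rate=0.05, n_iterations=5000):
    """Fit y ~ X @ coef + intercept by batch gradient descent on the MSE."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    intercept = 0.0
    for _ in range(n_iterations):
        # Prediction error with the current parameters.
        residual = X @ coef + intercept - y
        # Gradients of mean((X @ coef + intercept - y) ** 2).
        grad_coef = (2.0 / n_samples) * (X.T @ residual)
        grad_intercept = 2.0 * residual.mean()
        # Step against the gradient.
        coef -= learning_rate * grad_coef
        intercept -= learning_rate * grad_intercept
    return coef, intercept


coef, intercept = fit_linear_gd(X, y)
print(coef, intercept)
```

The learning rate has to be small enough for the updates to stay stable; a value that works here may diverge on data with a different scale, which is one reason features are often standardized first.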
2. Logistic Regression:
Unlike Linear Regression, the target variable in Logistic Regression is discrete; it can be binary, multinomial, or ordinal. In a binary classification problem, the output is either 0/1, True/False, and so on. The activation function used here is the Sigmoid function, which maps the log of the odds in favor of the positive class to a probability between 0 and 1.
The Sigmoid function handles only two classes, so for multiclass problems its generalization, the softmax function, is used.
The ROC curve, usually summarized by the area under it (AUC), is often the go-to way to evaluate a binary classification model.
The Python code for logistic regression:
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> # With the lbfgs solver, recent scikit-learn versions fit a multinomial
>>> # (softmax) model for multiclass targets by default.
>>> clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...
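To make the relationship between log-odds, the sigmoid, and the softmax concrete, here is a small NumPy sketch. The helper names `sigmoid` and `softmax` and the input values are illustrative assumptions; this is not how scikit-learn implements them internally.

```python
import numpy as np


def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# The sigmoid inverts the log-odds: starting from p, we get p back.
p = 0.8
log_odds = np.log(p / (1 - p))
recovered = sigmoid(log_odds)
print(recovered)

# Softmax over three arbitrary class scores.
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())

# With two classes, softmax([z, 0]) gives the same value as sigmoid(z).
two_class = softmax(np.array([1.5, 0.0]))[0]
print(two_class, sigmoid(1.5))
```

The last comparison shows why softmax is described as a generalization of the sigmoid: the binary case is recovered when one of the two scores is fixed at zero.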
3. Support Vector Machines:
SVM is used in classification problems where a hyperplane separates two classes. The data points closest to the hyperplane, which determine its optimal position, are known as the support vectors; the hyperplane is chosen to maximize the margin between itself and these points.
It's important to tune parameters such as the kernel and gamma in SVM. Depending on the data, the kernel could be linear, polynomial, or RBF (the default in scikit-learn). Additionally, the regularization parameter C should be chosen carefully to avoid both overfitting and underfitting.
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC(gamma='scale')
>>> clf.fit(X, y)
SVC(...)
>>> clf.predict([[2., 2.]])
array([1])
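One common way to tune the kernel, gamma, and C together is a cross-validated grid search. The sketch below assumes the Iris dataset and a small, made-up parameter grid; the specific values are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical search grid; the values are illustrative only.
param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", "auto"],
}

# Try every combination with 5-fold cross-validation and keep the best one.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Because every combination is fitted once per fold, the cost grows quickly with the size of the grid, so a coarse grid followed by a finer one around the best values is often more practical.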
4. Naïve Bayes:
The Naïve Bayes algorithm is mainly used for classification problems. It is based on Bayes' Theorem, which gives the likelihood of an event occurring given that some condition is true.
The algorithm is called Naïve because it assumes that each feature is independent of the others given the class, and it treats all features as equally important in predicting the outcome.
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
>>> y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
>>> print("Number of mislabeled points out of a total %d points : %d"
...       % (iris.data.shape[0], (iris.target != y_pred).sum()))
Number of mislabeled points out of a total 150 points : 6
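The arithmetic behind Bayes' Theorem and the independence assumption can be worked through by hand. The sketch below uses a toy spam filter with made-up probabilities; the numbers, the words, and the `posterior` helper are all hypothetical and only serve to show the calculation.

```python
# Hypothetical probabilities for a toy spam filter (made-up numbers).
prior = {"spam": 0.2, "ham": 0.8}

# P(word appears | class), with words treated as independent given the class.
likelihood = {
    "spam": {"offer": 0.6, "free": 0.5},
    "ham": {"offer": 0.05, "free": 0.1},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    unnormalized = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # Independence lets us multiply the per-word likelihoods.
            score *= likelihood[label][word]
        unnormalized[label] = score
    # Divide by the total evidence so the posteriors sum to 1.
    evidence = sum(unnormalized.values())
    return {label: score / evidence for label, score in unnormalized.items()}


one_word = posterior(["offer"])
two_words = posterior(["offer", "free"])
print(one_word)
print(two_words)
```

Seeing "offer" alone gives P(spam) = 0.12 / 0.16 = 0.75, and adding "free" raises it to 0.06 / 0.064 = 0.9375, because each word's likelihood is simply multiplied in.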
5. K-Nearest Neighbors:
KNN, or K-Nearest Neighbors, predicts the value for each data point from its k nearest neighbors. For a two-class problem, k is usually chosen to be odd to avoid tied votes. For continuous output, the prediction is the mean of the nearest neighbors' values, while for discrete output it is the mode of their classes.
The Python code for KNN:
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[0.66666667 0.33333333]]
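The mode-versus-mean rule can also be written out directly. The following is a minimal from-scratch sketch, assuming Euclidean distance and a small one-dimensional toy dataset; the function `knn_predict` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "regression":
        # Continuous output: mean of the neighbors' values.
        return float(neighbor_targets.mean())
    # Discrete output: most common class among the neighbors.
    return Counter(neighbor_targets.tolist()).most_common(1)[0][0]


X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]
predicted_class = knn_predict(X_toy, y_toy, [1.1])
predicted_mean = knn_predict(X_toy, y_toy, [1.1], task="regression")
print(predicted_class, predicted_mean)
```

For the query 1.1 the three nearest points are 1, 2, and 0, with targets 0, 1, and 0, so the vote returns class 0 and the mean is 1/3.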
6. Decision Tree:
One of the simplest algorithms, the Decision Tree (often built with CART, Classification and Regression Trees) is interpretable and relatively robust to outliers; some implementations can also handle missing values. The root node is chosen based on the feature that carries the maximum information gain, and this process is repeated recursively in the child nodes.
Splitting stops when the tree has reached its maximum depth or all instances have been classified. Decision Trees are prone to overfitting, so it is necessary to set constraints on growth or to prune the tree.
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])
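Choosing the split that "carries the maximum information" can be illustrated with entropy and information gain. The sketch below uses two made-up binary features; the helper names are assumptions, and note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default rather than entropy.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(labels, left_mask):
    """Entropy reduction from splitting labels into left_mask / ~left_mask."""
    labels = np.asarray(labels)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = labels[left_mask], labels[~left_mask]
    # Child entropies weighted by the share of rows in each child.
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


labels = np.array([0, 0, 0, 1, 1, 1])
feature_a = np.array([1, 1, 1, 0, 0, 0])  # separates the classes perfectly
feature_b = np.array([1, 0, 1, 0, 1, 0])  # carries very little information

gain_a = information_gain(labels, feature_a == 1)
gain_b = information_gain(labels, feature_b == 1)
print(gain_a, gain_b)
```

A tree builder following this criterion would split on `feature_a` first, since it removes all of the uncertainty about the label while `feature_b` removes almost none.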
7. Random Forest:
Random Forest is a bagging model that reduces the variance of individual decision trees. The data is repeatedly sampled with replacement into many bootstrap datasets; the number of trees is a parameter. A Decision Tree is then trained on each sample, and the final output is either the mean of all the outputs (regression) or the most common class (classification).
Random Forest reduces overfitting, and its feature importances can be used for feature selection. However, it is less interpretable than a single decision tree.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(n_estimators=100, max_depth=2,
...                              random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(...)
>>> print(clf.feature_importances_)
[0.14205973 0.76664038 0.0282433  0.06305659]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
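The bagging idea behind Random Forest can be sketched by hand: draw bootstrap samples, fit one tree per sample, and combine the votes. This is a simplified illustration, assuming 25 trees, a depth limit of 3, and a hard majority vote; scikit-learn's RandomForestClassifier averages predicted class probabilities instead, so this is not its actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw rows with replacement, same size as the data.
    rows = rng.integers(0, len(X), size=len(X))
    # Each split also considers only a random subset of the features.
    tree = DecisionTreeClassifier(max_depth=3, max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[rows], y[rows]))

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5.
votes = np.stack([tree.predict(X) for tree in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
train_accuracy = (ensemble_pred == y).mean()
print(f"Training accuracy of the hand-built ensemble: {train_accuracy:.3f}")
```

Each tree sees a slightly different dataset and a random subset of features at every split, so their individual errors are partly uncorrelated and tend to cancel out in the vote.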
8. K-Means Clustering:
K-Means is an unsupervised learning algorithm that clusters the data into k groups such that the distance between points within a cluster is minimized and the distance between clusters is maximized.
The elbow method is used to choose the number of clusters: the within-cluster sum of squares is plotted against k, and the value where further increases in k stop giving a large reduction is chosen. Once k is defined, the centroids are initialized and adjusted repeatedly until every point is assigned to its closest centroid and the assignments stop changing.
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
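The elbow method can be tried out by fitting K-Means for several values of k and recording the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The sketch below assumes synthetic data with three well-separated groups, generated with hand-picked centers purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative assumption).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the inertia drops sharply up to k = 3 and only slightly afterwards, and that bend is the "elbow". On real data the bend is often less clear, so the method is a heuristic rather than a definitive answer.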
The field of Machine Learning in Python keeps advancing, and new techniques and algorithms regularly appear to simplify predictive modeling tasks. This article covered the intuition behind some of the basic ML algorithms and their implementations in Python.