Machine Learning Part-1

June 25, 2018 | 8 Minute Read

Machine Learning is cool because, for the first time, we can create systems that mimic human activities and intelligence. Just as children learn remarkably fast, machine learning is advancing at a rapid pace nowadays. Commercial interest in machine learning is aggressively driving innovation on many frontiers that solve real-world problems. This is just the beginning of artificial intelligence in general. The next 50 years will be exciting: humans may be enhanced with artificially intelligent chips, and machines will start to co-exist with us. We have a long way to go, but we must start somewhere.

Steps for using machine learning:

  1. Retrieve data.
  2. Clean data.
  3. Identify features or label data.
  4. Split the dataset into Train/Cross-Validation/Test sets.
  5. Train your algorithm using supervised or unsupervised learning on the training data.
  6. Adjust model parameters using cross-validation.
  7. Test on the test data.
  8. Measure the accuracy or error.
  9. Deploy.
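
As a rough sketch of steps 4-7 with scikit-learn (the iris dataset and the model choice here are placeholders purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 4: hold out 20% as the final test set, then carve a
# cross-validation set out of the remaining training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Step 5: train a supervised model on the training data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 6: tune parameters against the cross-validation set
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Step 7: report performance on the untouched test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))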

Useful References

  1. 41 Essential Machine Learning Interview Questions & Answers by Roger Huang.
  2. A great glossary from Analytics Vidhya; they have many great intro posts.
  3. A good intro write-up.
  4. Slides on ML from a course taught on edX by Columbia.

Supervised Learning

  • Linear Regression
  • Nonlinear Multivariate Regression
  • Logistic Regression
  • Decision Trees
  • Support Vector Machines (SVMs)
  • Random Forests
  • K-Nearest Neighbors (KNN)
  • Neural Networks

Unsupervised Learning

  • Clustering: K-means, Hierarchical Cluster Analysis (HCA)
  • Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)
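
A minimal unsupervised-learning sketch, assuming scikit-learn and using the iris data as a stand-in: reduce to 2 principal components with PCA, then cluster with K-means.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)

X_2d = PCA(n_components=2).fit_transform(X)  # dimensionality reduction: 4 features -> 2
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_2d)  # clustering
print("cluster sizes:", [list(labels).count(k) for k in range(3)])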


Ensemble Learning: combining multiple learners or models for better performance. Ensemble methods often outperform individual models by averaging out biases and reducing variance, and they are less likely to overfit.
There is a common line in machine learning: “ensemble and get 2%.” It implies that you can build your models as usual and typically expect a small performance boost from ensembling. More on this here.
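
A minimal sketch of the idea, assuming scikit-learn (the dataset and the three component models are arbitrary choices): a VotingClassifier averages out the individual models' mistakes by majority vote.

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier([
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('svc', SVC()),
])  # default 'hard' voting: the majority class across models wins
print("ensemble CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())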



Explain Bagging
(1) Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through resampling (bootstrap sampling with replacement).
(2) Then, each subset is used to train a model, and the final predictions are made by voting or averaging across the component models.
(+) Bagging is performed in parallel.
(+) Bagging reduces the variance of the meta-learning algorithm. Bagging can be applied to decision trees or other algorithms.
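
A minimal bagging sketch, assuming scikit-learn (the dataset and estimator count are arbitrary): 50 decision trees, each trained on a bootstrap resample, combined by voting.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
bagger = BaggingClassifier(DecisionTreeClassifier(),
                           n_estimators=50,   # 50 bootstrap subsets -> 50 trees
                           n_jobs=-1,         # the (+) point above: trains in parallel
                           random_state=0)
print("bagging CV accuracy:", cross_val_score(bagger, X, y, cv=5).mean())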

Application-specific learning algorithms:

  • Text, NLP
  • Image Processing
  • Audio Processing
  • Audio Transcription
  • Video
  • Recommendations
Here is a polyfit / SVR example, trained on the Jacobi relaxation data listed at the end of this post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html

#install sklearn 
#https://stackoverflow.com/questions/13212987/cannot-import-scikits-learn-even-though-it-seems-to-be-installed
#http://scikit-learn.org/stable/install.html
#https://stackoverflow.com/questions/11114225/installing-scipy-and-numpy-using-pip
#sudo pip install -t . numpy scipy scikit-learn
from sklearn.svm import SVR


data = pd.read_csv('laplace2d_tol.dat', delim_whitespace=True, header=None, skiprows=14, comment="#")
#sep="\t"  # separator if not using delim_whitespace; delim_whitespace is faster than a regex separator
#names=["a", "b", "c"]  # can be passed to read_csv instead of the 'columns' assignment below
data.columns = ["iterations", "error", "runtime"]


print "rows:",len(data)
#iteration of dataframe
for index, row in data.iterrows():
    if index ==0:
        print "first entry  iterations:",row['iterations'], "---runtime:",row['runtime']
col =  data['iterations'].values.flatten()
print "data flattened:\n  ", col

#option-1: split the data in {train, test} w/ sklearn
#from sklearn.model_selection import train_test_split
#train, test = train_test_split(df, test_size=0.2)

#option-2 split the data in {train, test} with pandas
#train_rand = data.sample(frac=0.5,random_state=200)
#random_state for randomization

#split without randomization
split_ratio = 0.5
train_size = int(len(data)*split_ratio)
train = data[:train_size]
col = train['iterations'].values.flatten()
print("train samples:\n", col)

test = data.drop(train.index)

print("total train samples =", len(train))
print("total test samples =", len(test))

turns = 8  # polynomial degree for np.polyfit
x_train = train['error'].values.flatten()
y_train = train['iterations'].values.flatten()
z = np.polyfit(x_train, y_train, turns)  # fit iterations as a degree-8 polynomial in error
p = np.poly1d(z)

test_value = 0.000073  # an error value from the test range
print("  p(", test_value, "): predicted=%.3f" % p(test_value))

x_test = test['error'].values.flatten()
y_test = test['iterations'].values.flatten()
y_predicted = p(x_test)


####################################################
##http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
#ALTERNATIVE learner: SVR (Support Vector Regression)

#fit regression model; sklearn expects a 2-D feature matrix, hence the reshape below
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)

#fit on the training errors, then predict iterations for the test errors
y_rbf = svr_rbf.fit(x_train.reshape(-1, 1), y_train).predict(x_test.reshape(-1, 1))
lw = 2


plt.plot(x_test, y_rbf, color='navy', lw=lw, label='RBF model')

plt.plot(x_train, y_train, 'x', label='train', color = "black",lw=lw)
plt.plot(x_test, y_test, '-', label='test-real', color="red",lw=lw )
plt.plot(x_test, y_predicted, '--', label='test-predicted', color="green", lw=lw)
plt.ylabel("iterations")
plt.xlabel("error")
plt.ylim(0,4000)
plt.grid(True)
plt.xlim(max(x_train), min(x_test))  # inverted x-axis: error decreases left to right
plt.legend(loc="upper center")
plt.show()


"""
#numpy polyfit
#https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html

#PROBLEM 1: predicting runtime from error
x = np.array([0.000302, 0.000254, 0.000201, 0.000151])
y = np.array([58.283531, 70.682653, 91.963334, 127.307981])
z = np.polyfit(x, y, 3)  # only 4 samples, so the degree must be at most 3
p = np.poly1d(z)
print("  p(0.000101): runtime actual=202.007 predicted=%.3f" % p(0.000101))


#PROBLEM 2: predicting number of loops (iterations) from error
x = np.array([0.000302, 0.000254, 0.000201, 0.000151])
y = np.array([800, 950, 1200, 1600])
z = np.polyfit(x, y, 3)
p = np.poly1d(z)
print("  p(0.000101): loops actual=2400 predicted=", p(0.000101))

#Plotting
xp = np.linspace(0.000050, 0.000302, 20)
_ = plt.plot(x, y, 'o', xp, p(xp), '-')
plt.ylim(800, 3000)  # iterations limit on y-axis
plt.show()
"""

Good references

  • Google Cloud Speech to Text API Python: https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/speech/cloud-client

Data (laplace2d_tol.dat): Jacobi relaxation calculation on a 2048 x 2048 mesh. Columns: iterations, error, runtime.
    0     0.250000  0.080103
    50    0.004751  3.792906
   100    0.002397  7.241144
   150    0.001601  10.689349
   200    0.001204  14.137605
   250    0.000964  17.586176
   300    0.000804  21.034838
   350    0.000689  24.483031
   400    0.000603  27.931385
   450    0.000537  31.379790
   500    0.000483  34.828075
   550    0.000439  38.461205
   600    0.000403  42.302963
   650    0.000372  46.218797
   700    0.000345  50.199788
   750    0.000322  54.226861
   800    0.000302  58.283531
   850    0.000284  62.380708
   900    0.000269  66.513298
   950    0.000254  70.682653
  1000    0.000242  74.883781
  1050    0.000230  79.113273
  1100    0.000220  83.370917
  1150    0.000210  87.651342
  1200    0.000201  91.963334
  1250    0.000193  96.300035
  1300    0.000186  100.660911
  1350    0.000179  105.045946
  1400    0.000173  109.454092
  1450    0.000167  113.884365
  1500    0.000161  118.337551
  1550    0.000156  122.811895
  1600    0.000151  127.307981
  1650    0.000147  131.824508
  1700    0.000142  136.359728
  1750    0.000138  140.916791
  1800    0.000134  145.489694
  1850    0.000131  150.082835
  1900    0.000127  154.705981
  1950    0.000124  159.384847
  2000    0.000121  164.102636
  2050    0.000118  168.786614
  2100    0.000115  173.480620
  2150    0.000112  178.193624
  2200    0.000110  182.925185
  2250    0.000107  187.672891
  2300    0.000105  192.434249
  2350    0.000103  197.213852
  2400    0.000101  202.007015
  2450    0.000099  206.817127
  2500    0.000097  211.642450
  2550    0.000095  216.483002
  2600    0.000093  221.339118
  2650    0.000091  226.206693
  2700    0.000090  231.086084
  2750    0.000088  235.979211
  2800    0.000086  240.890114
  2850    0.000085  245.819259
  2900    0.000083  250.761324
  2950    0.000082  255.716833
  3000    0.000081  260.684375
  3050    0.000079  265.667891
  3100    0.000078  270.660522
  3150    0.000077  275.670072
  3200    0.000076  280.697543
  3250    0.000074  285.746627
  3300    0.000073  290.809013