Data Driven Modeling


PhD seminar series at Chair for Computer Aided Architectural Design (CAAD), ETH Zurich

Vahid Moosavi


Fourth Session


11 October 2016

Topics to be discussed

  • Density Learning
  • Meta-Parameters in Machine Learning
  • Regression Problem
  • Least Squares Method
    • Linear Regression
    • Polynomial Regression
    • Local Polynomial Regression
In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
pd.__version__
import sys
from scipy import stats
%matplotlib inline
from bokeh.plotting import figure, Figure, output_file
from bokeh.io import output_notebook, show, push_notebook
from bokeh.layouts import column, gridplot, widgetbox
from bokeh.models import (
    CustomJS,
    ColumnDataSource,
    Slider,
    TextInput,
    TapTool,
    HoverTool,
    LinearColorMapper,
    BasicTicker,
    PrintfTickFormatter,
    ColorBar,
)
output_notebook()

Density Learning

Or "Non-parametric Density Models"

In [2]:
plt.subplot(2,1,1)
X1 = np.random.normal(loc=0,scale=1,size=1000)
plt.hist(X1,bins=100);
plt.title('A well shaped distribution: A Gaussian Distribution?')
plt.subplot(2,1,2)
X1 = np.concatenate((np.random.normal(loc=0,scale=1,size=1000) , np.random.normal(loc=10,scale=2,size=1000),np.random.normal(loc=20,scale=1,size=200)))
plt.title('Not a well shaped distribution: A mixture of three Gaussian Distributions?!')
plt.hist(X1,bins=100);
plt.tight_layout()

Now the question is: can we learn these (not so well shaped) distributions?

Histograms

The first, and most data driven, density model is the histogram of the data itself!

  • We just need to select a bandwidth (or the number of bins) to discretize the observed data into groups
  • Then, we count the number of observations within each group
  • And then, we have it! (A minimal sketch of these steps follows below.)
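As a quick illustration of these three steps, here is a minimal sketch using np.histogram directly (the 30 bins are an arbitrary illustrative choice):

import numpy as np

X = np.random.normal(loc=0, scale=1, size=1000)

# step 1 + 2: discretize the range into bins and count observations per bin
counts, edges = np.histogram(X, bins=30)
bin_width = edges[1] - edges[0]                  # the implied "bandwidth" of the histogram
density = counts / (counts.sum() * bin_width)    # normalize so the bars integrate to one

print(density.sum() * bin_width)                 # ~1.0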
In [3]:
# X1 = np.random.randint(-1,2,size=400)
X1 = np.concatenate((np.random.normal(loc=0,scale=1,size=1000) , np.random.normal(loc=10,scale=2,size=1000),np.random.normal(loc=20,scale=1,size=200)))
# X1 = np.random.normal(loc=0,scale=1,size=200)

X1 = np.asarray(X1)
a = plt.hist(X1,bins=100);
# a[0] has the counts
print(a[0])
# a[1] has the bin edges
print(a[1])
print(np.sum(a[0]))
[ Output abridged: a[0] holds the 100 bin counts, a[1] holds the corresponding bin edges (spanning roughly -2.8 to 22.8), and np.sum(a[0]) is 2200.0, the total number of observations. ]
In [4]:
plt.plot(a[1][:-1],a[0],'.-');

But this looks noisy!

Now if we want to do some smoothing (denoising), we are in the field of Kernel Density Estimation

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where K(•) is the kernel — a non-negative function that integrates to one and has mean zero — and h > 0 is a smoothing parameter called the bandwidth.

  • What happens here is that we place a small distribution (a kernel) around each of the observed points and then sum them over the whole range
  • These kernels can be any of the parametric distributions we discussed before (a minimal vectorized sketch of the formula follows this list; a step-by-step version is implemented afterwards)
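Here is a minimal vectorized sketch of the formula above (an assumed illustration, separate from the step-by-step implementation in the next cells; the 500-point grid and h=1.0 are arbitrary choices):

import numpy as np

def kde_gaussian(x_grid, data, h):
    # pairwise standardized distances, shape (len(x_grid), len(data))
    u = (x_grid[:, np.newaxis] - data[np.newaxis, :]) / h
    # standard normal kernel K(u)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    # f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h)
    return K.sum(axis=1) / (h * len(data))

data = np.array([-2, -1, 0, 2, 5, 6, -2, 0], dtype=float)
x_grid = np.linspace(data.min() - 3, data.max() + 3, 500)
f_hat = kde_gaussian(x_grid, data, h=1.0)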
In [5]:
# A basic example with few data points

X1 = [-2,-1,0,2,5,6,-2,0]
X1 = np.asarray(X1)
a = plt.hist(X1,bins=20);
In [6]:
def GaussianKernel(u):
    # here we assume a "standard normal distribution" with mean zero and variance 1
    return np.exp(-1*np.power(u, 2)/2)/(np.sqrt(2*np.pi))


def Kde_Gaussian(h=3):
    mn = np.min(X1)-3
    mx = np.max(X1)+3
    
    # h is the Bandwidth
    
    
    # range of the data
    xrng =  np.linspace(mn,mx,num=500)
    
    counts = np.zeros((1,500))
    
    
    # this (commented out) block just plots the histogram of the data
#     a = plt.hist(X1,bins=len(X1),normed=True);
#     plt.close()
#     for i,d in enumerate(a[0]):
#         plt.plot([a[1][i],a[1][i]],[0,d],'-or',markersize=10);

    # Now we go through all the data points one by one.
    # Assume each observed point is the center of a Gaussian distribution with standard deviation equal to the bandwidth
    for xi in X1:
        local_counts = [GaussianKernel((x-xi)/h) for x in xrng]
        # (x-xi)/h is to transform the data to "standard normal distribution"!
        
        counts = counts + np.asarray(local_counts)/(h*len(X1))
        
        # this is to normalize the counts; look at the formula above
        plt.plot(xrng,np.asarray(local_counts)/(h*len(X1)),'r');
    
    plt.plot(xrng,counts.T,linewidth=4);
    
    
In [7]:
Kde_Gaussian(1)
In [8]:
# #ipywidgets
# from ipywidgets import interact, HTML, FloatSlider
# interact(Kde_Gaussian,h=(.1,5,.1));
In [9]:
mn = np.min(X1)-3
mx = np.max(X1)+3
# range of the data
xrng =  np.linspace(mn,mx,num=500).T[:,np.newaxis]
counts = np.zeros((1,500)).T
# local_counts = np.zeros((1,500)).T

# unfortunately, a Bokeh ColumnDataSource needs all columns to have the same length, so we pad the observations
observations =  np.zeros((1,500)).T
observations[:len(X1),0] = X1.T
dlen = np.asarray([len(X1) for i in range(len(xrng))])[:,np.newaxis]

df = pd.DataFrame(data=np.concatenate((xrng,counts,observations,dlen),axis=1),columns=['xrng','counts','observations','dlen'])

for i in range(len(X1)):
    df['local_counts'+str(i)] = np.zeros((1,500)).T
h= 1
for i in range(dlen[0]):
    local_counts = df['local_counts'+str(i)].values[:]
    for j in range(len(xrng)):
        u = (xrng[j]-observations[i])/h    
        local_counts[j] = np.exp(-1*np.power(u, 2)/2)/(np.sqrt(2*np.pi))/(h*dlen[0])
        counts[j] = counts[j] + local_counts[j]

    df['local_counts'+str(i)] = local_counts
df['counts'] = counts


    
source = ColumnDataSource(data=df)

p = figure(plot_width=600, plot_height=400,title="Kernel Density Estimation")
p.circle(X1, np.zeros(len(X1)), size=10, line_color="navy", fill_color="red", fill_alpha=0.99,legend="observations")

for i in range(len(X1)):
    p.line('xrng', 'local_counts'+str(i),alpha=0.6, line_width=1 ,source=source, line_color="red",legend="kernels")

p.line('xrng','counts',line_width=4,source=source, line_color="navy",legend="KDE");


def callback1(source=source, window=None):
    data = source.data
    h = cb_obj.value
    
    dlen = data['dlen']
    xrng = data['xrng']
    counts = data['counts']
    for i in range(len(counts)):
        counts[i] = 0
#     
    observations =data['observations']
    for i in range(dlen[0]):
        local_counts = data['local_counts'+str(i)]
#         window.console.log(local_counts)
        for j in range(len(xrng)):
            u = (xrng[j]-observations[i])/h
            
            
            local_counts[j] = window.Math.exp(-1*window.Math.pow(u, 2)/2)/(window.Math.sqrt(2*window.Math.PI))/(h*dlen[0])
            counts[j] = counts[j] + local_counts[j]
    source.change.emit()
    
    
slider1 = Slider(start=.0001, end=8, value=1, step=.05, title="bandwidth",
                callback=CustomJS.from_py_func(callback1))


slider1.callback.update()

layout = column(slider1, p)

show(layout)

So, now what is the optimum bandwidth in the above density learning problem?

  • minimizing some error function with respect to the bandwidth
  • rule-of-thumb methods (a sketch follows this list)
  • cross validation ---> testing the learned density against new data sets
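For the rule-of-thumb option, a common choice for a Gaussian kernel in one dimension is Silverman's rule, h = 1.06 * sigma_hat * n**(-1/5). A minimal sketch (assuming the data are roughly Gaussian, which is exactly what this rule presumes):

import numpy as np

def silverman_bandwidth(data):
    # Silverman's rule of thumb for a Gaussian kernel in 1D
    n = len(data)
    sigma_hat = np.std(data, ddof=1)
    return 1.06 * sigma_hat * n ** (-1.0 / 5)

X1 = np.array([-2, -1, 0, 2, 5, 6, -2, 0], dtype=float)
print(silverman_bandwidth(X1))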

A note on the issue of meta-parameters in Data Driven Modeling

The bandwidth value is usually called a "meta-parameter" or "hyper-parameter". It is important to know that in all data driven modeling techniques we will always have these types of parameters, and we need to find their best values in a "meta-optimization" problem.

Usually, we take one of the following approaches, separately or in combination:

  • Exhaustive search
  • Random grid search
  • Cross validation (a sketch combining grid search and cross validation follows this list)
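As a sketch of combining grid search with cross validation for the bandwidth, scikit-learn's KernelDensity can be wrapped in GridSearchCV (the candidate grid below is an arbitrary illustrative choice, and a recent scikit-learn is assumed):

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# the same kind of three-component mixture as above
X = np.concatenate((np.random.normal(0, 1, 1000),
                    np.random.normal(10, 2, 1000),
                    np.random.normal(20, 1, 200)))[:, np.newaxis]

# 5-fold cross validation over a grid of candidate bandwidths,
# scored by the held-out log-likelihood of the fitted KDE
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 3.0, 30)},
                    cv=5)
grid.fit(X)
print(grid.best_params_)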

Density Learning in Practice

In practice, however, there are well-optimized libraries for these kinds of tasks, e.g.:

  • Scipy
  • Scikit learn
  • PyQt-Fit
In [10]:
# A two dimensional example
fig = plt.figure()
d0 = 1.6*np.random.randn(1000,2)
d0[:,0]= d0[:,0] - 3
plt.plot(d0[:,0],d0[:,1],'.b')

d1 = .6*np.random.randn(1000,2)+7.6
plt.plot(d1[:,0],d1[:,1],'.b')

d2 = .8*np.random.randn(1000,2)
d2[:,0]= d2[:,0] + 5
d2[:,1]= d2[:,1] + 1
plt.plot(d2[:,0],d2[:,1],'.b')

d3 = .8*np.random.randn(1000,2)
d3[:,0]= d3[:,0] - 5
d3[:,1]= d3[:,1] + 4
plt.plot(d3[:,0],d3[:,1],'.b')


d4 = 1.8*np.random.randn(1000,2)
d4[:,0]= d4[:,0] - 15
d4[:,1]= d4[:,1] + 14
plt.plot(d4[:,0],d4[:,1],'.b')


d5 = 2.8*np.random.randn(1000,2)
d5[:,0]= d5[:,0] + 25
d5[:,1]= d5[:,1] + 14
plt.plot(d5[:,0],d5[:,1],'.b')

fig.set_size_inches(5,5)
Data = np.concatenate((d0,d1,d2,d3,d4,d5))



# from scipy import stats
# def measure(n):
#     "Measurement model, return two coupled measurements."
#     m1 = np.random.normal(size=n)
#     m2 = np.random.normal(scale=0.5, size=n)
#     return m1+m2, m1-m2

# m1, m2 = measure(2000)
m1,m2 = Data[:,0],Data[:,1]

xmin = Data[:,0].min()
xmax = Data[:,0].max()
ymin = Data[:,1].min()
ymax = Data[:,1].max()
Xg, Yg = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
In [11]:
val_grids = Xg
# values = np.vstack([m1, m2])
# values = Data.T
kernel = stats.gaussian_kde(Data[:,0])
Z = kernel(val_grids.ravel())


import matplotlib.pyplot as plt

fig, ax = plt.subplots()
a = ax.hist(Data[:,0],bins=300,normed=True);


ax.plot(val_grids.ravel(),Z,linewidth=5);
ax.plot(a[1][:-1],a[0])
plt.show()
In [12]:
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
# values = np.vstack([m1, m2])
values = Data.T
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()

ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,)
plt.contour(np.rot90(Z))
# ax.plot(m1, m2, 'k.', markersize=2)
# ax.set_xlim([xmin, xmax])
# ax.set_ylim([ymin, ymax])
fig.set_size_inches(7,7)
plt.show()

The importance of the learned densities

  • Resampling: to generate data without observations
  • Clustering and pattern recognition and its relation to "Manifold Learning"
    • Here we know the likelihood of the high dimensional patterns in comparison to each other
    • Here, in a way we are learning the joint probability distributions
    • In Machine Learning terms, this is an unsupervised learning problem
  • Prediction: using the learned density as a prediction/transfer function by conditioning the outcomes on a subset of dimensions
    • In this case, we are interested in the likelihood of certain variables (dependent variables) conditioned on other variables (independent variables)
    • In this sense, we can derive "supervised learning" from "unsupervised learning" (see the sketch after this list)
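As a sketch of this conditioning idea with the 2D density learned above (the fixed value x0 = 5.0 and the 200-point y grid are illustrative choices): evaluate the joint KDE along a slice x = x0, normalize it to get p(y | x = x0), and use its mean as a point prediction.

import numpy as np

x0 = 5.0                                  # value of the independent variable
y_grid = np.linspace(ymin, ymax, 200)     # candidate values of the dependent variable

# joint density p(x, y) evaluated along the slice x = x0
# ('kernel' is the 2D gaussian_kde fitted on Data above)
p_joint = kernel(np.vstack([np.full_like(y_grid, x0), y_grid]))

# normalize the slice to a conditional density p(y | x = x0)
p_cond = p_joint / np.trapz(p_joint, y_grid)

# a point prediction: the conditional mean E[y | x = x0]
y_hat = np.trapz(y_grid * p_cond, y_grid)
print(y_hat)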

A similar Approach

  • Gaussian Mixture Model: learn the distribution of the data as a weighted sum of several globally defined Gaussian distributions:

$$g(X) = \sum_{i = 1}^{k} p_i \, g(X,\theta_i)$$

where $g(X,\theta_i)$ is a known parametric distribution (e.g. Gaussian) and $p_i$ is the share (mixture weight) of each component. (A minimal fitting sketch follows.)
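A minimal fitting sketch with scikit-learn (a recent version providing sklearn.mixture.GaussianMixture is assumed; n_components=3 matches the three-Gaussian example earlier but is otherwise a modelling choice):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.concatenate((np.random.normal(0, 1, 1000),
                    np.random.normal(10, 2, 1000),
                    np.random.normal(20, 1, 200)))[:, np.newaxis]

gmm = GaussianMixture(n_components=3).fit(X)
print(gmm.weights_)        # the p_i in the formula above
print(gmm.means_.ravel())  # the centers of the k Gaussians

# the learned mixture density can also be used for resampling
samples, labels = gmm.sample(100)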

Further readings:

An alternative approach from ML (Hopefully the next session):

Manifold Learning

  • Self Organizing Maps


Prediction, using the learned densities

An example of prediction with the learned density in the previous example

In [19]:
fig = plt.figure(figsize=(10,7))
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
cm = plt.cm.get_cmap('RdYlBu_r')
positions = np.vstack([X.ravel(), Y.ravel()])
# values = np.vstack([m1, m2])
values = Data.T
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
zmn = np.min(Z)
zmx = np.max(Z)
rng = zmx-zmn
Z = Z.ravel()
X = X.ravel()
Y = Y.ravel()
# for i in range(len(X)):
var = (Z-zmn)/float(rng)
# sc = plt.scatter(X, Y, s=10, marker=u'o', facecolor=plt.cm.RdYlBu_r(var), vmin=zmn, vmax=zmx, edgecolor='None')
sc = plt.scatter(X,Y,c = Z,s=20,vmin=zmn,marker='o',edgecolor='None', vmax=zmx, cmap=cm ,alpha=1);

plt.colorbar(sc,ticks=np.round(np.linspace(zmn,zmx,5),decimals=3),shrink=0.6);

sc = plt.scatter(X[3000:3100],Y[3000:3100],c = Z[3000:3100],s=20,vmin=zmn,marker='o',edgecolor='red', vmax=zmx, cmap=cm ,alpha=1);
plt.xlim(xmin,xmax);
plt.ylim(ymin,ymax);
plt.xlabel('X');
plt.ylabel('Y');