Data Driven Modeling


PhD seminar series at Chair for Computer Aided Architectural Design (CAAD), ETH Zurich

Vahid Moosavi


Ninth Session


29 November 2016

Markov Chains

Introduced by Andrei Markov in 1906

His original work on the sequence of charachters in language

One of the earilest data driven models of the langugae

Nevertheless, he didn't succeed as his model is data and computation intensive.

Later it was used extensively to study dynamic (stochastic) systems.

Recently, it was used as data driven representation approach.

Therefore, we discuss Markov Chains from the following aspects:

  • From the point of view of dynamical systems
  • From the point of view of object representation
  • Properties and applications
  • Extensions to machine learning applications

Topics to be discussed

  • Text generation with Markov Chains
  • Markov Chains from the point of view of relational representation
  • Neuro-probablistic Models of the language
  • Natural Language Modeling problems
In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
pd.__version__
import sys
from scipy import stats
import time
import  pysparse
from scipy.linalg import norm
import sompylib.sompy as SOM

%matplotlib inline

Generative examples of Markov

Example of sequence of characters in English texts

In [2]:
with open('./Data/IMDB_data/pos.txt','r') as infile:
    reviews = infile.readlines()
len(reviews)
Out[2]:
25000
In [3]:
reviews[0]
Out[3]:
'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!\n'
In [4]:
def cleanText(corpus):
    import string
    validchars = string.ascii_letters + string.digits + ' '
    punctuation = """.,:;?!@(){}[]$1234567890"""
    corpus = [z.lower().replace('\n','') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    
    for c in punctuation:
        corpus =[z.replace(c, '') for z in corpus]
    

    corpus = [''.join(ch for ch in z if ch in validchars) for z in corpus]
    
    #treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' %s '%c) for z in corpus]
#     corpus = [z.split() for z in corpus]
    corpus = [z.replace(' ', '_') for z in corpus]
    return corpus
In [5]:
texts  = cleanText(reviews)
In [6]:
texts[0]
Out[6]:
'bromwell_high_is_a_cartoon_comedy_it_ran_at_the_same_time_as_some_other_programs_about_school_life_such_as_teachers_my__years_in_the_teaching_profession_lead_me_to_believe_that_bromwell_highs_satire_is_much_closer_to_reality_than_is_teachers_the_scramble_to_survive_financially_the_insightful_students_who_can_see_right_through_their_pathetic_teachers_pomp_the_pettiness_of_the_whole_situation_all_remind_me_of_the_schools_i_knew_and_their_students_when_i_saw_the_episode_in_which_a_student_repeatedly_tried_to_burn_down_the_school_i_immediately_recalled__at__high_a_classic_line_inspector_im_here_to_sack_one_of_your_teachers_student_welcome_to_bromwell_high_i_expect_that_many_adults_of_my_age_think_that_bromwell_high_is_far_fetched_what_a_pity_that_it_isnt'
In [7]:
all_chars = '_abcdefghijklmnopqrstuvwxyz'
dictionary = {}
for i in range(len(all_chars)):
    dictionary[all_chars[i]] = i
dictionary
Out[7]:
{'_': 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26}
In [8]:
# building data with the format of sequence
data = []
for text in texts[:]:
    d = []
    for c in text:
        d.append(dictionary[c])
    data.append(d)
print len(data)
25000
In [9]:
def buildTM_from_sequential_data(data,states,irreducible=True):
    # each row is a sequence of observation
    n = len(states)
    M = np.zeros((n,n))
    for d in data:
        for k in range(1,len(d)):
            i = d[k-1]
            
            j = d[k]
            M[i,j]= M[i,j] + 1
    
    eps = .001
    for i in range(M.shape[0]):
        s= sum(M[i])
        
        if s==0:
            if irreducible==True:
                M[i]=eps
                M[i,i]=1.
                s= sum(M[i])
                M[i]=np.divide(M[i],s)
            else:
                M[i,i]=1.
        else:
            M[i]=np.divide(M[i],s)    
    return M


# Power iteration Method
def simulate_markov(TM,verbose='on'):
    e1 = time.time()
    states_n = TM.shape[0]
    pi = np.ones(states_n);  pi1 = np.zeros(states_n);
    pi = np.random.rand(states_n)
   
    pi = pi/pi.sum()
    n = norm(pi - pi1); i = 0;
    diff = []
    while n > 1e-6 and i <1*1e4 :
        pi1 = TM.T.dot(pi).copy()
        n = norm(pi - pi1); i += 1
        diff.append(n)
        pi = pi1.copy()
    if verbose=='on':
        print "Iterating {} times in {}".format(i, time.time() - e1)
    
    mixing_ = i
    return pi1,mixing_
In [10]:
states = np.unique(dictionary.values())
M_char = buildTM_from_sequential_data(data,states,irreducible=True)
In [11]:
chars = np.asarray([c for c in all_chars])

steady state probabilites : equal to the frequencies of characters

In [12]:
pi,mixing_ = simulate_markov(M_char,verbose='on')
plt.plot(pi);
plt.xticks(range(27),chars);
plt.grid()
Iterating 13 times in 0.000757932662964

Markov chain as a generative model?!!

What is the next probable characters for a give charachter

In [13]:
# To see if we can generate something
n_state = M_char.shape[0]
ind_initial = np.random.randint(0,n_state,size=1)
print chars[ind_initial[0]]
ind = ind_initial[0]
for i in range(20):

    
    # If we take the most likely next chars, it quickly falls in a loop?!!
    ind = np.argmax(M_char[ind])
    
    
    # If we take the next char based on a random choice based on the probabilites 
#     ind = np.random.choice(range(M_char.shape[0]),size=1,p=M_char[ind])[0]
    
    print chars[ind]
e
_
t
h
e
_
t
h
e
_
t
h
e
_
t
h
e
_
t
h
e
In [14]:
# To see if we can generate something
n_state = M_char.shape[0]
ind_initial = np.random.randint(0,n_state,size=1)
print chars[ind_initial[0]]
ind = ind_initial[0]
for i in range(20):

    
#     If we take the most likely next chars, it quickly falls in a loop?!!
#     ind = np.argmax(M_char[ind])
    
    
    # If we take the next char based on a random choice based on the probabilites 
    ind = np.random.choice(range(M_char.shape[0]),size=1,p=M_char[ind])[0]
    
    print chars[ind]
s
_
a
r
o
n
o
n
s
_
w
a
t
o
_
e
r
e
_
w
a

at the char level, it is highly unlikely to expect something interesting

World level Markov chain

with more depth: higher order Markov chains

In [16]:
# codes from https://github.com/codebox/markov-text
import sys

sys.path.insert(0, './markovtext')

from db import Db
from gen import Generator
from parse import Parser
from sql import Sql
from rnd import Rnd
import sys
import sqlite3
import codecs





SENTENCE_SEPARATOR = '.'
WORD_SEPARATOR = ' '

args = ['','gen','IMDB2','2']

if (len(args) < 3):
	raise ValueError(usage)
mode  = 'gen'
name  = './markovtext/IMDB_N2'
count = 4


if mode == 'parse':
    
    depth = 2
    file_name = './Data/IMDB_data/pos.txt'

    db = Db(sqlite3.connect(name + '.db'), Sql())
    db.setup(depth)

    txt = codecs.open(file_name, 'r', 'utf-8').read()
    Parser(name, db, SENTENCE_SEPARATOR, WORD_SEPARATOR).parse(txt)

elif mode == 'gen':    
    db = Db(sqlite3.connect(name + '.db'), Sql())
    generator = Generator(name, db, Rnd())
    for i in range(0, count):
        print "{}\n".format(i)
        print generator.generate(WORD_SEPARATOR)
        

else:
	raise ValueError(usage)
0

Ms
1


2

I could be a bad thing? Never Again: It has done out what was still want to changed, or surprises like treasha! Congrats to leave from the grasp some extent
3

My own "Western" (i like a vivid matte how timeless "The Amanda Bynes seems to ensure the Cameron said that most TV I love story of film-making team is my head has hope the film to modify the end we love it looks nice to create a quest for Catwoman (the delicious Paris with a very extensively, it had directed by everyone

As it can be seen it is far from reasonable

In fact, the problem of "generative models" is still an open question, unlike "discriminative models"

So, what can we do with this relational representation?

Representation Learning

Representation of objects based on a priori given features

Set theoretical defnition of universals --- > Abstract Universals

Representation of objects based on its context: Relational Representation

Category theoretical defnition of universals --- > Concrete Universals

This is aslso called Distributional Semantic Models with references to (de Saussure, 1966; Harris, 1951 and Wittgenstein, 1963; Firth, 1957)

sources from https://www.inf.uni-hamburg.de/en/inst/ab/lt/publications/cogalex-invited-biemann.pdf

Now when we look at the original idea of Markov in dealing with stochastic systems, it is pure relational

This is the case in many applications:

  • pixels in an image and their neighboring cells
  • A house and its neigborhood
  • A person and his friends
  • an ingridient in a food recepie
  • ...

co-occurance matrix

Now if we take each row as the representation of each char, we can assume that we have a proper representation.

In fact, we have a matrix form for our data, where each object (e.g. char or a word) is being represented based on its normalized co-occurance matrix

Char markov chain

In [17]:
# For each char
ind_initial = np.random.randint(0,n_state,size=1)[0]

print 'the selected char: {}'.format(chars[ind_initial])
plt.plot(range(M_char.shape[0]),M_char[ind_initial],'.-');

plt.xticks(range(M_char.shape[0]),chars);
plt.grid();
the selected char: i

Let's train a SOM with this matrix and see how it works

In [18]:
import sompylib.sompy as SOM

msz11 =20
msz10 = 20

X = M_char

som_char = SOM.SOM('', X, mapsize = [msz10, msz11],norm_method = 'var',initmethod='pca')
# som1 = SOM1.SOM('', X, mapsize = [msz10, msz11],norm_method = 'var',initmethod='pca')
som_char.init_map()
som_char.train(n_job = 1, shared_memory = 'no',verbose='final')
codebook_char = som_char.codebook[:]
codebook_char_n = SOM.denormalize_by(som_char.data_raw, codebook_char, n_method = 'var')
Total time elapsed: 2.790000 secodns
final quantization error: 0.003545
In [19]:
# we projects all the vectors in SOM and visualize it 
xy = som_char.ind_to_xy(som_char.project_data(X))
xy
Out[19]:
array([[ 19,  13, 393],
       [ 17,   4, 344],
       [  5,   0, 100],
       [  0,  19,  19],
       [  9,   7, 187],
       [ 19,  19, 399],
       [  5,   9, 109],
       [  6,  16, 136],
       [  0,   9,   9],
       [ 19,   8, 388],
       [  4,  13,  93],
       [  8,   4, 164],
       [ 14,   0, 280],
       [  4,   4,  84],
       [ 14,  14, 294],
       [ 19,   0, 380],
       [  9,   0, 180],
       [ 10,  14, 214],
       [ 12,   4, 244],
       [  9,  19, 199],
       [  4,  19,  99],
       [ 15,   8, 308],
       [  0,   5,   5],
       [  0,  14,  14],
       [ 14,  19, 299],
       [ 11,  10, 230],
       [  0,   0,   0]])
In [20]:
ax = plt.subplot(1,1,1)
for i in range(len(X)):
    plt.annotate(chars[i], (xy[i,1],xy[i,0]),size=20, va="center")
    plt.xlim((0,som_char.mapsize[0]))
    plt.ylim((0,som_char.mapsize[0]))
plt.xticks([])
plt.yticks([])
Out[20]:
([], <a list of 0 Text yticklabel objects>)

In the char model, we can't make sure if it makes sense, maybe at the world level, it works

Nevertheless, we only represent each chars based on its relation with the next possible chars

And this means to loose lots of valuable data

At the word level, defnitely we get better results, but..

We will have "state space explosion" with few amount of texts

Therefore, we need to perform some types of dimensionality reduction.

For example, to use PCA or similar methods (LSA,LDA,SVD,...) to reduce the dimensionality.

Problems: Scalability and speed, memory,...!

Neural embeddings!


A game changer idea: do not build the co-occurence matrix explicitly!

Instead learn a classifier that predicts a word given a context around the word or the other way around.

(Bengio et al. 2003) and Mikolov et al., 2013 known as Word2vec

In word2Vec there are two main models

CBOW

Skipgram

  • We have v unique words
  • each word is a one-hot v dimensional vector
  • we have Two matrice: W1(vxn) ,W2(nxv)
  • n is the dimension we choose (50-1000)
  • read arrows as dot product
  • there is an objective function that implies: given a word the network should predict the contexts and vice versa
  • if one can write a parametric objective function, there are usually severalmethods to find an optimum value for it
  • In majorty of machine learning methods they use Stochastic Gradient Descent (SGD) along chain rule (we discuss them in detail in one session
  • After training the W1 has amazing features

Details about training: http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf

In [21]:
## Just a hint to Gradient Descent
In [22]:
x_old = 0
x_new = 9
eps = .001
precision = .00001

# y = x^2
def f(x):
    return np.power(x,4) -3*np.power(x,3)  + 2

def f_deriv(x):
    return 4*np.power(x,3) - 9*x



counter = 0
while abs(x_old-x_new)>precision:
    x_old = x_new
    x_new = x_old - eps*f_deriv(x_old)
    plt.plot(x_new,f(x_new),'or')
    counter = counter + 1
print x_new 

for x in np.linspace(-10,10,100):
    plt.plot(x,f(x),'.b',markersize=1)
1.50053826867

in principle, in most of machine learning methods, we have a loss function which is differetiable to all the parameters.

Therefore, using the training data we gradually update the parameters toward a direction that minimizes the loss function

we will discuss it in more details later on neural networks

For now let's go back to Word2vec

In [23]:
#### We use a beautiful library called, gensim
import gensim
from gensim import corpora, models, similarities
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
In [24]:
# from gensim.models import word2vec
# # get the pretrained vector from https://code.google.com/archive/p/word2vec/
# Google_w2v = word2vec.Word2Vec.load_word2vec_format('/Users/SVM/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
In [25]:
# print Google_w2v.most_similar(['girl', 'father'], ['boy'], topn=1)
# print Google_w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

What can be done with Word2vec?

In general the learned densed vectors can be used in any other tasks that require fixed lenght vectors.

Some of interesting applications:

Movie Reviews Sentiment Analysis

In [26]:
%matplotlib inline
from sklearn.cross_validation import train_test_split
from gensim.models.word2vec import Word2Vec
import numpy as np
with open('/All_Files/Files/Data/gensim/sample_Data/IMDB_data/pos.txt', 'r') as infile:
    pos_tweets = infile.readlines()

with open('/All_Files/Files/Data/gensim/sample_Data/IMDB_data/neg.txt', 'r') as infile:
    neg_tweets = infile.readlines()
    
with open('/All_Files/Files/Data/gensim/sample_Data/IMDB_data/unsup.txt','r') as infile:
    unsup_reviews = infile.readlines()

#use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_tweets)), np.zeros(len(neg_tweets))))

x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_tweets, neg_tweets)), y, test_size=0.5)

#Do some very minor text preprocessing



def cleanText(corpus):
    import string
    validchars = string.ascii_letters + string.digits + ' '
    punctuation = """.,:;@(){}[]$1234567890"""
    corpus = [z.lower().replace('\n','') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    
    for c in punctuation:
        corpus =[z.replace(c, '') for z in corpus]
    

    corpus = [''.join(ch for ch in z if ch in validchars) for z in corpus]
    
    #treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' %s '%c) for z in corpus]
    corpus = [z.split() for z in corpus]
#     corpus = [z.replace(' ', '_') for z in corpus]
    return corpus


x_train_c = cleanText(x_train)
x_test_c = cleanText(x_test)
unsup_  = cleanText(unsup_reviews)

Train Word2vec model here

In [75]:
n_dim = 150
#Initialize model and build vocab
imdb_w2v = Word2Vec(size=n_dim, min_count=10,
                    sentences=None, alpha=0.025, window=5, max_vocab_size=None,
                    sample=0, seed=1, workers=6, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0,
                    iter=1, null_word=0)


# imdb_w2v.build_vocab(np.concatenate((unsup_,x_train)))
imdb_w2v.build_vocab(x_train_c)

# Train the model over train_reviews (this may take several minutes)
# imdb_w2v.train(np.concatenate((unsup_,x_train)))
imdb_w2v.train(x_train_c)
Out[75]:
5535848
In [76]:
print imdb_w2v.most_similar(['good'], topn=5)
# print imdb_w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
[('decent', 0.6008697748184204), ('great', 0.5777084827423096), ('nice', 0.563409686088562), ('bad', 0.5431483387947083), ('cool', 0.5381938219070435)]

To make DocVec based on SOMind

Now the question is how to make prediction at the document level

In [77]:
import pandas as pd
import pandas.io.data
import numpy as np
from matplotlib import pyplot as plt
import sys
from sklearn.preprocessing import scale

pd.__version__
Out[77]:
u'0.18.0'
In [78]:
# Here, we have the choice of using ay pretrained model too
Googlevec = 'No'
import gensim


vocablen = len(imdb_w2v.vocab.keys())

# vocablen = len(uniq_from_x_train)

vector_size = imdb_w2v.vector_size
VocabVec = np.zeros((vocablen,vector_size))

vocab = imdb_w2v.vocab.keys()



for i in range(vocablen):
    if Googlevec=='Yes':
        try:
            VocabVec[i] = Google_w2v[vocab[i]]
        except:
            continue
    else:
        try:
            VocabVec[i] = imdb_w2v[vocab[i]]
        except:
            continue
        
print Googlevec 

print 'data size', VocabVec.shape
No
data size (20332, 150)
In [79]:
def buildDocHistogram(Vocab_ind, text, ind_size,normalize='Yes'):
    vec = np.zeros(ind_size).reshape((1, ind_size))
    count = 0.
    for word in text:
        try:
            vec[0,Vocab_ind[word]] += 1
            count += 1.
        except KeyError:
            continue
    if count != 0:
        if normalize=='Yes':
            vec /= count
    return vec
In [80]:
#Build new dim for vocabs based on SOMinds
print 'data size', VocabVec.shape
ind_final_vocab = VocabVec.sum(axis=1)!=0

final_VocabVec = VocabVec[ind_final_vocab]

final_vocab = list(np.asarray(vocab)[ind_final_vocab])
Vocab_Wordind = dict(zip(final_vocab,range(len(final_vocab)) ))
data size (20332, 150)
In [81]:
ind_size = len(final_vocab)
labels = y_train
ind_pos = labels==1
ind_neg = labels==0
all_coocur_ = np.zeros((len(x_train_c),ind_size))
for i in range(len(x_train_c)):
    all_coocur_[i]= buildDocHistogram(Vocab_Wordind, x_train_c[i], ind_size,normalize='No')
all_coocur_  = all_coocur_.sum(axis=0)

print 'all done'
len_neg = len(list(np.asarray(x_train_c)[ind_neg]))
neg_coocur_ = np.zeros((len_neg,ind_size))
for i,text in enumerate(list(np.asarray(x_train_c)[ind_neg])):
    neg_coocur_[i,:]= buildDocHistogram(Vocab_Wordind, text, ind_size,normalize='No')
neg_coocur_  = neg_coocur_.sum(axis=0)

print 'neg done'
len_pos = len(list(np.asarray(x_train_c)[ind_pos]))
pos_coocur_ = np.zeros((len_pos,ind_size))
for i,text in enumerate(list(np.asarray(x_train_c)[ind_pos])):
    pos_coocur_[i,:]= buildDocHistogram(Vocab_Wordind, text, ind_size,normalize='No')
pos_coocur_  = pos_coocur_.sum(axis=0)

print 'pos done'
all done
neg done
pos done
In [82]:
labels = y_train
ind_pos = labels==1
ind_neg = labels==0
#Make the histogram of documents basedo n SOMinds
ind_size = len(final_vocab)


# all_coocur_ = np.concatenate([buildDocHistogram(Vocab_Wordind, z, ind_size,normalize='No') for z in x_train_c])
# pos_coocur_ = np.concatenate([buildDocHistogram(Vocab_Wordind, z, ind_size,normalize='No') for z in list(np.asarray(x_train_c)[ind_pos])])
# neg_coocur_ = np.concatenate([buildDocHistogram(Vocab_Wordind, z, ind_size,normalize='No') for z in list(np.asarray(x_train_c)[ind_neg])])


# #Summing over all texts for each word
# pos_coocur_ = pos_coocur_.sum(axis=0)
# neg_coocur_ = neg_coocur_.sum(axis=0)
# all_coocur_ = all_coocur_.sum(axis=0)

#normalizing the values
# pos_coocur_ = pos_coocur_/all_coocur_
# neg_coocur_ = neg_coocur_/all_coocur_

pos_to_neg = pos_coocur_/(neg_coocur_+1)

sorted_features =pd.DataFrame(index=range(pos_coocur_.shape[0]))
sorted_features['words'] = Vocab_Wordind.keys()
sorted_features['pos_coocur_'] = pos_coocur_
sorted_features['neg_coocur_'] = neg_coocur_
sorted_features['pos_to_neg'] = pos_to_neg
sorted_features['differ'] = np.abs(neg_coocur_-pos_coocur_)
sorted_features = sorted_features.sort_values('differ',ascending=False)
sorted_features.head()
Out[82]:
words pos_coocur_ neg_coocur_ pos_to_neg differ
17071 predicted 87764.0 73465.0 1.194621 14299.0
5983 was 171797.0 163184.0 1.052774 8613.0
17619 paddy 76676.0 68766.0 1.115012 7910.0
11015 down 56285.0 49633.0 1.134001 6652.0
11027 annmargret 50111.0 43690.0 1.146941 6421.0
In [83]:
sorted_features.shape
Out[83]:
(20332, 5)
In [84]:
###############
###############


### It seems that having all the features is not that bad! Even the results are similar, eventhough it might slow down the 
### som trainig and som projection steps, it dosne't need conditional probabilities to be calculated
sel_features = sorted_features.index[:15000].values
Data= final_VocabVec[sel_features,:]


# sel_features = sorted_features.index[:].values
# Data= final_VocabVec

# len(sel_vocab)


#Train a SOM based on vocabs
# reload(sys.modules['sompy'])
ind_size = 3000
sm1 = SOM.SOM('sm', Data, mapsize = [1,ind_size],norm_method = 'var',initmethod='pca')
# ind_size = 50*50
sm1.train(n_job = 1, shared_memory = 'no',verbose='final')
print 'Training Done'

# sm1.hit_map()
print sm1.codebook.shape

#Remained Data
print sm1.data.shape


#Build new dim for vocabs based on SOMinds
Vocab_Somind = dict(zip(list(np.asarray(final_vocab)[sel_features]), list(sm1.project_data(Data))))
# Vocab_Somind = dict(zip(list(np.asarray(final_vocab)[:]), list(sm1.project_data(Data))))


# Vocab_Somind = dict(zip(final_vocab, list(sm1.project_data(Data))))
Total time elapsed: 153.612000 secodns
final quantization error: 8.289663
Training Done
(3000, 150)
(15000, 150)
In [85]:
DF = pd.DataFrame()
DF['word']=np.asarray(final_vocab)[sel_features]
b = sm1.project_data(Data)
DF['somind'] = b

DF.sort_values('somind')[:10]
Out[85]:
word somind
1334 ford 0
977 peter 0
2261 l 0
1021 wood 0
1149 holmes 0
1978 jr 0
1205 jim 0
2137 sir 0
11396 williams 0
11322 wayne 0
In [86]:
from sklearn.preprocessing import scale
#Make the histogram of documents basedo n SOMinds
train_vecs = np.concatenate([buildDocHistogram(Vocab_Somind, z, ind_size) for z in x_train_c])
train_vecs = scale(train_vecs)


test_vecs = np.concatenate([buildDocHistogram(Vocab_Somind, z, ind_size) for z in x_test_c])
test_vecs = scale(test_vecs)
In [87]:
# #now select the most informative features (here are sominds, but we can do this on original words too)

# def calc_conditional_feature_importance(corpus_mat,labels):
#     #corpus_mat is the original matrix where each row is one record and columns are features, where are either words or sominds
#     #sentiments are labels
#     #it returns a matrix showing the relative importance of each feature regarding to each label
#     pos_coocur_ = np.zeros((corpus_mat.shape[1],1))
#     neg_coocur_ = np.zeros((corpus_mat.shape[1],1))
#     ind_pos = labels==1
#     ind_neg = labels==0
#     for i in range(corpus_mat.shape[1]):
#         pos_coocur_[i] = np.sum(corpus_mat[ind_pos,i])
#         neg_coocur_[i] = np.sum(corpus_mat[ind_neg,i])
#         sum_ = (pos_coocur_[i]+neg_coocur_[i])
#         if sum_ !=0:
#             pos_coocur_[i] = pos_coocur_[i]/sum_
#             neg_coocur_[i] = neg_coocur_[i]/sum_
        
# #             print i
#     DF =pd.DataFrame(index=range(corpus_mat.shape[1]))
#     DF['pos_coocur_'] = pos_coocur_
#     DF['neg_coocur_'] = neg_coocur_
#     DF['differ'] = np.abs(neg_coocur_-pos_coocur_)
#     DF = DF.sort_values('differ',ascending=False)
#     return DF

Predictor

it can be any method

In [91]:
#Use classification algorithm (i.e. Stochastic Logistic Regression) on training set, then assess model performance on test set
from sklearn.linear_model import SGDClassifier
import sklearn.linear_model as lm
lm.RidgeClassifier
from sklearn.decomposition import RandomizedPCA


# howmany = range(10,sm1.nnodes,200)
# # howmany = range(10,15000,500)
# howmany = range(sm1.nnodes,sm1.nnodes+1)
clf = lm.RidgeClassifier()
# from sklearn.neighbors import KNeighborsClassifier
# clf = KNeighborsClassifier(n_neighbors=5)
# clf = lm.SGDClassifier(loss="hinge", alpha=0.01, n_iter=200)

# import sklearn.ensemble as ensemble
# clf = ensemble.RandomForestRegressor(n_jobs=1) 






X_Train = train_vecs[:]
X_Test = test_vecs[:]

# pca = RandomizedPCA(n_components=int(.05*X_Train.shape[1]))
# pca.fit(X_Train)
# X_Train = pca.transform(X_Train)
# X_Test = pca.transform(X_Test)




clf.fit(X_Train, y_train)




import sklearn.metrics as metrics
print metrics.classification_report(y_test,clf.predict(X_Test))
             precision    recall  f1-score   support

        0.0       0.85      0.84      0.85     12542
        1.0       0.84      0.85      0.85     12458

avg / total       0.85      0.85      0.85     25000

Some extensions

One of my previous test application for news analysis

http://todo-vahidmoosavi.rhcloud.com/somnews