Friday 8 May 2015

K-Means Clustering Code

We have written code for K-Means clustering of the MFCCs we have obtained. We have yet to test the code; that will be our task for the coming week.

#train1.txt - contains all mfccs of all the categories' samples (FIRST frame coefficients only)

import math
import random

def dv(p1, p2): #Euclidean distance between two 2-D points
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

def mindv(p): #index of the centroid closest to point p
    mindist = float("inf")
    mini = 0
    for i in range(len(centroids)):
        temp = dv(p, centroids[i])
        if temp < mindist:
            mindist = temp
            mini = i
    return mini

def meanPair(lst): #mean (centroid) of a list of 2-D points
    meanValX = 0
    meanValY = 0
    for l in lst:
        meanValX += l[0]
        meanValY += l[1]
    return [float(meanValX)/len(lst), float(meanValY)/len(lst)]

def costFunc(cen, cls, m): #average squared distance of points from their centroids
    J = 0.0
    for l in range(len(cls)):
        for p in cls[l]:
            J += (p[0] - cen[l][0])**2
            J += (p[1] - cen[l][1])**2
    return float(J)/m

fd = open("train1.txt", "r")
L1 = []
for l in fd.read().split("\n"):
L1.append([float(i) for i in l.split(',')])

pairs = []

fd = open("train2.txt", "r") #train2.txt is read the same way
L2 = []
for l in fd.read().split("\n"):
    if l.strip():
        L2.append([float(i) for i in l.split(',')])

#pair up the j-th coefficients of the two frames
for i in range(len(L1)):
    for j in range(13):
        pairs.append([L1[i][j], L2[i][j]])

k = 4
centroids = []
for i in range(k): #pick k random points as the initial centroids
    centroids.append(pairs[random.randint(0, len(pairs) - 1)])

oldJ = 999999

#iterate until the cost stops improving by more than 5%
while 1:
    C = [] #C[i] holds the points assigned to cluster i
    for i in range(k):
        C.append([])

    #assign each point to its nearest centroid
    for p in pairs:
        C[mindv(p)].append(p)

    newJ = costFunc(centroids, C, len(pairs))
    if (float(oldJ) - float(newJ)) / float(oldJ) <= 0.05:
        break
    oldJ = newJ

    #move each centroid to the mean of its cluster
    newCentroids = []
    for i in range(k):
        if C[i]:
            newCentroids.append(meanPair(C[i]))
        else: #keep the old centroid if its cluster is empty
            newCentroids.append(centroids[i])
    centroids = newCentroids

print(centroids, C)
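As a quick sanity check before running this on the real MFCC files, we could generate two small synthetic input files in the expected comma-separated format (13 values per line). This is only a sketch; the row count and values are arbitrary placeholders, and it writes to the same file names the script reads, so it should only be run against throwaway copies:

import random

#write two files of 20 rows x 13 comma-separated values each; train1.txt
#supplies the x-coordinates of the pairs and train2.txt the y-coordinates
for name, offset in [("train1.txt", 0.0), ("train2.txt", 5.0)]:
    with open(name, "w") as f:
        for _ in range(20):
            row = [str(offset + random.uniform(-1, 1)) for _ in range(13)]
            f.write(",".join(row) + "\n")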

Friday 17 April 2015

Performance Measure of Naive Bayes Classifier

Performance Measure
A Naive Bayes (NB) classifier works well when the data set is small because of its low variance. It follows a simple algorithm that essentially amounts to counting. An NB classifier gives faster output when the conditional independence assumption holds, and even when it does not, it still performs better than expected more often than not. A Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. It is also a good choice when some kind of semi-supervised learning is needed.

Examples and Test Cases
The Naive Bayes classifier is a good fit for the text classification problem of email spam filtering: classifying email messages as spam or non-spam. Since a document is often represented as a bag of words, text classifiers typically don't use any kind of deep representation of language. This is an extremely simple representation: it only records which words appear in the document and how often, and discards the word order.
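A minimal sketch of such a bag-of-words spam filter, with made-up training messages and add-one smoothing (everything here is illustrative, not our actual data):

from collections import Counter
import math

#toy labelled training data (placeholder messages)
train = [("buy cheap pills now", "spam"),
         ("cheap meds buy now", "spam"),
         ("meeting at noon tomorrow", "ham"),
         ("lunch tomorrow at noon", "ham")]

wordCounts = {"spam": Counter(), "ham": Counter()}
docCounts = Counter()
for text, label in train:
    docCounts[label] += 1
    wordCounts[label].update(text.split())

vocab = set(w for c in wordCounts.values() for w in c)

def classify(text):
    best, bestScore = None, float("-inf")
    for label in wordCounts:
        #log prior plus log likelihood of each word, with add-one smoothing
        score = math.log(float(docCounts[label]) / sum(docCounts.values()))
        total = sum(wordCounts[label].values())
        for w in text.split():
            score += math.log((wordCounts[label][w] + 1.0) / (total + len(vocab)))
        if score > bestScore:
            best, bestScore = label, score
    return best

print(classify("cheap pills tomorrow")) #expected: spam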


The strong feature-independence assumption, however, makes it less suitable for speech recognition. Consider a model that uses the average sentence length as one feature among others. If we add features modeling the syntactic complexity of the sentences in a text, such features may add new cues to the model, but syntactic complexity is also correlated with sentence length. In such situations, naive Bayes models may fail, since they treat all features as independent.

Testing A Model

Once the training phase is complete, the testing phase begins. In this phase, the chosen algorithm is run on testing data (new samples that are not present in the training set) and its output is checked. For instance, if the input is a sound sample that needs to be classified as human or dog, the algorithm uses the parameters learned from the training set to identify the given test sample.

For this, we need to select a classifier appropriate to the application at hand. Different classification algorithms yield varying performance on different applications; for example, a classifier may be good for text processing but not for audio processing. A few candidate classifiers may be identified, each evaluated by its accuracy in identifying the test data correctly, and the one with the best results used. The testing phase also depends on the training phase for its performance: care must be taken to ensure that the samples in the training set contain sufficiently clear and relevant data (in the case of speech processing, pure samples of speech with the required words and little distortion and background noise).

The testing phase is thus also where the performance of the model is measured, for example as the fraction of test samples classified correctly.
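A minimal sketch of that measurement, assuming a trained classifier like the classify function above and a small labelled test set (both hypothetical):

#hypothetical labelled test samples, kept out of the training set
test = [("buy cheap meds", "spam"), ("see you at lunch", "ham")]

correct = sum(1 for text, label in test if classify(text) == label)
print("accuracy:", float(correct) / len(test))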

Scatter Plots

Below are the scatter plots of two samples of dogs barking:



Here are the scatter plots of human speech samples:
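For reference, plots like these can be produced with matplotlib along the following lines; the file name and the choice of coefficients are placeholders for whatever sample is being plotted:

import matplotlib.pyplot as plt

#scatter the first mfcc coefficient of each frame against the second
#(hypothetical file name; one comma-separated frame per line)
xs, ys = [], []
for line in open("dog_sample_mfcc.txt"):
    if line.strip():
        coeffs = [float(v) for v in line.split(',')]
        xs.append(coeffs[0])
        ys.append(coeffs[1])

plt.scatter(xs, ys)
plt.xlabel("MFCC coefficient 1")
plt.ylabel("MFCC coefficient 2")
plt.show()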




Report - What It Means To Train A Model

Training a model is the process of getting a machine to "learn" from data: we supply an adequate training set of relevant data so that the model becomes able to recognize new instances of the same type of data that were not part of the training set. This is done by pairing the input data with the expected output. For example, if we must train a model to recognize human speech in English, we must supply it with a training set containing a large number of people speaking English. With the help of this training set, the machine is taught to recognize the samples as human speech by associating them all with common patterns in their feature vectors (for example, MFCCs). Once a model is believed to be trained, it must be tested with test data; in the context of the example above, this could consist of a few more samples of human speech that are not present in the training set.

Training a model can be accomplished with a supervised, an unsupervised, or a semi-supervised algorithm. A supervised algorithm helps a machine infer from labelled data: each training sample comes with its expected output. An unsupervised algorithm, on the other hand, enables the machine to learn from the data on its own by finding a hidden pattern or organization in it (the data is unlabelled). In semi-supervised learning, both labelled and unlabelled data are used; commonly, a small amount of labelled data is combined with a large amount of unlabelled data, where the labelled data helps the machine understand the structure of the unlabelled data. The sketch below illustrates the first two cases.
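The contrast is easy to see with scikit-learn, using toy 2-D feature vectors as stand-ins for real MFCC features (all values here are placeholders):

from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

X = [[1.0, 1.2], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]] #toy feature vectors
y = [0, 0, 1, 1] #labels, used only in the supervised case

clf = GaussianNB().fit(X, y)     #supervised: learns from labelled data
print(clf.predict([[1.1, 1.0]])) #-> [0]

km = KMeans(n_clusters=2).fit(X) #unsupervised: labels are never seen
print(km.labels_)                #grouping discovered from the data alone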

The training phase is extremely important, since the datasets used will largely influence the machine's ability to learn, its performance (measured in terms of how many test cases are identified correctly or erroneously), and its efficiency (in terms of speed and energy utilization). The datasets must be large enough and, in the case of speech and sound applications, diverse enough. Most often, the richer the dataset used in the training phase, the more accurate the results.

Tuesday 7 April 2015

The tasks to be done this week:

1) Get the scatter plots for different sound samples and try to separate them manually.

2) Report on classification.

3) Classify data on Weka and compare with known results.

Wednesday 1 April 2015

Using Weka - First Example We Ran

We created a .arff file with the following small sample data set about housing. It has six attributes: housingSize, lotSize, bedrooms, granite, bathroom and sellingPrice.
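In standard ARFF syntax the file looks something like this (the data rows below are illustrative placeholders, not our actual values):

@RELATION housing

@ATTRIBUTE housingSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC

@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900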


Selecting this .arff file in the Weka Explorer:


Selecting the Linear Regression classifier:


We used help from the following link: http://www.ibm.com/developerworks/library/os-weka2/