Notes on Machine Learning

Machine Learning is a hot subject right now, thanks to self-driving cars, awesome recommendation systems and personal assistants like Siri. Yet, a clear definition of Machine Learning is still not agreed upon. A plethora of similar fields such as Data Science and Data Mining only adds to the confusion of a newbie Machine Learning Engineer.

The simple view that I have of a Machine Learning System is as follows:

We build a system that has a goal G, whose performance (or accuracy) is P. Any system that increases its performance P based on experience E, is a Machine Learning System.

Note that I did not include any mathematical definitions or symbols; it is simply an informal, subjective definition. The heart of Machine Learning theory deals with how to increase performance P with more experience E.

Let’s take two examples of Machine Learning systems.

  1. Housing price prediction system

In this system, the goal G is to predict the price of a house based on features such as land size, number of bedrooms, number of bathrooms, etc. Its performance P is how well the system predicts the price of a house. The experience E that our system gets is a dataset of the features and the corresponding prices. Ex: each entry in the dataset has the land size, number of bedrooms, number of bathrooms and the corresponding price.
<2400, 4, 2, $224,000>, <3700, 5, 3, $524,000>…….

After the system has gained experience, we query it to predict the price of a house by giving it the features of the house: <2900, 3, 1, $??>
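To make this concrete, here is a minimal sketch of such a predictor. This is not a serious method, just a 1-nearest-neighbour lookup over the toy dataset above (the class and method names are my own, for illustration only); a real system would fit a regression model over many examples.

```java
// A toy "housing price" predictor: answer a query with the price of the
// most similar house seen in the training data (1-nearest-neighbour).
public class HousePricePredictor {
    // The experience E: feature vectors <land size, bedrooms, bathrooms>
    // paired with the known (supervised) price, from the example dataset.
    static double[][] features = { {2400, 4, 2}, {3700, 5, 3} };
    static double[] prices = { 224000, 524000 };

    static double predict(double[] query) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < features.length; i++) {
            // squared Euclidean distance between query and training example
            double d = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = features[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return prices[best];
    }

    public static void main(String[] args) {
        // the query <2900, 3, 1, $??> from the text
        System.out.println(predict(new double[]{2900, 3, 1})); // prints 224000.0
    }
}
```

With more experience (more rows in the dataset), the nearest neighbour becomes a better match, which is one simple way performance P grows with experience E.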

  2. System to group similar songs together

Given a list of songs (say, a million songs), our ML system should automatically group them according to how similar they are. Our goal G here is to group similar songs together. Our performance P here is a bit unclear, but we can define the similarity between two songs based on features like tempo, genre, artist, duration, chords used, etc. Then our system should group songs so as to maximize the similarity within each group. A dataset might look like this:
<120bpm, rock, Linkin Park, 4:20, <c,d,a,g>, raw signal data>, <150bpm, pop, Katy Perry, 3:19, <g,d,c,g,a>, raw signal data>…….

After this system has gained experience (i.e., been trained), we can input a new song and the ML system will place it into an existing group.
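As a sketch of how such grouping might work under the hood, here is a toy k-means clusterer over a single feature (tempo in bpm). Everything here is illustrative, including the names and the tempo values; a real system would use many features and a more robust algorithm.

```java
import java.util.Arrays;

// Toy unsupervised grouping: 1-D k-means on song tempo.
// No labels are given; the algorithm decides the groups itself.
public class SongGrouper {
    static int[] cluster(double[] tempos, int k, int iters) {
        double[] centroids = Arrays.copyOf(tempos, k); // seed with first k songs
        int[] assign = new int[tempos.length];
        for (int it = 0; it < iters; it++) {
            // assignment step: each song joins its nearest centroid
            for (int i = 0; i < tempos.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(tempos[i] - centroids[c]) < Math.abs(tempos[i] - centroids[best]))
                        best = c;
                assign[i] = best;
            }
            // update step: move each centroid to the mean of its members
            for (int c = 0; c < k; c++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < tempos.length; i++)
                    if (assign[i] == c) { sum += tempos[i]; n++; }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] tempos = {120, 150, 118, 152, 125};
        // slow songs and fast songs end up in separate groups
        System.out.println(Arrays.toString(cluster(tempos, 2, 10)));
    }
}
```

Note that the dataset contains no group labels at all; the groups emerge from the data, which is exactly what distinguishes this system from the housing example.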

In case you missed it, there is a fundamental difference between the two ML systems above. In house price prediction, our system had knowledge of the price of each house in the dataset. Another way to say this is that our prediction system gained experience (was trained) while being supervised with the expected outcome.

In the second ML system, there was no such supervision. Our dataset did not explicitly contain the group to which a particular song belonged. The ML system had to decide how many groups there were, as well as which group each song belonged to. In other words, the second ML system gained experience unsupervised.

The above somewhat contrived examples highlight the two main categories of Machine Learning, namely:

  1. Supervised Learning: the training dataset includes data on the feature we are trying to predict. In the housing example, this feature was the price.
  2. Unsupervised Learning: the goal of the system is to find patterns and structures in the data. The training dataset consists only of the features used to discover similarities, patterns, etc.

Examples of Supervised Learning:
1. Spam email filtering
A training dataset will look like the one below:
<raw-email-data1, spam>, <raw-email-data2, Not spam>, <raw-email-data3, Not spam>….
A query will look like:
<raw-email-data, ?>

2. Stock price prediction
A training dataset will look like:
<date1, opening price, closing price, competitor price,.. , stock price>,<date2, opening price, closing price, competitor price,.. , stock price>,<date3, opening price, closing price, competitor price,.. , stock price>…..
A query will look like:
<date, opening price, closing price, competitor price,.. , ?>

Examples of Unsupervised Learning:

1. Identifying market segments for a product/brand

2. Categorizing news articles

That’s all for now. I’m hoping to continue this subject further by first dealing with Supervised Learning. Comment if you find any errors or if you are unclear on anything 🙂


Coursera course in ML by Andrew Ng – Link
Introduction to Statistical Learning – Link


How the Raspberry Pi boots up

This is an in-detail account of the Raspberry Pi boot process, collected from various sources, mainly the official forums. First, you need to know that the RPi does not boot up like a conventional desktop computer: the VideoCore, a.k.a. the graphics processor, actually boots before the ARM CPU! Anyway, before we get into the details, here's a diagram of the RPi highlighting the Broadcom BCM2835 SoC.

The SoC (or System-on-Chip) contains the ARM CPU, the VideoCore graphics processor, ROM (Read-Only Memory) chips, the SDRAM and many other things. Basically, think of an SoC as your motherboard and CPU compressed into a single chip.

When you power on your Raspberry Pi, the first bits of code to run are stored in a ROM chip in the SoC, built into the Pi during manufacture! This is called the first-stage bootloader. The SoC is hardwired to run this code on startup on a small RISC (Reduced Instruction Set Computer) core. It mounts the FAT32 boot partition on your SD card so that the second-stage bootloader can be accessed. So what is this 'second-stage bootloader' stored in the SD card? It's 'bootcode.bin'. You might have seen this file if you have mounted the SD card in Windows.

Now here's something tricky. The first-stage bootloader has not yet initialized your ARM CPU (meaning the CPU is in reset) or your RAM, so the second-stage bootloader also has to run on the GPU. The bootcode.bin file is loaded into the 128KB four-way set-associative L2 cache of the GPU and then executed. It enables the RAM and loads start.elf, which is also on your SD card. This is the third-stage bootloader, and it is also the most important. It is the firmware for the GPU, meaning it contains the settings or, in our case, has instructions to load the settings from config.txt, which is also on the SD card. You can think of config.txt as the 'BIOS settings' (as mentioned in the forum). Some of the settings you can control are (thanks to dom):

arm_freq : frequency of ARM in MHz. Default 700.

gpu_freq : Sets core_freq, h264_freq, isp_freq, v3d_freq together.

core_freq : frequency of GPU processor core in MHz. Default 250.

h264_freq: frequency of hardware video block in MHz. Default 250.

isp_freq: frequency of image sensor pipeline block in MHz. Default 250.

v3d_freq: frequency of 3D block in MHz. Default 250.

sdram_freq: frequency of SDRAM in MHz. Default 400.
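As an illustration, a config.txt overriding a couple of these defaults might look like the following. The values here are only examples; any line you omit keeps the default listed above.

```ini
# config.txt - example values only
arm_freq=800
sdram_freq=450
gpu_freq=275
```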

The start.elf also splits the RAM between the GPU and the ARM CPU. The ARM only has access to the address space left over by the GPU. For example, if the GPU was allocated addresses from 0x0000F000 – 0x0000FFFF, the ARM has access to addresses from 0x00000000 – 0x0000EFFF. (These are not real address ranges; they're just for demonstration purposes.) Now what's even funnier is that the ARM core perceives 0x00005001 as its beginning address, 0x00000000. In other words, if the ARM core requests address 0x00000000, the actual address in RAM is 0x00005001. Edit: The physical addresses perceived by the ARM core are actually mapped to other addresses in the VideoCore address space (0xC0000000 and beyond) by the MMU (Memory Management Unit) of the VideoCore.

The config.txt is loaded after the split is done, so you cannot specify the split amounts in config.txt. However, different .elf files with different splits exist on the SD card, so depending on your requirements, you can rename one of those files to start.elf and boot the Pi. (The forums mention having this functionality in a dynamic fashion, but I don't know whether they have implemented it yet.) [EDIT 8/7/2014: As per Andrew's comment, it has been implemented in current firmware.] In the Pi, the GPU is King!

Other than loading config.txt and splitting the RAM, start.elf also loads cmdline.txt if it exists. It contains the command-line parameters for whatever kernel is to be loaded. This brings us to the final stage of the boot process. start.elf finally loads kernel.img, the binary file containing the OS kernel (DUH!?), and releases the reset on the CPU. The ARM CPU then executes whatever instructions are in kernel.img, thereby loading the operating system.

After the operating system starts, the GPU code is not unloaded. In fact, start.elf is not just firmware for the GPU; it is a proprietary operating system called VideoCore OS (VCOS). When the normal OS (Linux) requires an element not directly accessible to it, it communicates with VCOS using the mailbox messaging system.

Note: Special thanks to user dom  in the official RPi forums and the community behind the official wiki.

Booting the Raspberry Pi without a Monitor or Router – the first time!

If you've just got a new Raspberry Pi and don't have a monitor to try it out, fear not. The latest Debian Wheezy distributions have SSH enabled by default, enabling us to log in if we have a network connection to the RPi.

Note: You should have a Linux distribution running since we are going to modify some files in the SD Card which will have a Linux File System.

First, write the latest Debian Wheezy image to an SD card using Win32DiskImager.

Then we are going to open a file on the SD card, so switch to your favorite Linux distribution (Ubuntu, anyone?). Hopefully your SD card will already be mounted; otherwise, Google how to mount it.

I used Fedora 18, so it was mounted in


where '/' signifies the root file system of your computer (not the Pi). Type "mount" in a terminal to find the SD card if it's not displayed on the desktop.
Now we must give our Raspberry Pi an IP address. This will enable us to connect to the RPi through SSH. By default, the RPi is configured to receive an IP address from a DHCP server. What we'll do is change this default setting to a static IP address of our choosing. These settings are recorded in a file called 'interfaces' in /etc/network/, where '/' signifies the root of the Raspberry Pi SD card.
Therefore to open this file, the full path for the ‘interfaces’ file for me would be


To open it, type the following in the terminal, taking care to replace the path with your own. This will open the file in the 'nano' text editor.

sudo nano /run/media/sagara/62ba9ec9-47d9-4421-aaee-71dd6c0f3707/etc/network/interfaces

Then comment out the 'dhcp' line and modify the file like the following. I have chosen a static address for my RPi.

auto lo

iface lo inet loopback
#iface eth0 inet dhcp this is a comment
iface eth0 inet static

allow-hotplug wlan0
iface wlan0 inet manual
wpa-roam /etc/wpa_supplicant/wpa_supplicant.conf
iface default inet dhcp

Save and exit nano. Now you have given your Raspberry Pi a static IP address. Since we are not using a router to connect to the Pi, we'll be using an RJ45 Ethernet cable, a.k.a. a 'crossover' cable. So connect your RPi and laptop/PC together with the cable.
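For reference, the static stanza needs explicit address and netmask lines under 'iface eth0 inet static'. A complete eth0 section might look like this (the addresses here are only an example; use whichever private range you picked):

```
auto eth0
iface eth0 inet static
    address 192.168.0.2
    netmask 255.255.255.0
```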

Now if you know a teensy bit about networking, this won't sound peculiar: you have to give the Ethernet port on your laptop/PC and the Ethernet port on your RPi unique, valid IP addresses that are on the same network. We have already given an IP address and a network address to the Pi (see the address and netmask values). The only thing remaining is to give the Ethernet port on our laptop/PC an IP address on the same network; I picked one for my laptop accordingly.

If you are on Windows, open the Network and Sharing Center and click on the Ethernet connection properties.


Then in the IPV4 properties choose “Use the following IP address” and put,

IP :
Subnet Mask:

Then download PuTTY. It's an SSH client for Windows. Start it, enter your Pi's static address as the connecting address, and now you can connect to your Raspberry Pi!!


You can also do this in Linux: all you have to do is assign a static IP address to the Ethernet port. SSH is already installed on most Linux distributions, so PuTTY will not be needed. Here's a snapshot of SSH-ing to my RPi from Fedora.
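As a sketch, on most modern distributions the static address and the SSH connection can be set up from the terminal like this (the interface name and addresses are illustrative; check yours with `ip link` and use the address you gave the Pi):

```shell
# give the PC's Ethernet port an address on the same network as the Pi
sudo ip addr add 192.168.0.1/24 dev eth0
sudo ip link set eth0 up

# then SSH in using the static address you gave the Pi
ssh pi@192.168.0.2
```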


After this, you may want to install a VNC server on your Pi. That will let you use your laptop's screen as a monitor for the Raspberry Pi.


Text Classification in WEKA

Text classification can be done using WEKA, an open-source machine learning tool. However, a lot of people seem to run into problems when they have separate training and test data sets. If you use the StringToWordVector filter and get the dreaded 'training and test set are not compatible' error, read on. This error occurs because the word dictionary changes: word occurrences differ between the training and test sets, so filtering them separately generates two incompatible files. The solution is to batch-filter the training and test sets together using the StringToWordVector filter. The test set will then contain the same attributes, eliminating the 'training and test set are not compatible' error.
NOTE: If the test data set is missing the class attribute, you have to add it (for .arff files, add an @attribute declaration for the class; for .csv files, append the class name to the header row). Then for each instance, put a "?" as the value of that attribute ("?" signifies a missing value in WEKA).
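As a sketch, a test-set .arff with the class values replaced by "?" might look like this (the relation and attribute names are illustrative only):

```
@relation emails_test
@attribute text string
@attribute class {spam,notspam}

@data
'raw email text here', ?
```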

If you have .csv files, here's how to do it.

For each .csv file (the train and test data sets)
1. Open .csv file in WEKA Explorer
2. Convert all attributes except the class attribute(s) to String type.
Ex: If the attributes are nominal, click Filter, choose filters->unsupervised->attribute->NominalToString, click the text area of the chosen filter (Generic Object Editor) and, in the 'attributeIndexes' field, type 1-3 if your non-class attributes are attributes 1, 2 and 3. Click Apply to apply the changes.
3. (Optional) Convert your class attribute(s) to nominal type. (Otherwise most classifiers will be disabled)
Ex: If the class attribute is numeric, click Filter, choose filters->unsupervised->attribute->NumericToNominal, click the text area of the chosen filter (Generic Object Editor) and type the class attribute index (ex: 4) in the 'attributeIndexes' field. Click Apply to apply the changes.
4. Click Save in the WEKA Explorer and save the new data set as a .arff file.

Now you should have two data sets (train and test) in .arff format. We will batch-filter them with StringToWordVector using the following command on the command line.
NOTE 1: It's easier if you copy the .arff files to the weka.jar location.
NOTE 2: You should have the path to the weka.jar file in the CLASSPATH system variable. More details here.

1. Open the command prompt (start->run-> type ‘cmd’)
2. Browse to the weka.jar location (‘cd ‘ etc )
3. Type the following command

java weka.filters.unsupervised.attribute.StringToWordVector -b -i input_training_set.arff -o output_training_set.arff -c last -r input_test_set.arff -s output_test_set.arff -R 1,2,3 -O -C -T -I -N 0 -M 1

The options denote the following
-b: batch mode
This is useful for filtering two datasets at once: the first dataset is used to initialize the filter, and the second is then filtered according to that setup, i.e., your test set will contain the same attributes.
-i: training input file
-o: training output file
-r: test input file
-s: test output file
-R 1,2,3: process the first, second and third attributes (the string attributes); this is the default behavior
-C: output word count rather than boolean word presence
-T: transform term frequency into log(1+tf)
-I: transform word frequency into tf*log(total # of docs / # of docs containing the word). This is actually the tf-idf weight without normalization
-N: 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don’t normalize).(Detail explanation from Wekalist)
-L: Convert all tokens to lowercase before adding to the dictionary
-A: Only form tokens from contiguous alphabetic sequences (turn this off when working with phrases! After Weka 3.5.6 this option is no longer available; it is replaced by weka.core.tokenizers.AlphabeticTokenizer)
-S: Ignore words that are in the stoplist (we don't use this one since we've already applied our own stop list)
-M: minimum term frequency; here it is 1.

The most important options are -i, -o, -r, -s, -c, -R, -C.

Now you should have two new .arff files. Load them into WEKA and train a classifier on the training set (ex: choose Naive Bayes). In Test Options, choose Supplied Test Set, click 'Set' and choose the newly generated test data set. Click More Options and tick 'Output Predictions'. Finally, right-click the result in the 'Result list' and choose 'Re-evaluate model on current test set'.

Now you should have the classified data in the results.


Tokenizing, Stopping and Stemming using Apache Lucene

If you want to tokenize, stop and stem a line of text, you can do it using Apache Lucene. Here is sample code that takes a line of input and returns tokenized, stopped and stemmed output.


stop_word_set is a HashSet containing the stop words. I used the list from here.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

private static String tokenizeStopStem(String input) {

        // tokenize, then remove stop words, then stem
        TokenStream tokenStream = new StandardTokenizer(
                Version.LUCENE_36, new StringReader(input));
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stop_word_set);
        tokenStream = new PorterStemFilter(tokenStream);

        // collect the surviving terms into a space-separated string
        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttr = tokenStream.addAttribute(CharTermAttribute.class);
        try {
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(" ");
                }
                sb.append(charTermAttr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
}