April 2016 – The Kandyan Code

Machine Learning is a hot subject right now, thanks to self-driving cars, awesome recommendation systems and personal assistants like Siri. Yet there is still no agreed-upon definition of Machine Learning. The plethora of similar streams, such as Data Science and Data Mining, only adds to the confusion of a newbie Machine Learning Engineer.

The simple view that I have of a Machine Learning System is as follows:

We build a system that has a goal G, and whose performance (or accuracy) at that goal is P. Any system that improves its performance P based on experience E is a Machine Learning System.

Note that I did not include any mathematical definition or symbols. It is simply a very subjective definition. The heart of Machine Learning theory deals with how to increase performance P with more experience E.

Let’s take two examples of Machine Learning systems.

1. Housing price prediction system

In this system, the goal G is to predict the price of a house based on features such as size of land, number of bedrooms, number of bathrooms etc. Its performance P is how well the system predicts the price of the house. The experience E that our system gets is a dataset of the features and the corresponding price. For example, each entry in the dataset has the land size, number of bedrooms, number of bathrooms and the corresponding price:
<2400, 4, 2, $224,000>, <3700, 5, 3, $524,000>, …

After the system has gained experience, we query it for the price of a house by giving it the features of that house: <2900, 3, 1, $??>
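To make the query concrete, here is a minimal sketch of one possible predictor: a 1-nearest-neighbour lookup that returns the price of the most similar house it has seen. The post does not say which algorithm the system uses, so the technique and the `predict_price` helper are assumptions for illustration. Only the two rows quoted above are used as experience.

```python
import math

# Experience E: the two example rows quoted in the post,
# as (land size, bedrooms, bathrooms) -> price in dollars.
training_data = [
    ((2400, 4, 2), 224_000),
    ((3700, 5, 3), 524_000),
]


def predict_price(features, examples):
    """Return the price of the training house closest to `features`.

    Hypothetical helper: closeness is plain Euclidean distance on the raw
    feature values, so land size dominates. A real system would rescale
    the features and learn from far more than two houses.
    """
    _, nearest_price = min(
        examples, key=lambda example: math.dist(example[0], features)
    )
    return nearest_price


# The query from the post: <2900, 3, 1, $??>
predicted = predict_price((2900, 3, 1), training_data)
print(predicted)  # 224000, the price of the closer house <2400, 4, 2>
```

With so little experience the answer is crude, which is the point of the definition: adding more rows to `training_data` is how such a system would improve its performance P.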

2. System to group similar songs together

Given a list of songs (say, a million songs), our ML system should automatically group them according to how similar they are. Our goal G here is to collect similar songs together. Our performance P here is a bit unclear, but we can define similarity between two songs based on features like tempo, genre, artist, duration, chords used etc. Then, our system should group songs to maximize the similarity. A dataset might look like this:
<120bpm, rock, Linkin Park, 4:20, <c,d,a,g>, raw signal data>, <150bpm, pop, Katy Perry, 3:19, <g,d,c,g,a>, raw signal data>, …

After this system has gained experience (been trained), we can input a new song and the ML system will add it to one of the existing groups.
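One common way to do this kind of grouping is k-means clustering. The sketch below is only an illustration under several assumptions: it uses just tempo and duration (in seconds), it fixes the number of groups at 2 instead of letting the system decide, and all songs except the first two are made-up values. The helper names `nearest` and `k_means` are hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as the starting centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centroid
            for group, centroid in zip(groups, centroids)
        ]
    return centroids


# (tempo in bpm, duration in seconds). The first two rows come from the
# dataset above (4:20 = 260 s, 3:19 = 199 s); the rest are made-up values.
songs = [(120, 260), (150, 199), (118, 250), (125, 270), (148, 205), (155, 190)]

centroids = k_means(songs, k=2)
labels = [nearest(song, centroids) for song in songs]
print(labels)  # [0, 1, 0, 0, 1, 1]

# A new (made-up) song is placed into the group with the nearest centroid.
print(nearest((122, 255), centroids))  # 0
```

Note that no song carries a group label going in; the groups come out of the data alone.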

In case you missed it, there is a fundamental difference between the two ML systems above. In house price prediction, our system knew the price of each house in the dataset. Another way to say this is that our prediction system gained experience (was trained) while being supervised with the expected outcome.

In the second example ML system, there was no such supervision. Our dataset did not explicitly contain the group to which each song belonged. The ML system had to decide how many groups there were, as well as which group each song belonged to. In other words, the second ML system gained experience without supervision.

The above somewhat contrived examples were given to highlight the two main categories in Machine Learning, namely:

Supervised Learning: Where the training dataset includes, for every entry, the value we are trying to predict. In the first example ML system, this value was the price.
Unsupervised Learning: Where the goal of the system is to find patterns and structures in the data. The training dataset consists only of the features used to find similarities, patterns etc., with no expected outcome attached.

Examples of Supervised Learning:
1. Spam email filtering
A training dataset will look like the one below:
<raw-email-data1, spam>, <raw-email-data2, Not spam>, <raw-email-data3, Not spam>….
A query will look like:
<raw-email-data, ?>

2. Stock price prediction
A training dataset will look like:
<date1, opening price, closing price, competitor price, …, stock price>, <date2, opening price, closing price, competitor price, …, stock price>, <date3, opening price, closing price, competitor price, …, stock price>, …
A query will look like:
<date, opening price, closing price, competitor price, …, ?>
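As a rough illustration of the spam filtering example above, the sketch below trains on labelled emails and then answers a query. It is a crude word-vote, not what a production filter would use: it counts how often each word appeared under each label and picks the label whose words match the query more. The emails are made up, and the `train` and `classify` helpers are hypothetical.

```python
from collections import Counter

# Made-up stand-ins for <raw-email-data, label> training pairs.
training_emails = [
    ("win a free prize now", "spam"),
    ("meeting notes for monday", "Not spam"),
    ("lunch on monday", "Not spam"),
]


def train(emails):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "Not spam": Counter()}
    for text, label in emails:
        counts[label].update(text.lower().split())
    return counts


def classify(text, counts):
    """Answer a <raw-email-data, ?> query; ties go to "Not spam"."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][word] for word in words)
    not_spam_score = sum(counts["Not spam"][word] for word in words)
    return "spam" if spam_score > not_spam_score else "Not spam"


counts = train(training_emails)
print(classify("free prize inside", counts))      # spam
print(classify("notes for the meeting", counts))  # Not spam
```

The supervision is visible in `training_emails`: every entry carries the expected answer, which is exactly what the query leaves out.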

Examples of Unsupervised Learning:

1. Identifying market segments for a product/brand

2. Categorizing news articles

That’s all for now. I’m hoping to continue this subject further by first dealing with Supervised Learning. Comment if you find any errors or if you are unclear on anything 🙂

127.0.0.1 of Sagara Paranagama

Notes on Machine Learning