Object detection with Turi Create and augmentation using ARKit


Over the past few years, machine learning (ML) has increasingly been used to solve complex problems. ML is a field of computer science that gives computer systems the ability to “learn” (i.e. progressively improve performance on a specific task) with data, without being explicitly programmed.

Last year was a good year for the freedom of information, as titans of the industry Google, Microsoft, Facebook, Amazon, Apple and even Baidu open-sourced their ML frameworks. In this blog, let’s explore a framework provided by Apple named Turi Create.

Data Fingerprinting to enable Incremental Improvement in Machine Learning Complexity


Many startups would like to incorporate a machine learning component into their products. Most of these products are unique in terms of the business, the data required to train the machine learning models, and the data that can be collected. One of the main challenges these startups face is the availability of data specific to their business problem. The quality of a machine learning algorithm depends on the quality of the domain-specific data used to train it, and generic data sets are not useful for the unique problems these startups are solving. As a result, they cannot roll out a feature involving machine learning until they have collected enough data. On the other hand, customers ask for the product feature before their usage can generate the required data. In such a situation, one needs to roll out a machine learning solution incrementally. For this to happen, there must be a synergy between the data and the algorithms that process it. To enforce this synergy, we propose a computational model that we refer to as “Data Fingerprinting”.

Handling Categorical Features in Machine Learning

Introduction: Every dataset has two types of variables: continuous (numerical) and categorical. Regression-based algorithms use both continuous and categorical features to build models, but in most ML libraries you can’t fit categorical variables into a regression equation in their raw form. If they are left out of the modeling, the resulting model is less accurate, so it’s crucial to learn the methods of dealing with such variables. Machine learning libraries deal with categorical variables in various ways. The approach to transforming and using them efficiently in model training varies based on multiple conditions, including the algorithm being used and the relation between the response variable and the categorical variable(s). Here I take the opportunity to demonstrate the various methods for handling categorical variables that are available in Spark’s popular machine learning library, MLlib.
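As a rough illustration of why raw categories need transforming before they can enter a regression equation, here is a minimal sketch of one-hot encoding in plain Python. One-hot encoding is only one common technique and is used here as an assumption; the function name `one_hot_encode` and the sample `colors` data are hypothetical, and this is not the MLlib API.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector (one column per category)."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # switch on the column for this category
        encoded.append(row)
    return categories, encoded


# Hypothetical sample data: a single categorical feature.
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot_encode(colors)
```

Each row now contains only numbers, so it can be used as input to a regression model.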

Learning Machine Learning – Part 1

What is Machine Learning (ML)? As per a definition given by Tom Mitchell, a computer program is said to learn if its performance (P) at a given task (T) improves with experience (E).

ML problems can be broadly classified into Supervised and Unsupervised learning. These categories have further sub-categories.

  • Supervised – You are given a data set in which the correct output is known for each input. The computer program uses that training data to learn the relation between input and output, and then uses it to predict the output for any given input.
    • Regression – In this set of problems, the output is a continuous value, e.g. given a picture of a person, we have to predict their age.
    • Classification – Here, the output is discrete, e.g. given a picture of a person, we have to identify their race, gender, etc.
  • Unsupervised – The computer program is not given labeled examples. It first identifies the different groups/classes that the data can be ‘classified’ into, and then uses that knowledge to predict which group a particular data instance fits best.
    • Clustering
    • Non-clustering

Now that we are done with definitions, let's take up a simple regression problem and dive into the mathematics involved to arrive at an algorithm (Gradient Descent).

Problem – Given the age (x) of a house, predict its price (y).

Let's assume we are given a data set of 10,000 houses with their age and current market price. The training data for our ML program will then be of the form (x_i, y_i), where i ∈ {1, …, 10000}. We will feed these data instances to our learning algorithm and come out with a predictor function, h(x) = θ0 + θ1x, where θ0 and θ1 are parameters that we need to find such that the predicted value h(x) is as close as possible to the actual y.

h(x) is known as the hypothesis function.
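The hypothesis function can be sketched in a few lines of Python. The parameter values below are hypothetical and chosen only for illustration; finding good values for them is the actual learning problem.

```python
def hypothesis(x, theta0, theta1):
    """Predict the price y of a house of age x using a straight line."""
    return theta0 + theta1 * x


# Hypothetical parameter values, chosen only for illustration.
theta0 = 500_000.0  # y-intercept: predicted price of a brand-new house
theta1 = -5_000.0   # slope: change in predicted price per year of age

# Predicted price of a 10-year-old house under these assumed parameters.
predicted_price = hypothesis(10, theta0, theta1)
```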

A diagram will make things easier…


This is a plot of y against x for all the training instances. Our objective is to find the straight line that minimizes the average squared vertical distance between the data points and the line. That line can be represented by the equation y = θ0 + θ1x, where θ0 and θ1 are, respectively, the y-intercept and the slope.


To find such a line, we will use the mean squared error method.

\operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2

where Ŷ_i is the predicted value for the i-th instance, Y_i is the actual value, and n is the number of instances.

Let's call this function our cost function, J(θ0, θ1). It depends on θ0 and θ1 because each predicted value is h(x_i) = θ0 + θ1x_i.
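To make the cost function concrete, here is a minimal sketch that evaluates J(θ0, θ1) as the mean squared error of the line over a data set. The toy data and the function names are hypothetical, and the 1/n scaling simply follows the MSE formula above.

```python
def hypothesis(x, theta0, theta1):
    """Straight-line predictor h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Mean squared error between predictions h(x_i) and actual values y_i."""
    n = len(xs)
    squared_errors = [
        (hypothesis(x, theta0, theta1) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / n


# Hypothetical toy data: house ages in years, prices in arbitrary units.
ages = [0, 10, 20, 30]
prices = [100, 80, 60, 40]

perfect_fit = cost(100, -2, ages, prices)  # line passes through every point
poor_fit = cost(100, 0, ages, prices)      # flat line that ignores age
```

A lower cost means the line sits closer to the data, so the learning task amounts to finding the θ0 and θ1 that make J as small as possible.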