Learning Machine Learning – Part 1

What is Machine Learning(ML)? As per a definition given by Tom Mitchell, Machine Learning is the ability of a computer program to improve its Performance(P) at a given task(T) using prior experience(E).

ML problems can be broadly classified into Supervised and Unsupervised learning. These categories have further sub-categories.

  • Supervised – You are given a data set and there is a known relation between input and output. The computer program uses that test data and to learn the relation and use it to predict the output for any given input.
    • Regression – In these set of problems, the output is a continuous function of input, eg. Given a picture of a person, we have to predict their age.
    • Classification – Here, the output is discrete. eg. Given a picture of a person, we have to identify their race/gender etc.
  • Unsupervised – The computer program is not fed with test instances. It first identifies all different groups/classes that the data can be ‘classified’ into. And then use that knowledge to predict where a particular data instance will fit best into.
    • Clustering
    • Non-clustering

Now that we are done with definitions, lets take up a simple regression problem and dive into the mathematics involved to arrive at an algorithm(Gradient Descent).

Problem – Given the age(x) of a house,  predict its price(y).

Lets assume we are given a data set of 10,000 houses with their age and current market price. So test data for our ML program will be of the form (xi, yi) where i ∈ [1,10000]. Now we will feed these data instances to our learning algorithm and come out with a predictor function, h(x) = y = θ0 + θ1x, where θ0, θ1 are variables that we need to find such that the predicted value of y is closest to the actual y.

h(x) is known as hypothesis function.

A diagram will make things easier…

points_for_linear_regression1-1

This is a plot of y against x for all the test instances. Our objective is to find a straight line such that average distance of each data point from the line is minimized. That line can be represented by the equation, y = θ0 + θ1x, where θand θare respectively, the y-intercept and the slope.

points_for_linear_regression1

To find such line, we will use the mean squared error method.

\operatorname {MSE}={\frac {1}{n}}\sum _{{i=1}}^{n}({\hat {Y_{i}}}-Y_{i})^{2}

where Y hat is the predicted value for the ith  instance and Y is the actual value.

Lets call this function, our cost function J(θ0, θ1).

doc-24-nov-2016-5-43-pm

Machine Learning for Text Extraction

In a previous post we looked at the use of Natural Language Processing techniques in text extraction. Several steps are involved in the processing as each document passes through a pipeline of chained tasks.

A deep pipeline can take several seconds for a document. So if one is dealing with thousands of documents an hour the processing requirements could make the system nonviable. Care needs to be taken to evaluate the trade-off between the improvements in accuracy caused by adding pipeline tasks with the additional processing power that it entails.

One reason for the slow speed in our email processing is that we are parsing the entire email and all emails regardless of whether they are of importance to use. In our case only 2% of the emails received will be of interest. So we would like to reduce the amount of text we process by ignoring the unwanted stuff. This process of weeding out irrelevant text should itself not take too long otherwise our purpose is lost!

Machine Learning (ML), which is a key area in AI, offers a solution. GATE comes with various machine learning Processing Resources implementing common ML algorithms like Support Vector Machine (SVM), Bayes classification and K-nearest neighbor (KNN). You “train” the algorithm using training sets of text samples.

Training is done by manually classifying sentences in a binary fashion: is this sentence of interest to me or not? Ideally you need thousands of representative sentences. The algorithm is then trained on this data: internally the various features and annotations are used to reverse engineer patterns based on the manual classification.

In production you first run your input text through the Machine Learning pipeline task. If it predicts that the text is of interest then you run it through the rest of the pipeline, otherwise ignore it. The problem is that this prediction is probabilistic. There could be two kinds of mistakes, one where it wrongly tells you that a dud document is of interest, causing wasted CPU cycles. A more troublesome mistake is when a valid document is marked as of no interest.

In our case for example this is an unacceptable error. We will miss reporting valid events to customers and they will no longer be able to rely on our service to do so. Unfortunately ML algorithms are such that these two types of errors cannot be reduced independently: if you want all valid documents you also get a lot of duds eating up your cpu cycles.

In addition ML can give you strange results. Bad data in your training sets can have a significant impact on your results. Debugging such issues is very difficult because of the non-deterministic nature of learning algorithms. A lot of trial and error is involved, mostly tedious work manually annotating documents, running different training sets and validating the results on real data.

However as in the deterministic NLP process using JAPE the result is magic. Once you have your training sets clean and complete the ML task can significantly weed out unwanted documents. Iteratively adding runtime learning to the system (where you enhance the training sets as you go along) can add dramatic improvements over time.

After the first experience with email parsing we are now using NLP in another project. We have a product for recruiters where resume parsing is an important piece. It currently parses candidate information using regular expressions and string matches.

The accuracy is around 80% for basic information which is a problem since 1 out of 5 fields is missed or wrong. Using a slightly different pipeline from the one described above and building in some heuristic in a custom PR we have been able to get to over 95% accuracy in the lab. In addition we are now extracting several other types of information which was considered too difficult to do using traditional programming.

Our experiences have made us look at other aspects of NLP like collaborative filtering and content-based recommendation engines as well as enhanced search using NL techniques. You might see a post on this soon!