# Machine Learning (a primer)

(Portland Python, March 2008) (john melesky)

# In a nutshell

Take facts, turn them into knowledge.

# In a nutshell

Take facts, turn them into knowledge, algorithmically.

# Discovering things

Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.

The problem: check the spelling of things that aren't in the dictionary

Indigo Montoya
Inigo Montana
Inigo Montoya
Neego Montoya
Inigo Mantoya

# Spellcheck

Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)

# Spellcheck

Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)

When a new query comes in, find the most common query within a short distance and suggest it.

# Spellcheck

Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)

When a new query comes in, find the most common query within a short distance and suggest it.

And that's it.
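A minimal sketch of that idea. The query log (`query_counts`) is made up for illustration; a real one would come from your search logs.

```python
# Hypothetical spellcheck-by-query-log: suggest the most common logged
# query within a short edit distance of the new query.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(query, query_counts, max_dist=2):
    """Return the most common logged query within max_dist edits, if any."""
    candidates = [(count, q) for q, count in query_counts.items()
                  if q != query and levenshtein(query, q) <= max_dist]
    return max(candidates)[1] if candidates else None

# Invented query log.
query_counts = {"inigo montoya": 120, "indigo montoya": 3,
                "inigo montana": 2, "neego montoya": 1}
print(suggest("inigo mantoya", query_counts))  # inigo montoya
```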

# Clustering

Problem: given a big pile of documents, figure out what different categories there are.

# Clustering

Solution: simple geometry

# Clustering

Solution: simple (high-dimensional) geometry

# Clustering

Solution: (a whole lot of) simple (high-dimensional) geometry

# Technique: k-Means Clustering

1. Pick some (k) random points in your vector space.
2. For each document, figure out the nearest point.
3. Move each point to the center of the documents assigned to it.
4. Lather, rinse, repeat.
5. Voila! Slow-cooked category discovery
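Those steps, as a bare-bones sketch over 2-D points. The documents-as-vectors step is assumed to have already happened; the two blobs of data are invented.

```python
# Plain k-means: assign each point to its nearest center, then move each
# center to the mean of its assigned points, and repeat.
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Returns (centers, clusters) after a fixed number of passes."""
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(coord) / len(cluster)
                                   for coord in zip(*cluster))
    return centers, clusters

# Two made-up blobs of 2-D "documents", roughly around (0,0) and (5,5).
random.seed(0)
points = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
          + [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(50)])
centers, clusters = kmeans(points, k=2)
```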

# Supervised Learning

When you already know something about your data, and you want to apply that knowledge to more data you know less about.

# Classification

You have 100 documents in two different categories. Predict the category for the next 5000 documents.

# Technique: Nearest Neighbor

1. Plot your known documents in your vector space.
2. Figure out the closest known to your unknown (geometrically)
3. Give the unknown the category of that closest known
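In miniature, with tiny made-up vectors standing in for real document vectors:

```python
# 1-nearest-neighbor classification: the unknown gets the category of
# the geometrically closest known document.

def nearest_neighbor(unknown, knowns):
    """knowns: list of (vector, category). Return the nearest category."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, category = min(knowns, key=lambda kc: dist2(unknown, kc[0]))
    return category

# Invented 2-D "documents" with known categories.
knowns = [((0.1, 0.9), "sports"), ((0.2, 0.8), "sports"),
          ((0.9, 0.1), "politics"), ((0.8, 0.3), "politics")]
print(nearest_neighbor((0.7, 0.2), knowns))  # politics
```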

# Technique: Linear Separation

1. Plot your known documents in your vector space.
2. Figure out a line separating the categories
3. Use that line to classify the unknowns
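One classic way to find such a line is the perceptron; this sketch uses invented 2-D points labelled +1 / -1.

```python
# Perceptron: nudge the line's coefficients toward each misclassified
# point until every known lands on the correct side.

def train_perceptron(data, epochs=100):
    """data: list of ((x, y), label) with label in {+1, -1}."""
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        for (x, y), label in data:
            if label * (w0 * x + w1 * y + b) <= 0:   # misclassified
                w0 += label * x
                w1 += label * y
                b += label
    return w0, w1, b

def classify(point, model):
    """Which side of the line w0*x + w1*y + b = 0 is the point on?"""
    w0, w1, b = model
    x, y = point
    return 1 if w0 * x + w1 * y + b > 0 else -1

# Made-up, linearly separable training data.
data = [((1, 1), 1), ((2, 1), 1), ((-1, -1), -1), ((-2, -1), -1)]
model = train_perceptron(data)
print(classify((3, 2), model))    # 1
print(classify((-3, -2), model))  # -1
```

Note this only works on linearly separable data, which is exactly the limitation the next slides get at.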

# Linear Separation: Step 1

# Linear Separation: Step 2

# Non-linearly separable data

# Technique: Support Vector Machines

# Technique: Naive Bayesian Classifiers

Not geometric, but statistical.

# Bayes' Theorem

Future probabilities derived from prior probabilities

# Bayes' Theorem

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

# Bayes' Theorem

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

(hint: it's not 95%)

# Bayes' Theorem

Answer: Depends on how many people use drugs.

# Bayes' Theorem

Answer: Depends on how many people use drugs.

If the rate of drug use is 1%, then we have:

|           | test positive | test negative |
|-----------|---------------|---------------|
| users     | 95% of 1%     | 5% of 1%      |
| non-users | 5% of 99%     | 95% of 99%    |

# Bayes' Theorem

Answer: Depends on how many people use drugs.

Share of positive results: 0.95% + 4.95% == 5.9%

Share of positive results that are correct: 0.95% / 5.9% == 16.1%
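The arithmetic above, spelled out. All the numbers come from the slide: 1% drug use, 95% test accuracy.

```python
# Bayes' theorem on the drug-test example: what fraction of positive
# results actually come from users?
p_user = 0.01     # rate of drug use
accuracy = 0.95   # test accuracy

p_positive_and_user = accuracy * p_user                # 95% of 1%  = 0.0095
p_positive_and_clean = (1 - accuracy) * (1 - p_user)   # 5% of 99%  = 0.0495
p_positive = p_positive_and_user + p_positive_and_clean  # 0.059

p_user_given_positive = p_positive_and_user / p_positive
print(round(p_user_given_positive, 3))  # 0.161
```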

# The basic process

1. Look at your data, figure out a good numeric representation
2. Turn your data into numbers (usually vectors of numbers)