Basic Machine Learning with SciKit-Learn

This is the fourth post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:

In this post we will look into the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders and I recommend using Keras, which is a package that provides a simple API on top of several underlying contenders like TensorFlow and PyTorch).

We'll proceed in this fashion:

  • give a brief overview of key terminology and the ML workflow
  • illustrate typical use of the Scikit-Learn API through some simple examples
  • discuss various metrics that can be used to evaluate ML models
  • dive deeper with some more complex examples
  • look at the various ways we can validate and improve our models
  • discuss the topic of feature engineering - ML models are good examples of "garbage in, garbage out", so cleaning our data and getting the right features is important
  • finally, summarize some of the main model techniques and their pros and cons

Read more…

Exploratory Data Analysis with NumPy and Pandas

This is the third post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:

This is an introduction to the NumPy and Pandas libraries that form the foundation of data science in Python. These libraries, especially Pandas, have a large API surface and many powerful features. There is no way to cover every topic in a short amount of time; in many cases we will just scratch the surface. But after this you should understand the fundamentals, have an idea of the overall scope, and have some pointers for extending your learning as you need more functionality.


We'll start by importing the numpy and pandas packages. Note the "as" aliases; it is conventional to use "np" for numpy and "pd" for pandas. If you are using the Anaconda Python distribution, as recommended for data science, these packages should already be available:

import numpy as np
import pandas as pd

We are going to do some plotting with the matplotlib and Seaborn packages. We want the plots to appear as cell outputs inline in Jupyter. To do that we need to run this next line:

Read more…

Using Jupyter

This is the second post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:

Jupyter is an interactive computing environment that allows users to create heterogeneous documents called notebooks that can mix executable code, markdown text with MathJax, multimedia, static and interactive charts, and more. A notebook is typically a complete and self-contained record of a computation, and can be converted to various formats and shared with others. Jupyter thus supports a form of literate programming. Several of the posts on this blog, including this one, were written as Jupyter notebooks. Jupyter is an extremely popular tool for doing data science in Python due to its interactive nature, good support for iterative and experimental computation, and ability to create a finished artifact combining both scientific text (with math) and code. It's easiest to start to understand this by looking at an example of a finished notebook.

Jupyter the application combines three components:

Read more…

The 5-Factor Model of Personality

Shankar Vedantam has a great NPR show/podcast, "The Hidden Brain", and occasional appearances on NPR's All Things Considered. In December he had a show on Evaluating Personality Tests. It was enjoyable, especially the Harry Potter Sorting Hat references, but I felt it was a missed opportunity because of the focus on Myers-Briggs, and the fact that he mentioned the Big-5 model only in passing.

In fact, Myers-Briggs is not taken very seriously in the psychology world, and I was surprised that Vedantam spent so much time on it, given his show's focus on research in psychology. The Big-5 model, on the other hand, is taken quite seriously, with many studies and papers based on it and evaluating it in various contexts (take a look, for example, at the Oxford University Press book I link to at the end of this post).

In the short form NPR segment, this was the section on Big-5 in its entirety:

VEDANTAM: Many personality researchers put greater stock in a test known as the Big Five [vs Myers-Briggs]. Grant says the Big Five has lots of peer reviewed data to back it up.

GRANT: We can predict your job performance, your effectiveness in a team with different collaborators, your likelihood of sticking around in a job versus leaving as well as your probability of your marriage surviving, depending on the personality fit between you and your spouse.

Read more…

A Python Crash Course

I've been teaching a crash course in data science with Python, which starts off with learning Python itself. The target audience is Java programmers (generally senior level), so it's assumed that things like classes and methods are well understood. The focus is mostly on what is different with Python. I teach it using Jupyter notebooks, but the content is useful as a blog post too, so here we go.

The other parts are:


Python's Origins

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language; he began implementing it in December 1989. It takes its name from Monty Python's Flying Circus.

Python is a dynamic language but is strongly typed (i.e. variables are untyped but refer to objects of fixed type).
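
To make that distinction concrete, here is a minimal sketch (the variable names are purely illustrative): a name can be rebound to objects of different types, but the objects themselves never change type, and mixing incompatible types raises an error rather than converting silently.

```python
# Dynamic typing: the name x is not tied to a type and can be rebound freely.
x = 42
type_before = type(x).__name__
x = "forty-two"
type_after = type(x).__name__

# Strong typing: objects keep their type, so adding a str and an int
# raises TypeError instead of coercing one operand to match the other.
try:
    result = "1" + 1
except TypeError as err:
    result = f"TypeError: {err}"

print(type_before, type_after)
print(result)
```

A Java programmer might expect `"1" + 1` to yield `"11"`; in Python the conversion has to be explicit, for example `"1" + str(1)`.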

Read more…

Blogging again

Well, it's been quite a while since I last blogged. My Zite project is not dead; it's actually up and running well as a personal aggregator but not ready for multi-user access, and I'm not sure when it might be. But I've been feeling a bit of an itch to start blogging again so here goes.

I have some material already lined up: I've been teaching an introductory data science bootcamp at work and thought the notebooks from that could be useful blog posts in and of themselves. So I'll start slowly publishing those while I write some new material. I'm also going to expand the scope of this blog; I'll still cover some tech topics but I'm going to fold in the content from my dormant math blog and retire it; this may inspire me to do some math blogging again. And I'll be throwing in some stuff on management and psychology too. So this will be a mishmash living up to the random forest name. I'll use categories to make it more accessible for those only interested in specific topics.

More soon!

Graham Wheeler on

Building a Zite Replacement (Part 11)

It's been a while since I worked on this but it is still on my mind a lot. I've been mulling over ways to improve categorization without the semi-supervised tweaking I've had to do.

Just to recap, currently this is what I am doing:

  • I have a bunch of 'category exemplars', which are sets of key terms associated with a category. These are the things which currently require some manual work;
  • for each article, I extract the plain text, normalize capitalization, remove stop words, then use tf-idf to extract the set of most significant terms (I'm not yet doing stemming although I'll probably start);
  • I then use a distance metric from the exemplars to assign category scores to the articles. Provided the score exceeds a threshold the article will be considered to be in the category.
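
The term-extraction step in the list above can be sketched roughly as follows. This is not the code the project actually uses: the stop-word list, the function names `tokenize` and `top_terms`, and the sample documents are all made up for illustration, and the tf-idf weighting is the plain textbook form (term frequency times the log of inverse document frequency).

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "with"}


def tokenize(text):
    """Normalize capitalization, split into words, and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_terms(docs, k=3):
    """Return the k highest-scoring tf-idf terms for each document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    results = []
    for tokens in token_lists:
        if not tokens:
            results.append([])
            continue
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


docs = [
    "The python interpreter runs python code",
    "The guitar is a string instrument",
    "Python code for a guitar tuner",
]
terms = top_terms(docs, k=2)
print(terms)
```

Terms that appear in many documents get a low inverse document frequency, so the terms that survive are the ones that distinguish an article from the rest of the corpus. Stemming, which is mentioned above as a likely next step, would slot into `tokenize`.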

Read more…

Graham Wheeler on

Using Jupyter as a Music Notebook

I recently started playing guitar again after a long absence and wanted to start making some notes in a digital form. Unfortunately, I didn't find any good tools. There is TeX of course, which can do anything, but I was hoping for something a bit more WYSIWYGy. There are some very good tools available for musical scores (MuseScore, Frescobaldi), but I want something that is more like a traditional notebook with lots of notes interspersed with occasional musical notation (in both traditional and tablature forms).

So an obvious potential candidate is Jupyter (née IPython), but it has no support for musical notation out of the box. But it is doable and in this post I'll walk through how I got it to work on my Mac. This is also my first attempt at using a Jupyter notebook as my blog post in Nikola so I'm killing two birds with one stone.

Read more…

Building a Zite Replacement (Part 10)

I've spent the past few days refining the web server, largely for diagnostic purposes, so it can replace the old Tkinter app. I can see articles for categories or feeds, their rank, and detailed information on why they received particular categories and ranks. This has enabled me to improve the categorization and ranking algorithms. I'm at a point now where I feel I need a lot more sources than the ~4000 I have right now, as well as more categories. The latter is more complex and will take me back to some of my earlier explorations in clustering, etc. The former largely involves mining more of the web to find useful sites.

Read more…

Graham Wheeler on

Building a Zite Replacement (Part 9)

Well, I hope you've all brushed your teeth after all that Halloween candy.

Today I'm going to show how I built a simple web server to view my feed articles using node.js and Express, along with MongoDB. I have a simple category classifier which finds the best Jaccard similarity (described earlier) to a set of category exemplars (i.e. 'pseudo-articles' for a category containing just key words that are typical for that category). It needs a lot of tuning, and the earlier Tkinter program was meant for that, but Tkinter proved to have problems. So time to use some more modern technologies!
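
For readers who have not seen the earlier parts, a classifier of this kind can be sketched in a few lines of Python. Everything specific here is an assumption for illustration rather than the project's real code: the exemplar key words, the `classify` helper, and the 0.2 cut-off are all made up.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical category exemplars: 'pseudo-articles' made only of key words.
EXEMPLARS = {
    "programming": {"python", "code", "compiler", "software"},
    "music": {"guitar", "chord", "melody", "song"},
}


def classify(terms, exemplars=EXEMPLARS, threshold=0.2):
    """Return (category, score) for the best-matching exemplar.

    The category is None when even the best score is below the threshold
    (the threshold value is an arbitrary choice for this sketch).
    """
    best_category, best_score = max(
        ((name, jaccard(terms, keys)) for name, keys in exemplars.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None, best_score
    return best_category, best_score


category, score = classify({"python", "code", "tutorial"})
print(category, score)
```

Because the score depends only on set overlap, the quality of the result rests almost entirely on the exemplar key words, which is why the tuning mentioned above matters so much.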

Read more…

Graham Wheeler on