The last post in this series covered the Five Factor Model of personality. In this post we'll dig into personality patterns that people can exhibit. Everyone has some combination of the five factors, but how does that combination manifest as a personality type?

There are many different models of personality types, but one used in psychology and psychoanalysis is the categorization in the DSM, the Diagnostic and Statistical Manual of Mental Disorders. This somewhat controversial publication categorizes a number of maladaptive personality types, and some schools of thought in psychoanalysis use similar categories, in adaptive and less extreme forms, to describe everyday personality types; the most common such forms are captured in the Psychodynamic Diagnostic Manual.

We're treading on some slippery ground here in my opinion, but as long as you treat this as just a model, it can offer some useful aggregate insights.

Graham Wheeler on

Jerry Muller recently wrote a popular book titled "The Tyranny of Metrics". He makes a number of good arguments for why metrics, if not used properly, can have unintended consequences. For example, the body count metric that the US military optimized for in the Vietnam war caused enormous damage while losing the hearts and minds of the populace and resulting in an ignominious defeat. Muller argues that metrics are too often used as a substitute for good judgment. The book is an excellent read.

So should we be ignoring metrics? Clearly not, but we need to be cognizant of what metrics we choose and how we use them. We should also distinguish between things which can meaningfully be measured quantitatively versus things that are more suited to qualitative analyses. And we should be wary of metrics being used as an instrument of control by those far removed from the "trenches", so to speak.

Assuming that we have a problem that can meaningfully be measured in a quantitative way, we need to make sure our metrics meet a number of criteria to be useful. Here are some guidelines:

• metrics should be actionable: they should tell you what you should be doing next. If you can't answer the question of what you would do if a metric changed, then it's probably not a good metric.
• metrics should be clearly and consistently defined: changing the definition of a metric can invalidate all the historical record and is very costly. Do the work upfront to make sure the metric is well-defined and measuring what you want, and then don't change the definition unless you can do so retroactively. Ensure that the metric is defined and used consistently across the business.
• metrics should be comparable over time (so it is useful to aggregate them over fixed periods like week-over-week or monthly - but be cognizant of seasonal effects).
• ratios are often better than absolute values as they are less affected by exogenous factors. Similarly, trends are more important than absolute values.
• metrics are most useful if they are leading indicators so you can take action early. For example, Work in Progress (WIP) is a leading indicator, while cycle time is a trailing indicator. Think of a highway: you can quickly tell if it is overcrowded, before you can tell how long your commute has been delayed.
• good metrics make up a hierarchy: your team's metrics should roll up to your larger division's metrics which should roll up to top-level business metrics.
• metrics should be in tension: you should try to find metrics that cannot be easily gamed without detrimentally affecting other metrics. Let's say I have a credit risk prediction model and my metric is the number of customers I predict are not credit risks but that default on their debt. I can just predict that every customer is high risk and my metric will look great, but that's bad for the business. So I need another metric that is in tension with this, such as the number of good customers I deny credit to, which must be minimized. More generally in prediction models we use the combination of precision and recall.
• metrics should capture different classes of attributes such as quality and throughput.
• you need to know when a deviation in a metric is a cause for concern. A good general guideline is that if a metric deviates more than two standard deviations from its mean over some defined period you should start paying attention, and if it deviates more than three you should be alarmed.
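As a minimal sketch of that last guideline (the function name and the sample history are illustrative, not from the original post), a deviation check might look like:

```python
from statistics import mean, stdev

def deviation_alert(history, latest):
    """Classify a new metric value by how far it sits from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    z = abs(latest - mu) / sigma
    if z > 3:
        return "alarm"   # more than three standard deviations out
    if z > 2:
        return "watch"   # more than two: start paying attention
    return "ok"

history = [100, 102, 98, 101, 99, 100, 103, 97]   # mean 100, stdev 2
print(deviation_alert(history, 110))   # 5 standard deviations out: alarm
print(deviation_alert(history, 105))   # 2.5 out: watch
print(deviation_alert(history, 101))   # well within normal range: ok
```

In practice you would pick the lookback window to match the seasonality of the metric, per the comparability guideline above.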

Thanks to Saujanya Shrivastava for many fruitful discussions over our metrics from which these guidelines emerged.


It is very common in modern software engineering organizations to use agile approaches to managing teamwork. At both Microsoft and eBay, teams I have managed have used Scrum, which is a reasonably simple and effective approach that offers a number of benefits, such as timeboxing, regular deployments (not necessarily continuous but at least periodic), a buffer between the team and unplanned work, an iterative continuous improvement process through retrospectives, and metrics that can quickly show whether the team is on track or not.

Data science work does not fit quite as well into the Scrum approach. I've heard of people advocating for its use, and even on my current team we initially tried to use Scrum for data science, but there are significant challenges. In particular, I like my Scrum teams to break work down into user stories sized so that the effort involved is under two days (ideally closer to half a day). Yes, we use story points, but once the team is calibrated fairly well it's still easy to aim for this. Trying to do this for data science work is much harder, especially for research work on building new models, which is very open-ended.

The approach I have taken with my team is an interesting hybrid that seems to be working quite well and is worth sharing.

This is the fourth post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:

In this post we will look into the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders; I recommend Keras, a package that provides a simple API on top of underlying frameworks like TensorFlow and PyTorch).

We'll proceed in this fashion:

• give a brief overview of key terminology and the ML workflow
• illustrate the typical use of the Scikit-Learn API through some simple examples
• discuss various metrics that can be used to evaluate ML models
• dive deeper with some more complex examples
• look at the various ways we can validate and improve our models
• discuss the topic of feature engineering - ML models are good examples of "garbage in, garbage out", so cleaning our data and getting the right features is important
• finally, summarize some of the main model techniques and their pros and cons
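To preview the uniform estimator API before diving in — construct, fit, then predict or score — here is a minimal example (the choice of the iris dataset and logistic regression is mine, not part of the bootcamp outline):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset and hold out a test set for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every Scikit-Learn estimator follows the same pattern: construct, fit, score
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # mean accuracy on the held-out data
```

Because every estimator exposes the same `fit`/`predict`/`score` methods, swapping in a different model is usually a one-line change.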

This is the third post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:

This is an introduction to the NumPy and Pandas libraries that form the foundation of data science in Python. These libraries, especially Pandas, have a large API surface and many powerful features. There is no way to cover every topic in a short amount of time; in many cases we will just scratch the surface. But after this you should understand the fundamentals, have an idea of the overall scope, and have some pointers for extending your learning as you need more functionality.

### Introduction

We'll start by importing the numpy and pandas packages. Note the "as" aliases; it is conventional to use "np" for numpy and "pd" for pandas. If you are using the Anaconda Python distribution, as recommended for data science, these packages should already be available:

import numpy as np
import pandas as pd


We are going to do some plotting with the matplotlib and Seaborn packages. We want the plots to appear as cell outputs inline in Jupyter. To do that we need to run this next line:
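The cell itself is not shown in this excerpt; the standard line for this purpose is the matplotlib inline magic (it runs in a Jupyter/IPython cell, not in a plain Python script):

```
%matplotlib inline
```
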

This is the second post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are:

Jupyter is an interactive computing environment that allows users to create heterogeneous documents called notebooks that can mix executable code, markdown text with MathJax, multimedia, static and interactive charts, and more. A notebook is typically a complete and self-contained record of a computation, and can be converted to various formats and shared with others. Jupyter thus supports a form of literate programming. Several of the posts on this blog, including this one, were written as Jupyter notebooks. Jupyter is an extremely popular tool for doing data science in Python due to its interactive nature, good support for iterative and experimental computation, and ability to create a finished artifact combining both scientific text (with math) and code. It's easiest to start to understand this by looking at an example of a finished notebook.

Jupyter the application combines three components:

Shankar Vedantam has a great NPR show/podcast, "The Hidden Brain", and occasional appearances on NPR's All Things Considered. In December he had a show on Evaluating Personality Tests. It was enjoyable, especially the Harry Potter Sorting Hat references, but I felt it was a missed opportunity because of the focus on Myers-Briggs, and the fact that he mentioned the Big-5 model only in passing.

In fact, Myers-Briggs is not taken very seriously in the psychology world, and Vedantam surprised me by spending so much time on it, given his show's focus on research in psychology. On the other hand, the Big-5 model is taken quite seriously, with many studies and papers based on it and evaluating it in various contexts (take a look, for example, at the Oxford University Press book I link to at the end of this post).

In the short form NPR segment, this was the section on Big-5 in its entirety:

VEDANTAM: Many personality researchers put greater stock in a test known as the Big Five [vs Myers-Briggs]. Grant says the Big Five has lots of peer reviewed data to back it up.

GRANT: We can predict your job performance, your effectiveness in a team with different collaborators, your likelihood of sticking around in a job versus leaving as well as your probability of your marriage surviving, depending on the personality fit between you and your spouse.

I've been teaching a crash course in data science with Python, which starts off with learning Python itself. The target audience is Java programmers (generally senior level), so it's assumed that things like classes and methods are well understood. The focus is mostly on what is different with Python. I teach it using Jupyter notebooks, but the content is useful as a blog post too, so here we go.

The other parts are:

### Introduction

#### Python's Origins

Python was conceived in the late 1980s, and its implementation began in December 1989 by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language. It takes its name from Monty Python's Flying Circus.

Python is a dynamic language but is strongly typed (i.e. variables are untyped but refer to objects of fixed type).
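A quick illustration of that distinction (a sketch with obviously artificial values):

```python
x = 42            # x currently refers to an int
x = "hello"       # rebinding to a str is fine: the variable itself is untyped
print(type(x).__name__)   # prints "str"

# ...but each object has a fixed type, and Python won't coerce implicitly:
try:
    "1" + 1
except TypeError:
    print("TypeError: can't concatenate str and int")
```

Contrast this with Java, where the variable carries the type, or with JavaScript, where `"1" + 1` silently coerces to `"11"`.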

Well, it's been quite a while since I last blogged. My Zite project is not dead; it's actually up and running well as a personal aggregator but not ready for multi-user access, and I'm not sure when it might be. But I've been feeling a bit of an itch to start blogging again so here goes.

I have some material already lined up: I've been teaching an introductory data science bootcamp at work and thought the notebooks from that could be useful blog posts in and of themselves. So I'll start slowly publishing those while I write some new material. I'm also going to expand the scope of this blog; I'll still cover some tech topics but I'm going to fold in the content from my dormant math blog and retire it; this may inspire me to do some math blogging again. And I'll be throwing in some stuff on management and psychology too. So this will be a mishmash living up to the random forest name. I'll use categories to make it more accessible for those only interested in specific topics.

More soon!


It's been a while since I worked on this but it is still on my mind a lot. I've been mulling over ways to improve categorization without the semi-supervised tweaking I've had to do.

Just to recap, currently this is what I am doing:

• I have a bunch of 'category exemplars', which are sets of key terms associated with a category. These are the things which currently require some manual work;
• for each article, I extract the plain text, normalize capitalization, remove stop words, then use tf-idf to extract the set of most significant terms (I'm not yet doing stemming although I'll probably start);
• I then use a distance metric from the exemplars to assign category scores to the articles. Provided a score exceeds a threshold, the article is considered to be in that category.
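A rough stdlib-only sketch of this pipeline (the function names, the cosine-style overlap measure, and the threshold are my illustrative choices; the actual distance metric isn't specified above):

```python
import math
from collections import Counter

def significant_terms(doc_terms, all_docs, top_n=5):
    """Rank a document's terms by tf-idf against the corpus and keep the top few."""
    n = len(all_docs)
    df = Counter(t for d in all_docs for t in set(d))   # document frequency
    tf = Counter(doc_terms)                             # term frequency
    scored = {t: (c / len(doc_terms)) * math.log(n / df[t]) for t, c in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

def category_score(article_terms, exemplar_terms):
    """Cosine-style overlap between an article's key terms and a category exemplar."""
    overlap = set(article_terms) & set(exemplar_terms)
    return len(overlap) / math.sqrt(len(article_terms) * len(exemplar_terms))

exemplar = ["python", "pandas", "numpy", "data"]        # manually curated key terms
article = ["python", "data", "plotting"]                # extracted from an article
score = category_score(article, exemplar)
print(score > 0.5)   # above the (arbitrary) threshold: assign the category
```

Adding stemming would mean normalizing each term before it enters either the exemplar sets or the tf-idf counts, so "plot" and "plotting" count as one term.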