Unit Tests that Don't Suck

Introduction

This post is based on a talk I gave to my team in an effort to establish a common approach to thinking about unit tests. Our existing code base suffered from a number of problems related to how tests were being written; despite good intentions, it is easy to do testing badly. In particular, here are some of the things I observed:

  • a massive overuse of dependency injection: pretty much all dependencies of all classes were being set up using DI. I believe this was related to the next point about the overuse of mocks. DI is certainly useful, especially when wiring up a high-level architecture, but it does not need to be used for everything;
  • a massive overuse of mocking. This was the main problem I observed. I went so far as to write a simple Python script to generate some metrics around this, and in the most egregious cases I found tests that created over 60 mocked classes. At that level it is very difficult to even know what is being tested; are you just testing that your mocks exhibit the behavior you specified? What real behavior is being tested?
  • an overuse of classes. This was TypeScript code; top-level functions are allowed, and can be encapsulated in namespaces; you don't have to put everything in a class;
  • treating unit tests as testing units of code rather than units of behavior. This meant far more internal implementation details were exposed than necessary, and there was too much "unit testing" and not enough functional and integration testing (a small sketch of behavior-focused testing follows this list);
  • disagreement within the team about which of the above were actually problems, and a resulting inconsistency of approach that was unproductive.
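
To make the "units of behavior" point concrete, here is a minimal sketch (in Python rather than TypeScript, purely for brevity; parse_order and its behavior are hypothetical). The test exercises observable behavior through the public interface, with no mocks and no knowledge of internal structure:

# A hypothetical function under test.
def parse_order(line: str) -> dict:
    """Parse a 'sku,quantity' line into an order record."""
    sku, quantity = line.split(",")
    return {"sku": sku.strip(), "quantity": int(quantity)}

# The test asserts on behavior at the public boundary; no mocks, no internals.
def test_parse_order_extracts_sku_and_quantity():
    assert parse_order("ABC-123, 4") == {"sku": "ABC-123", "quantity": 4}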

As a result, writing tests was hard, code reviews could be acrimonious, the design was overcomplicated and opaque, and the tests were not actually testing that much, despite the considerable effort going into writing them. It was clear that we needed to do better, and that meant starting out with a common approach and more education for junior developers who would otherwise just emulate existing patterns and practices. To address this, I started by giving a talk to the entire team on better testing practices. What follows is my opinion, but I can attest that after these practices were adopted for a large new piece of code, we saw major improvements in ease of test writing, effectiveness of tests, comprehensibility of tests, and overall architecture.

Read more…

Prioritization, Estimating and Planning

This post came out of a talk I gave to a group of mentees, prompted by questions they had around how to do estimation and how to know they were working on the right priorities. These are complex questions to which there are no single answers, but I aimed to give them some tools that could help.

Prioritizing

“If it’s a priority you’ll find a way. If it isn’t, you’ll find an excuse.” - Jim Rohn

Most prioritization techniques involve balancing "costs" against "benefits". How we define "cost" or "benefit" can vary. There are very complex ways of doing this, through, say, econometric models, but in most cases we can "SWAG" our way to a relative ordering using pretty basic two-dimensional models, of which we will look at a few here:
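
As a preview, here is a deliberately simplistic sketch of one such two-dimensional model in Python, ranking items by a rough value-to-effort ratio (the items and scores below are made up for illustration):

# Each item gets a rough value and effort score (1 = low, 5 = high).
items = [
    {"name": "Fix checkout bug", "value": 5, "effort": 2},
    {"name": "Redesign settings page", "value": 2, "effort": 4},
    {"name": "Add CSV export", "value": 3, "effort": 1},
]

# Rank by value-to-effort ratio; higher means more attractive to do first.
for item in sorted(items, key=lambda i: i["value"] / i["effort"], reverse=True):
    print(item["name"], round(item["value"] / item["effort"], 1))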

Read more…

Flow

"A bad system will beat a good person every time" - Edwards Deming

This post is based on a tech talk I gave at eBay in early 2018. eBay had gone through a company-wide transformation to agile processes (where before this had been team-specific), and the main point I wanted to make was that if we want to improve throughput or flow, we must make the hidden things that consume people's time visible, explicit, and properly prioritized. Note that this is not the same flow as Csíkszentmihályi's "being in the zone", although being in the zone can improve throughput.

It was a fun talk not least because of the inclusion of the "I Love Lucy" video. The original talk material was mostly bullet points; I have changed it to have a somewhat better narrative structure but haven't changed it too much.


We have too much Work in Progress (WIP)!

We take on too much work in progress (WIP) for many reasons, including:

  • It's hard to say no. We want to please people and be team players, and we defer to authority. So we say yes to random requests and our sprints are disrupted by unplanned work.
  • We frequently underestimate the effort involved in tasks, and we often have unclear priorities. So we end up having to scramble to meet deadlines because we have been working on the wrong things or gave bad dates that were communicated up and/or out and are now hard deadlines.
  • New things are often more fun to start than existing things are to finish, so we get to 90% done and then don't finish properly.

But too much WIP comes with a lot of associated problems:

Read more…

Personality Patterns

Photo by Rex Pickar on Unsplash

The last post in this series covered the Five Factor Model of personality. In this post we'll dig into personality patterns that people can exhibit. Everyone has some combination of the five factors, but how does that combination manifest as a personality type?

There are many different models of personality types, but one used in psychology and psychoanalysis is the categorization in the DSM - the Diagnostic and Statistical Manual of Mental Disorders. This is a somewhat controversial publication that describes a number of maladaptive personality categories, and some schools of thought in psychoanalysis use similar categories to describe the same personality types in their less extreme, adaptive forms; the most common of these are captured in the Psychodynamic Diagnostic Manual.

We're treading on some slippery ground here in my opinion, but as long as you consider this as just a model it can offer some useful aggregate insights.

Read more…

The "Tyranny" of Metrics

Photo by Carlos Muza on Unsplash

Jerry Muller recently wrote a popular book titled "The Tyranny of Metrics". He makes a number of good arguments for why metrics, if not used properly, can have unintended consequences. For example, the body count metric that the US military optimized for in the Vietnam war caused enormous damage, lost the hearts and minds of the populace, and contributed to an ignominious defeat. Muller argues that metrics are too often used as a substitute for good judgment. The book is an excellent read.

So should we be ignoring metrics? Clearly not, but we need to be cognizant of what metrics we choose and how we use them. We should also distinguish between things which can meaningfully be measured quantitatively versus things that are more suited to qualitative analyses. And we should be wary of metrics being used as an instrument of control by those far removed from the "trenches", so to speak.

Assuming that we have a problem that can meaningfully be measured in a quantitative way, we need to make sure our metrics meet a number of criteria to be useful. Here are some guidelines:

  • metrics should be actionable: they should tell you what you should be doing next. If you can't answer the question of what you would do if a metric changed, then it's probably not a good metric.
  • metrics should be clearly and consistently defined: changing the definition of a metric can invalidate all the historical record and is very costly. Do the work upfront to make sure the metric is well-defined and measuring what you want, and then don't change the definition unless you can do so retroactively. Ensure that the metric is defined and used consistently across the business.
  • metrics should be comparable over time (so it is useful to aggregate them over fixed periods like week-over-week or monthly - but be cognizant of seasonal effects).
  • ratios are often better than absolute values as they are less affected by exogenous factors. Similarly, trends are more important than absolute values.
  • metrics are most useful if they are leading indicators so you can take action early. For example, Work in Progress (WIP) is a leading indicator, while cycle time is a trailing indicator. Think of a highway: you can quickly tell if it is overcrowded, before you can tell how long your commute has been delayed.
  • good metrics make up a hierarchy: your team's metrics should roll up to your larger division's metrics which should roll up to top-level business metrics.
  • metrics should be in tension: you should try to find metrics that cannot be easily gamed without detrimentally affecting other metrics. Let's say I have a credit risk prediction model and my metric is the number of customers I predict are not credit risks but who then default on their debt. I can just predict that every customer is high risk and my metric will look great, but that's bad for the business. So I need another metric that is in tension with this, such as the number of good customers I deny credit to, which must be minimized. More generally, in prediction models we use the combination of precision and recall (see the sketch after this list).
  • metrics should capture different classes of attributes such as quality and throughput.
  • you need to know when a deviation in a metric is a cause for concern. A good general guideline is that if a metric deviates more than two standard deviations from the mean over some defined period, you should start paying attention, and if it deviates more than three standard deviations, you should be alarmed.
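
To make two of these points concrete, here is a small Python sketch: the first part computes precision and recall with scikit-learn (the labels are made up for illustration), and the second applies the two/three standard deviation rule to a hypothetical weekly metric:

from sklearn.metrics import precision_score, recall_score
import statistics

# Made-up labels: 1 = defaulter, 0 = good customer.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]
print("precision:", precision_score(y_true, y_pred))  # of those flagged, how many really default
print("recall:", recall_score(y_true, y_pred))        # of the defaulters, how many we caught

# Hypothetical weekly values of some metric; check how far the latest value deviates.
history = [102, 98, 101, 97, 103, 99, 100, 98]
mean, sd = statistics.mean(history), statistics.stdev(history)
latest = 91
if abs(latest - mean) > 3 * sd:
    print("alarmed")
elif abs(latest - mean) > 2 * sd:
    print("start paying attention")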

Thanks to Saujanya Shrivastava for many fruitful discussions over our metrics from which these guidelines emerged.

Managing Engineering and Data Science Agile Teams

It is very common in modern software engineering organizations to use agile approaches to managing teamwork. At both Microsoft and eBay teams I have managed have used Scrum, which is a reasonably simple and effective approach that offers a number of benefits, such as timeboxing, regular deployments (not necessarily continuous but at least periodic), a buffer between the team and unplanned work, an iterative continuous improvement process through retrospectives, and metrics that can quickly show whether the team is on track or not.

Data science work does not fit quite as well into the Scrum approach. I've heard of people advocating for its use, and even on my current team we initially tried to use Scrum for data science, but there are significant challenges. In particular, I like my Scrum teams to break work down into user stories small enough that the effort involved is under two days (ideally closer to half a day). Yes, we use story points, but once the team is reasonably well calibrated it's still easy to aim for this. Trying to do this for data science work is much harder, especially for open-ended research work such as building new models.

The approach I have taken with my team is an interesting hybrid that seems to be working quite well and is worth sharing.

Read more…

Basic Machine Learning with SciKit-Learn

This is the fourth post in a series based on the Python for Data Science bootcamp I run occasionally at eBay. The other posts are:

In this post we will look into the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders; I recommend using Keras, a package that provides a simple API on top of underlying frameworks like TensorFlow and PyTorch).

We'll proceed in this fashion:

  • give a brief overview of key terminology and the ML workflow
  • illustrate typical use of the SciKit-Learn API through some simple examples (a minimal example also follows this list)
  • discuss various metrics that can be used to evaluate ML models
  • dive deeper with some more complex examples
  • look at the various ways we can validate and improve our models
  • discuss the topic of feature engineering - ML models are good examples of "garbage in, garbage out", so cleaning our data and getting the right features is important
  • finally, summarize some of the main model techniques and their pros and cons
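
As a taste of what the typical workflow looks like before we dive in, here is the canonical fit/predict pattern on one of scikit-learn's built-in datasets (just a sketch; the post works through richer examples):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a simple classifier and evaluate it on the held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))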

Read more…

Exploratory Data Analysis with NumPy and Pandas

This is the third post in a series based on the Python for Data Science bootcamp I run occasionally at eBay. The other posts are:

This is an introduction to the NumPy and Pandas libraries that form the foundation of data science in Python. These libraries, especially Pandas, have a large API surface and many powerful features. There is no way to cover every topic in a short amount of time; in many cases we will just scratch the surface. But after this you should understand the fundamentals, have an idea of the overall scope, and have some pointers for extending your learning as you need more functionality.

Introduction

We'll start by importing the numpy and pandas packages. Note the "as" aliases; it is conventional to use "np" for numpy and "pd" for pandas. If you are using the Anaconda Python distribution, as recommended for data science, these packages should already be available:

In [1]:
import numpy as np
import pandas as pd

We are going to do some plotting with the matplotlib and Seaborn packages. We want the plots to appear as cell outputs inline in Jupyter. To do that we need to run this next line:
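
For reference, a conventional setup for inline plotting in a classic Jupyter notebook looks like the following (an assumption about the typical pattern, not necessarily the exact cell from the full post):

import matplotlib.pyplot as plt
import seaborn as sns

# IPython magic that renders matplotlib (and Seaborn) figures inline in the notebook output.
%matplotlib inline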

Read more…

Using Jupyter

This is the second post in a series based on the Python for Data Science bootcamp I run occasionally at eBay. The other posts are:

Jupyter is an interactive computing environment that allows users to create heterogeneous documents called notebooks that can mix executable code, markdown text with MathJax, multimedia, static and interactive charts, and more. A notebook is typically a complete and self-contained record of a computation, and can be converted to various formats and shared with others. Jupyter thus supports a form of literate programming. Several of the posts on this blog, including this one, were written as Jupyter notebooks. Jupyter is an extremely popular tool for doing data science in Python due to its interactive nature, good support for iterative and experimental computation, and ability to create a finished artifact combining both scientific text (with math) and code. It's easiest to start to understand this by looking at an example of a finished notebook.

Jupyter the application combines three components:

Read more…

The 5-Factor Model of Personality

Shankar Vedantam has a great NPR show/podcast, "The Hidden Brain", and occasional appearances on NPR's All Things Considered. In December he had a show on Evaluating Personality Tests. It was enjoyable, especially the Harry Potter Sorting Hat references, but I felt it was a missed opportunity because of the focus on Myers-Briggs, and the fact that he mentioned the Big-5 model only in passing.

In fact, Myers-Briggs is not taken very seriously in the psychology world, and I was surprised that Vedantam spent so much time on it, given his show's focus on psychological research. The Big-5 model, on the other hand, is taken quite seriously, with many studies and papers based on it and evaluating it in various contexts (take a look, for example, at the Oxford University Press book I link to at the end of this post).

In the short form NPR segment, this was the section on Big-5 in its entirety:

VEDANTAM: Many personality researchers put greater stock in a test known as the Big Five [vs Myers-Briggs]. Grant says the Big Five has lots of peer reviewed data to back it up.

GRANT: We can predict your job performance, your effectiveness in a team with different collaborators, your likelihood of sticking around in a job versus leaving as well as your probability of your marriage surviving, depending on the personality fit between you and your spouse.

Read more…