Graham Wheeler's Random Forest

Stuff about stuff

GitHub reports for backlog management

A couple of years ago we decided we wanted to make sure that we were responding timeously to issues that users created on our GitHub repos. In particular, we thought a three-business-day SLA would be appropriate. Making sure that we met it day after day could be a bit tedious, so I thought it made sense to automate it. We had a vast trove of GitHub data in our Kusto data warehouse for every repository owned by Microsoft.
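As a rough illustration of the kind of check such a report automates, here is a minimal sketch of a three-business-day SLA test in Python. The function names and the weekend-only calendar are assumptions made for illustration; this is not the code from the post, it ignores public holidays, and it does not touch Kusto or the GitHub API.

```python
from datetime import date, timedelta

SLA_BUSINESS_DAYS = 3  # the SLA length mentioned in the post


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping Sat/Sun.

    Assumption: only weekends are non-working days (no holiday calendar).
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def is_overdue(created: date, today: date, responded: bool) -> bool:
    """An issue is overdue if it has no response and its due date has passed."""
    due = add_business_days(created, SLA_BUSINESS_DAYS)
    return not responded and today > due
```

An issue opened on a Thursday would, under these assumptions, fall due the following Tuesday, because the weekend does not count toward the three days.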

Moving my blog to Hugo

I have been using Nikola for about the past 8 years for my blog, but have been eyeing the development of Hugo and thinking I might want to migrate, and have finally done it. There’s nothing wrong with Nikola; I think it’s actually less work than Hugo because it handles .ipynb Jupyter notebooks very seamlessly, but Hugo is super-fast so you can work in a ‘live-reload’ mode, which I like. So this weekend I finally did it.

Unit Tests that Don't Suck

Introduction This post is based on a talk I gave to my team in an effort to establish a common approach to thinking about unit tests. The existing code base we had suffered from a number of problems relating to how tests were being written; despite good intentions, it can be easy to do testing badly. In particular, here are some of the things I observed: a massive overuse of dependency injection: pretty much all dependencies of all classes were being set up using DI.

Prioritization, Estimating and Planning

This post came out of a talk I gave to a group of mentees, prompted by questions they had around how to do estimation and how to know they were working on the right priorities. These are complex questions to which there are no single answers, but I aimed to give them some tools that could help. Prioritizing “If it’s a priority you’ll find a way. If it isn’t, you’ll find an excuse.

Flow

“A bad system will beat a good person every time” - Edwards Deming This post is based on a tech talk I gave at eBay in early 2018. eBay had gone through a company-wide transformation to agile processes (where before this had been team-specific) and the main points I wanted to make were that it was important to make the hidden things that consumed people’s time visible, explicit, and properly prioritized, if we want to improve throughput or flow.

Personality Patterns

The last post in this series covered the Five Factor Model of personality. In this post we’ll dig into personality patterns that people can exhibit. Everyone has some combination of the five factors, but how does that combination manifest as a personality type? There are many different models of personality types, but one used in psychology and psychoanalysis is the categorization in the DSM - the Diagnostic and Statistical Manual of Mental Disorders.

The "Tyranny" of Metrics

Jerry Muller recently wrote a popular book titled “The Tyranny of Metrics”. He makes a number of good arguments for why metrics, if not used properly, can have unintended consequences. For example, the body count metric that the US military optimized for in the Vietnam war caused enormous damage while losing the hearts and minds of the populace and resulting in an ignominious defeat. Muller argues that metrics are too often used as a substitute for good judgment.

Managing Engineering and Data Science Agile Teams

It is very common in modern software engineering organizations to use agile approaches to managing teamwork. At both Microsoft and eBay teams I have managed have used Scrum, which is a reasonably simple and effective approach that offers a number of benefits, such as timeboxing, regular deployments (not necessarily continuous but at least periodic), a buffer between the team and unplanned work, an iterative continuous improvement process through retrospectives, and metrics that can quickly show whether the team is on track or not.

Basic Machine Learning with Scikit-Learn

This is the fourth post in a series based off my [Python for Data Science bootcamp](https://github.com/gramster/pythonbootcamp) I run at eBay occasionally. The other posts are: a Python crash course, using Jupyter, and exploratory data analysis. In this post we will look into the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders and I recommend using Keras, which is a package that provides a simple API on top of several underlying contenders like TensorFlow and PyTorch).

Exploratory Data Analysis with NumPy and Pandas

This is the third post in a series based off my Python for Data Science bootcamp I run at eBay occasionally. The other posts are: a Python crash course, using Jupyter, and introductory machine learning. This is an introduction to the NumPy and Pandas libraries that form the foundation of data science in Python. These libraries, especially Pandas, have a large API surface and many powerful features. There is no way in a short amount of time to cover every topic; in many cases we will just scratch the surface.