Graham Wheeler's Random Forest

Stuff about stuff

The "Tyranny" of Metrics

Jerry Muller recently wrote a popular book titled “The Tyranny of Metrics”. He makes a number of good arguments for why metrics, if not used properly, can have unintended consequences. For example, the body count metric that the US military optimized for in the Vietnam War caused enormous damage while losing the hearts and minds of the populace and resulting in an ignominious defeat. Muller argues that metrics are too often used as a substitute for good judgment.

Managing Engineering and Data Science Agile Teams

It is very common in modern software engineering organizations to use agile approaches to managing teamwork. At both Microsoft and eBay, the teams I have managed have used Scrum, which is a reasonably simple and effective approach that offers a number of benefits, such as timeboxing, regular deployments (not necessarily continuous but at least periodic), a buffer between the team and unplanned work, an iterative continuous improvement process through retrospectives, and metrics that can quickly show whether the team is on track or not.

Basic Machine Learning with SciKit-Learn

This is the fourth post in a series based on the [Python for Data Science bootcamp](https://github.com/gramster/pythonbootcamp) I run at eBay occasionally. The other posts are: a Python crash course, using Jupyter, and exploratory data analysis. In this post we will look into the basics of building ML models with Scikit-Learn. Scikit-Learn is the most widely used Python library for ML, especially outside of deep learning (where there are several contenders and I recommend using Keras, which is a package that provides a simple API on top of several underlying contenders like TensorFlow and PyTorch).

Exploratory Data Analysis with NumPy and Pandas

This is the third post in a series based on the Python for Data Science bootcamp I run at eBay occasionally. The other posts are: a Python crash course, using Jupyter, and introductory machine learning. This is an introduction to the NumPy and Pandas libraries that form the foundation of data science in Python. These libraries, especially Pandas, have a large API surface and many powerful features. There is no way to cover every topic in a short amount of time; in many cases we will just scratch the surface.

Using Jupyter

This is the second post in a series based on the Python for Data Science bootcamp I run at eBay occasionally. The other posts are: a Python crash course, exploratory data analysis, and introductory machine learning. Jupyter is an interactive computing environment that allows users to create heterogeneous documents called notebooks that can mix executable code, markdown text with MathJax, multimedia, static and interactive charts, and more. A notebook is typically a complete and self-contained record of a computation, and can be converted to various formats and shared with others.

The 5-Factor Model of Personality

Shankar Vedantam has a great NPR show/podcast, “The Hidden Brain”, and makes occasional appearances on NPR’s All Things Considered. In December he had a show on Evaluating Personality Tests. It was enjoyable, especially the Harry Potter Sorting Hat references, but I felt it was a missed opportunity because of the focus on Myers-Briggs, and the fact that he mentioned the Big-5 model only in passing. In fact, Myers-Briggs is not taken very seriously in the psychology world, and Vedantam surprised me by spending so much time on it, given his show’s focus on research in psychology.

A Python Crash Course

I’ve been teaching a crash course in data science with Python, which starts off with learning Python itself. The target audience is Java programmers (generally senior level), so it’s assumed that things like classes and methods are well understood. The focus is mostly on what is different with Python. I teach it using Jupyter notebooks, but the content is useful as a blog post too, so here we go. The other parts are:

Blogging again

Well, it’s been quite a while since I last blogged. My Zite project is not dead; it’s actually up and running well as a personal aggregator but not ready for multi-user access, and I’m not sure when it might be. But I’ve been feeling a bit of an itch to start blogging again so here goes. I have some material already lined up: I’ve been teaching an introductory data science bootcamp at work and thought the notebooks from that could be useful blog posts in and of themselves.

Building a Zite Replacement (Part 11)

It’s been a while since I worked on this but it is still on my mind a lot. I’ve been mulling over ways to improve categorization without the semi-supervised tweaking I’ve had to do. Just to recap, currently this is what I am doing: I have a bunch of ‘category exemplars’, which are sets of key terms associated with a category. These are the things which currently require some manual work; for each article, I extract the plain text, normalize capitalization, remove stop words, then use tf-idf to extract the set of most significant terms (I’m not yet doing stemming although I’ll probably start); I then use a distance metric from the exemplars to assign category scores to the articles.
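The recap above can be sketched roughly in code. This is a minimal illustration, not the code from the project: the stop-word list, the exemplar sets, the function names, the smoothed tf-idf formula, and the choice of Jaccard similarity as the "distance metric" are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and category exemplars, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}

EXEMPLARS = {
    "python": {"python", "pandas", "numpy", "jupyter"},
    "music": {"guitar", "chord", "tablature", "score"},
}


def tokenize(text):
    """Normalize capitalization, split into words, and remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_terms(docs, k=5):
    """Return, for each document, the set of k terms with the highest tf-idf weight."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumed variant) so common terms keep a non-zero weight.
        weights = {
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Rank by weight, breaking ties alphabetically so the result is deterministic.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        result.append(set(ranked[:k]))
    return result


def category_scores(terms, exemplars):
    """Score each category by the Jaccard similarity between the two term sets."""
    return {
        name: len(terms & exemplar) / len(terms | exemplar)
        for name, exemplar in exemplars.items()
    }


if __name__ == "__main__":
    articles = [
        "Pandas and NumPy in a Jupyter notebook: a Python tutorial",
        "Learning guitar chord shapes with tablature",
    ]
    for article, terms in zip(articles, top_terms(articles)):
        scores = category_scores(terms, EXEMPLARS)
        print(article, "->", max(scores, key=scores.get), scores)
```

Stemming is left out here, as in the description above. With a set-based similarity like this one, the manual work is concentrated in choosing the exemplar terms, which is the part that needs tweaking.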

Using Jupyter as a Music Notebook

I recently started playing guitar again after a long absence and wanted to start making some notes in a digital form. Unfortunately, I didn’t find any good tools. There is TeX of course, which can do anything, but I was hoping for something a bit more WYSIWYGy. There are some very good tools available for musical scores (MuseScore, Frescobaldi), but I want something that is more like a traditional notebook with lots of notes interspersed with occasional musical notation (in both traditional and tablature forms).