Graham Wheeler's Random Forest

Stuff about stuff

Using Jupyter

This is the second post in a series based on the Python for Data Science bootcamp I occasionally run at eBay. The other posts are: a Python crash course, exploratory data analysis, and introductory machine learning. Jupyter is an interactive computing environment that allows users to create heterogeneous documents called notebooks that can mix executable code, markdown text with MathJax, multimedia, static and interactive charts, and more. A notebook is typically a complete and self-contained record of a computation, and can be converted to various formats and shared with others.

The 5-Factor Model of Personality

Shankar Vedantam has a great NPR show/podcast, “The Hidden Brain”, and makes occasional appearances on NPR’s All Things Considered. In December he had a show on Evaluating Personality Tests. It was enjoyable, especially the Harry Potter Sorting Hat references, but I felt it was a missed opportunity because of the focus on Myers-Briggs, and the fact that he mentioned the Big-5 model only in passing. In fact, Myers-Briggs is not taken very seriously in the psychology world, and Vedantam surprised me by spending so much time on it, given his show’s focus on research in psychology.

A Python Crash Course

I’ve been teaching a crash course in data science with Python, which starts off with learning Python itself. The target audience is Java programmers (generally senior level), so it’s assumed that things like classes and methods are well understood. The focus is mostly on what is different with Python. I teach it using Jupyter notebooks, but the content is useful as a blog post too, so here we go. The other parts are:

Blogging again

Well, it’s been quite a while since I last blogged. My Zite project is not dead; it’s actually up and running well as a personal aggregator but not ready for multi-user access, and I’m not sure when it might be. But I’ve been feeling a bit of an itch to start blogging again so here goes. I have some material already lined up: I’ve been teaching an introductory data science bootcamp at work and thought the notebooks from that could be useful blog posts in and of themselves.

Building a Zite Replacement (Part 11)

It’s been a while since I worked on this but it is still on my mind a lot. I’ve been mulling over ways to improve categorization without the semi-supervised tweaking I’ve had to do. Just to recap, this is what I am currently doing: I have a bunch of ‘category exemplars’, which are sets of key terms associated with a category (these are the things that currently require some manual work); for each article, I extract the plain text, normalize capitalization, remove stop words, then use tf-idf to extract the set of most significant terms (I’m not yet doing stemming, although I’ll probably start); I then use a distance metric from the exemplars to assign category scores to the articles.
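
The term-extraction step in that recap can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code the project actually uses: the stop-word list, the function names (`tokenize`, `top_terms`) and the choice of `k` are all assumptions made for the example, and the tf-idf variant shown (relative term frequency times log inverse document frequency) is just one common formulation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Normalize capitalization, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_terms(docs, k=3):
    """Return, for each document, the set of its k highest-scoring tf-idf terms."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        scores = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(set(ranked[:k]))
    return result
```

Terms that appear in every document get a score of zero, which is why common words drop out of the significant-term sets even when they are not in the stop-word list.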

Using Jupyter as a Music Notebook

I recently started playing guitar again after a long absence and wanted to start making some notes in a digital form. Unfortunately, I didn’t find any good tools. There is TeX of course, which can do anything, but I was hoping for something a bit more WYSIWYGy. There are some very good tools available for musical scores (MuseScore, Frescobaldi), but I want something that is more like a traditional notebook with lots of notes interspersed with occasional musical notation (in both traditional and tablature forms).

Building a Zite Replacement (Part 10)

I’ve spent the past few days refining the web server, largely for diagnostic purposes, so it can replace the old TkInter app. I can see articles for categories or feeds, their rank, and detailed information on why they received particular categories and ranks. This has enabled me to improve the categorization and ranking algorithms. I’m at a point now where I feel I need a lot more sources than the ~4000 I have right now, as well as more categories.

Building a Zite Replacement (Part 9)

Well, I hope you’ve all brushed your teeth after all that Halloween candy. Today I’m going to show how I built a simple web server to view my feed articles using node.js and Express, along with MongoDB. I have a simple category classifier which finds the best Jaccard similarity (described earlier) to a set of category exemplars (i.e. ‘pseudo-articles’ for a category containing just key words that are typical for that category).
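
For readers who have not seen the earlier post, a classifier of this kind might look something like the sketch below. It is written in Python for illustration only; the exemplar categories, their key words, and the names `jaccard` and `best_category` are all hypothetical rather than taken from the project.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_category(article_terms, exemplars):
    """Return (category, score) for the exemplar most similar to the article."""
    scores = {name: jaccard(article_terms, terms) for name, terms in exemplars.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical exemplars: each category is a set of typical key words.
exemplars = {
    "programming": {"python", "code", "compiler"},
    "music": {"guitar", "chord", "tablature"},
}

category, score = best_category({"python", "code", "jupyter"}, exemplars)
```

Here the article shares two of four distinct terms with the "programming" exemplar and none with "music", so it would be assigned to "programming".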

Building a Zite Replacement (Part 8)

Happy Halloween, all! I’m sitting here handing out candy and glow necklaces to all comers, so it’s a good time to write a new post. It’s been a while since much happened, as I’ve been really busy with the beta release of Google Cloud Datalab, which is my day job. But now that is out, and with the weekend and lousy weather here in the Pacific Northwest, it’s been a good day to get back to things.

Node, npm and Express

Things have been slow on the blogging front but there has been progress on the Zite replacement. I’ll write more about that soon but part of what I have been doing is looking into what server-side technology to use. As far as a database goes, this seems like a no-brainer. I’m dealing with JSON documents that I can either spend some effort on normalizing to put into a SQL database, or simply keep them as is and put them in a database that supports that form, and the obvious choice then is MongoDB, which uses a binary form of JSON.