Graham Wheeler's Random Forest

Stuff about stuff

Using Jupyter as a Music Notebook

I recently started playing guitar again after a long absence and wanted to start making some notes in a digital form. Unfortunately, I didn’t find any good tools. There is TeX of course, which can do anything, but I was hoping for something a bit more WYSIWYGy. There are some very good tools available for musical scores (MuseScore, Frescobaldi), but I want something that is more like a traditional notebook with lots of notes interspersed with occasional musical notation (in both traditional and tablature forms).

Building a Zite Replacement (Part 10)

I’ve spent the past few days refining the web server, largely for diagnostic purposes, so it can replace the old TkInter app. I can see articles for categories or feeds, their rank, and detailed information on why they received particular categories and ranks. This has enabled me to improve the categorization and ranking algorithms. I’m at a point now where I feel I need a lot more sources than the ~4000 I have right now, as well as more categories.

Building a Zite Replacement (Part 9)

Well, I hope you’ve all brushed your teeth after all that Halloween candy. Today I’m going to show how I built a simple web server to view my feed articles using node.js and Express, along with MongoDB. I have a simple category classifier which finds the best Jaccard similarity (described earlier) to a set of category exemplars (i.e. ‘pseudo-articles’ for a category containing just key words that are typical for that category).
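As a rough illustration of that classifier idea, here is a minimal Python sketch. The function names, the sample exemplars and the sample article terms are all made up for this example; it assumes an exemplar is a dict with "title" and "terms" keys and is not the actual code behind the server.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # define the similarity of two empty sets as 0
    return len(a & b) / len(a | b)


def best_category(article_terms, exemplars):
    """Return (title, score) for the exemplar most similar to the article.

    Returns (None, 0.0) if no exemplar shares any term with the article.
    """
    best_title, best_score = None, 0.0
    for exemplar in exemplars:
        score = jaccard(article_terms, exemplar["terms"])
        if score > best_score:
            best_title, best_score = exemplar["title"], score
    return best_title, best_score


# Hypothetical exemplars and article terms, for illustration only.
exemplars = [
    {"title": "Literature", "terms": ["novel", "writer", "plot", "character", "author"]},
    {"title": "Cooking", "terms": ["recipe", "flour", "sugar", "butter", "bake"]},
]
article_terms = ["recipe", "butter", "sugar", "oven", "novel"]

title, score = best_category(article_terms, exemplars)
print(title, round(score, 3))
```

Here the article shares three terms with the hypothetical Cooking exemplar and only one with Literature, so Cooking wins.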

Building a Zite Replacement (Part 8)

Happy Halloween, all! I’m sitting here handing out candy and glow necklaces to all comers so it’s a good time to write a new post. It’s been a while since much happened as I’ve been really busy with the beta release of Google Cloud Datalab, which is my day job. But now that is out, it’s the weekend, and the weather here in the Pacific Northwest is lousy, so it’s been a good day to get back to things.

Node, npm and Express

Things have been slow on the blogging front but there has been progress on the Zite replacement. I’ll write more about that soon but part of what I have been doing is looking into what server-side technology to use. As far as a database goes, this seems like a no-brainer. I’m dealing with JSON documents that I can either spend some effort on normalizing to put into a SQL database, or simply keep them as is and put them in a database that supports that form, and the obvious choice then is MongoDB, which uses a binary form of JSON.

Building a Zite Replacement (Part 7)

It’s been a while since the last post but I haven’t been idle. Here are some of the things I’ve been up to: tweaking the code to parse content better; moving from IPython notebook to a library that I can use to do batch operations as well as interactive exploration; and modifying the code to do parallel fetches, or more precisely, to operate asynchronously (because of the Python GIL I still have just one thread for now).

Building a Zite Replacement (Part 6)

Following on from last episode, I took some of the clusters that had clear cohesion and made some initial category exemplars. Here are the first few:

    #!python
    {"title": "Art", "terms": ["canvas", "painting", "pastels", "sculpture", "gallery", "photography", "landscape", "portrait", "still-life", "exhibition", "sketch"]}
    {"title": "Literature", "terms": ["novel", "writer", "plot", "character", "author"]}
    {"title": "Religion", "terms": ["Jesus", "Christianity", "Allah", "Islam", "Judaism", "Sufi", "Hindu", "karma", "sprituality", "faith", "belief", "priest", "pastor", "prayer"]}
    {"title": "Cooking", "terms": ["ingredients", "bake", "roast", "fry", "stir", "cook", "cooking", "recipe", "flour", "sugar", "butter", "cups", "cup", "teaspoon", "tablespoons", "vanilla"]}

Note that these are deliberately in the same format as the articles in articles.

Building a Zite Replacement (Part 5)

My initial experience with clustering was somewhat disappointing. It’s clear I need to do some tuning of the approach. The first thing I did was to rerun the article download process, but instead of just keeping the top ten terms and dropping their TF-IDF values, I kept them all. I think there are better ways to select the terms to use for Jaccard similarity. For starters, using a fixed number of terms could lead to keeping a wildly different range of TF-IDF values for different articles.
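To make the concern about a fixed term count concrete, here is a small Python sketch contrasting a top-k cut with one possible alternative, a cutoff relative to the article's highest TF-IDF value. The relative cutoff is only an assumption used for illustration, not necessarily the approach the posts settle on, and the function names and scores are hypothetical.

```python
def top_k_terms(tfidf, k=10):
    """Keep the k highest-scoring terms, whatever their scores are."""
    ranked = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def relative_threshold_terms(tfidf, fraction=0.5):
    """Keep terms scoring at least `fraction` of the article's top score."""
    if not tfidf:
        return []
    cutoff = max(tfidf.values()) * fraction
    ranked = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, score in ranked if score >= cutoff]


# Hypothetical TF-IDF scores: three strong terms and one weak one.
scores = {"vaccines": 0.9, "autism": 0.85, "doses": 0.8, "debate": 0.1}

print(top_k_terms(scores, k=4))          # keeps the weak term too
print(relative_threshold_terms(scores))  # drops terms far below the top score
```

With a fixed k of 4 the weak term "debate" is kept alongside terms scoring eight or nine times higher, while the relative cutoff keeps only the three strong terms.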

Building a Zite Replacement (Part 4)

Following my last post, I started gathering URLs of feeds to use for sample data. First I scraped the links that I had saved in Pocket (a scarily large number). It didn’t seem like Pocket had an easy way to export this, so I loaded up Pocket in Chrome, scrolled and scrolled and scrolled until I could scroll no more, then saved the resulting web page once it was done loading.
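One way to pull the links out of a saved page like that is sketched below, using only Python's standard library `html.parser`. The class and function names are made up for this example, and it assumes the saved links appear as ordinary `<a href="...">` tags with absolute http(s) URLs; the real Pocket page markup may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html_text):
    """Return the unique links in the page, in order of first appearance."""
    parser = LinkCollector()
    parser.feed(html_text)
    seen = set()
    unique = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique


# Hypothetical page fragment, for illustration only.
sample = (
    '<a href="https://example.com/a">first</a>'
    '<a href="#top">back to top</a>'
    '<a href="https://example.com/a">duplicate</a>'
    '<a href="http://example.org/b">second</a>'
)
print(extract_links(sample))
```

For a real saved page, the text would come from reading the saved HTML file rather than from an inline string.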

Building a Zite Replacement (Part 3)

Since yesterday’s post on term extraction, I’ve made a few tweaks. In particular I only adjust capitalization on the first words of sentences, I’m keeping numbers and hyphenation, and if there are consecutive capitalized words I turn them into single terms. For example, the terms for the Donald Trump on vaccines article have changed from:

    vaccines Donald Trump children doses effective vaccinations diseases Carson debate

to:

    vaccines children Donald Trump doses effective vaccinations diseases smaller vaccination debate babies autism cause schedule studies

I’m not sure why ‘Carson’ was dropped; it’s possible that the text of the article changed between the two runs.
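The tweak of turning consecutive capitalized words into single terms could look something like the following Python sketch. The function name is hypothetical, and it assumes the input is already a token list in which sentence-initial capitalization has been adjusted, so any remaining capitalized word is treated as part of a name.

```python
def merge_capitalized(tokens):
    """Join each run of consecutive capitalized tokens into one term."""
    merged = []
    run = []  # the current run of capitalized tokens
    for token in tokens:
        if token[:1].isupper():
            run.append(token)
            continue
        if run:
            merged.append(" ".join(run))
            run = []
        merged.append(token)
    if run:  # flush a run that reaches the end of the input
        merged.append(" ".join(run))
    return merged


print(merge_capitalized(["vaccines", "Donald", "Trump", "children", "Carson"]))
```

A lone capitalized word such as "Carson" simply passes through as a one-word term.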