Graham Wheeler's Random Forest

Stuff about stuff

Building a Zite Replacement (Part 9)

Well, I hope you’ve all brushed your teeth after all that Halloween candy. Today I’m going to show how I built a simple web server to view my feed articles using node.js and Express, along with MongoDB. I have a simple category classifier that finds the best Jaccard similarity (described earlier) to a set of category exemplars (i.e. ‘pseudo-articles’ for a category containing just keywords that are typical for that category).
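As a rough illustration of the classifier idea, here is a minimal Python sketch that scores an article's terms against each exemplar with Jaccard similarity and picks the best match. The function names and the sample data are assumptions made for this example, not the code from the post.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def best_category(article_terms, exemplars):
    """Return (title, score) for the exemplar most similar to the article."""
    scored = ((ex["title"], jaccard(article_terms, ex["terms"])) for ex in exemplars)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical exemplars in the {"title": ..., "terms": [...]} shape.
exemplars = [
    {"title": "Cooking", "terms": ["recipe", "flour", "sugar", "butter", "bake"]},
    {"title": "Literature", "terms": ["novel", "writer", "plot", "character", "author"]},
]
article = ["recipe", "butter", "oven", "bake"]
print(best_category(article, exemplars))  # ('Cooking', 0.5)
```

A real classifier would probably also want a minimum score below which an article is left uncategorized.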

Building a Zite Replacement (Part 8)

Happy Halloween, all! I’m sitting here handing out candy and glow necklaces to all comers, so it’s a good time to write a new post. It’s been a while since much happened, as I’ve been really busy with the beta release of Google Cloud Datalab, which is my day job. But now that is out, and with the weekend and lousy weather here in the Pacific Northwest, it’s been a good day to get back to things.

Node, npm and Express

Things have been slow on the blogging front but there has been progress on the Zite replacement. I’ll write more about that soon but part of what I have been doing is looking into what server-side technology to use. As far as a database goes, this seems like a no-brainer. I’m dealing with JSON documents that I can either spend some effort on normalizing to put into a SQL database, or simply keep them as is and put them in a database that supports that form, and the obvious choice then is MongoDB, which uses a binary form of JSON.

Building a Zite Replacement (Part 7)

It’s been a while since the last post but I haven’t been idle. Here are some of the things I’ve been up to: tweaking the code to parse content better; moving from an IPython notebook to a library that I can use for batch operations as well as interactive exploration; and modifying the code to do parallel fetches (or more precisely, to operate asynchronously; because of the Python GIL I still have just one thread for now).
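To make the asynchronous, single-threaded idea concrete, here is a small sketch using Python's asyncio. The `fetch` coroutine is a stand-in that only sleeps, and the names and the concurrency limit are assumptions; the library described in the post may work quite differently.

```python
import asyncio


async def fetch(url, delay=0.01):
    """Stand-in for a real download: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return url, f"<html>content of {url}</html>"


async def fetch_all(urls, limit=5):
    """Fetch all urls concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input urls
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"http://example.com/feed{i}" for i in range(10)]
results = asyncio.run(fetch_all(urls))
print(len(results), "pages fetched")
```

Everything runs on one thread; the waits simply overlap, which is what matters when the work is dominated by network latency.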

Building a Zite Replacement (Part 6)

Following on from last episode, I took some of the clusters that had clear cohesion and made some initial category exemplars. Here are the first few:

{"title": "Art", "terms": ["canvas", "painting", "pastels", "sculpture", "gallery", "photography", "landscape", "portrait", "still-life", "exhibition", "sketch"]}
{"title": "Literature", "terms": ["novel", "writer", "plot", "character", "author"]}
{"title": "Religion", "terms": ["Jesus", "Christianity", "Allah", "Islam", "Judaism", "Sufi", "Hindu", "karma", "spirituality", "faith", "belief", "priest", "pastor", "prayer"]}
{"title": "Cooking", "terms": ["ingredients", "bake", "roast", "fry", "stir", "cook", "cooking", "recipe", "flour", "sugar", "butter", "cups", "cup", "teaspoon", "tablespoons", "vanilla"]}

Note that these are deliberately in the same format as the articles in articles.

Building a Zite Replacement (Part 5)

My initial experience with clustering was somewhat disappointing. It’s clear I need to do some tuning of the approach. The first thing I did was to rerun the article download process, but instead of just keeping the top ten terms and dropping their TF-IDF values, I kept them all. I think there are better ways to select the terms to use for Jaccard similarity. For starters, using a fixed number of terms could lead to keeping a wildly different range of TF-IDF values for different articles.
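One possible alternative to a fixed term count, sketched here purely as an assumption and not as the approach the post settles on, is to keep every term whose TF-IDF score is within some fraction of the article's top score. The function name, the `fraction` parameter, and the scores are all made up for illustration.

```python
def select_terms(tfidf, fraction=0.5, max_terms=None):
    """Keep terms whose score is at least `fraction` of the article's top score."""
    if not tfidf:
        return []
    cutoff = fraction * max(tfidf.values())
    ranked = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)
    kept = [term for term, score in ranked if score >= cutoff]
    # Optionally cap the list so very flat score distributions stay manageable.
    return kept[:max_terms] if max_terms else kept


# Hypothetical TF-IDF scores for one article.
scores = {"vaccines": 0.42, "autism": 0.30, "debate": 0.22, "schedule": 0.09, "today": 0.02}
print(select_terms(scores))  # ['vaccines', 'autism', 'debate']
```

With a relative cutoff, an article with a few dominant terms keeps only those, while one with many similarly weighted terms keeps more.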

Building a Zite Replacement (Part 4)

Following my last post, I started gathering URLs of feeds to use for sample data. First I scraped the links that I had saved in Pocket (a scarily large number). It didn’t seem like Pocket had an easy way to export this, so I loaded up Pocket in Chrome, scrolled and scrolled and scrolled until I could scroll no more, then saved the resulting web page once it was done loading.

Building a Zite Replacement (Part 3)

Since yesterday’s post on term extraction, I’ve made a few tweaks. In particular I only adjust capitalization on the first words of sentences, I’m keeping numbers and hyphenation, and if there are consecutive capitalized words I turn them into single terms. For example, the terms for the Donald Trump on vaccines article have changed from:

vaccines Donald Trump children doses effective vaccinations diseases Carson debate

to:

vaccines children Donald Trump doses effective vaccinations diseases smaller vaccination debate babies autism cause schedule studies

I’m not sure why ‘Carson’ was dropped; it’s possible that the text of the article changed between the two runs.
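The consecutive-capitalized-words tweak can be sketched roughly as follows. This is a simplified illustration with an assumed function name; it ignores the sentence-initial capitalization handling mentioned above and treats any word starting with an uppercase letter as capitalized.

```python
def merge_capitalized(words):
    """Join runs of consecutive capitalized words into single terms."""
    terms, run = [], []
    for word in words:
        if word[:1].isupper():
            run.append(word)  # extend the current run of capitalized words
        else:
            if run:
                terms.append(" ".join(run))
                run = []
            terms.append(word)
    if run:  # flush a run that reaches the end of the input
        terms.append(" ".join(run))
    return terms


print(merge_capitalized(["vaccines", "Donald", "Trump", "said", "Carson"]))
# ['vaccines', 'Donald Trump', 'said', 'Carson']
```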

Building a Zite Replacement (Part 2)

In the previous post I gave an overview of what needs to be built for our Zite replacement. In this post we will look at how to load an RSS feed and generate key terms for each article. In order to fetch the feed we will make use of the feedparser package, so make sure to install that first with pip, conda, or whatever you use. Another thing we’re going to want is to strip HTML tags from the articles.
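For the tag-stripping step, one minimal option is the standard library's `html.parser`. The sketch below is an assumption about how this could be done, not necessarily what the post uses, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self.chunks).split())


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_tags("<p>Hello <b>world</b> &amp; friends</p>"))
# Hello world & friends
```

In practice the input would be something like the summary text of each entry returned by feedparser. Note that this simple version also keeps the contents of script and style elements, which a fuller implementation would want to skip.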

Building a Zite Replacement (Part 1)

The two most used apps on my phone are Zite and Pocket. Unfortunately last year Zite was bought by Flipboard and has slowly been getting worse. Recently the top sticky article on Zite has been a post on migrating your preferences to Flipboard, which suggests Zite is not long for this world. This would be okay if Flipboard were a suitable replacement, but it isn’t. It’s very flashy (which I don’t like), and just doesn’t seem to get things right when it comes to serendipitous discovery of interesting content.