Summer of Data Science 2017 - Final Update

Ok, so it’s not summer any more. My defence is that I did this work during summer but I’m only writing about it now.

To recap, I’d been working on a smart filter; a system to predict articles I’d like based on articles I’d previously found interesting. I’m calling it my rss-thingy / smart filter / information assistant1. I’m tempted to call it “theia”, short for “the information assistant” and a play on “the AI”, but it sounds too much like a Siri rip-off. Which it’s not.

Aaaanyway, I’d collected 660 interesting articles and 801 that I didn’t find interesting–fewer than expected, but I had to discard some that were too short or weren’t really articles (e.g., lists of links, or GitHub repositories). There was also some manual work to make sure none of the ‘misses’ were actually ‘hits’: I didn’t want interesting articles turning up as misses, so I skimmed through all the misses to check they weren’t coincidentally interesting (a few were). The hits and misses then went into separate folders, ready to be loaded by scikit-learn.
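The one-folder-per-class layout maps directly onto scikit-learn’s `load_files`. A minimal sketch, using a throwaway stand-in corpus (the folder names “hits” and “misses” are my placeholders here, not necessarily the ones used in the project):

```python
# Sketch: loading a hits/misses folder layout with scikit-learn's load_files.
# The folder names and documents below are invented stand-ins.
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny stand-in corpus: one folder per class, one text file per article.
root = Path(tempfile.mkdtemp())
for label, texts in {"hits": ["deep learning article"],
                     "misses": ["celebrity gossip piece"]}.items():
    folder = root / label
    folder.mkdir()
    for i, text in enumerate(texts):
        (folder / f"{i}.txt").write_text(text)

dataset = load_files(str(root), encoding="utf-8")
print(dataset.target_names)  # folder names become the class labels
```

`load_files` hands back the texts in `dataset.data` and integer labels in `dataset.target`, which is exactly the shape the vectorisers expect.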

I used scikit-learn to vectorise the documents as a tf-idf matrix, then trained a linear support vector machine and a naive Bayes classifier. Both showed reasonable precision and recall on my first attempt, but tests on new articles showed that both classifiers tended to categorise articles as misses, even when I did find them interesting. This is not particularly surprising: most articles I’m exposed to are not particularly interesting, and such simple models trained on a relatively small dataset are unlikely to be exceptionally accurate at identifying the ones that are. I spent a little time tuning the models without getting very far, and decided to take a step sideways before going further.
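The two classifiers above can be sketched as scikit-learn pipelines; the documents and labels here are invented toy data, not the real corpus:

```python
# Sketch: tf-idf vectorisation feeding a linear SVM and a naive Bayes
# classifier, as described above. The toy corpus below is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "new results in machine learning and data science",
    "a study of brain research and neural pathways",
    "celebrity fashion trends this autumn",
    "top ten holiday destinations for families",
]
labels = [1, 1, 0, 0]  # 1 = hit (interesting), 0 = miss

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
svm.fit(docs, labels)
nb.fit(docs, labels)

print(svm.predict(["an article about data science research"]))
```

Wrapping the vectoriser and classifier in one pipeline keeps the tf-idf vocabulary fitted on the training set, so new articles are transformed consistently at prediction time.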

Eventually I’ll want to group potentially interesting articles, so I wrote up a quick topic analysis of the articles I liked, comparing non-negative matrix factorization with latent Dirichlet allocation. They did a reasonable job of identifying common themes, including brain research, health research, science, technology, politics, testing, and, of course, data science.

You can see the code for this informal experiment on GitHub.

In my next experiment (now, not SoDS18!) I plan to refine the predictions by paying more attention to cleaning and pre-processing the data. And I need to brush up on tuning these models. I’ll also use the trained models to make ranked predictions rather than simple binary classifications. The dataset will be a little bigger now at around 800 interesting articles, and a few thousand not-so-interesting.
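One way to get a ranking rather than a yes/no: use the SVM’s decision_function as a score and sort by it. A sketch on invented toy data:

```python
# Sketch: ranking new articles by the linear SVM's decision_function
# score instead of taking its binary predictions. Toy data throughout.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["machine learning research", "data science methods",
              "celebrity gossip news", "sports match results"]
labels = [1, 1, 0, 0]  # 1 = hit, 0 = miss

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, labels)

new_docs = ["a post on data science", "latest sports scores"]
scores = model.decision_function(new_docs)  # higher = more hit-like
ranked = [doc for _, doc in sorted(zip(scores, new_docs), reverse=True)]
print(ranked)
```

The score is the signed distance from the decision boundary, so it gives a usable ordering even when everything falls on the “miss” side of the threshold.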

1. Given all the trouble I have naming things, I'm really glad I haven't had to do any cache-invalidation yet.


Summer of Data Science 2017 - Update 1

My dataset/corpus is coming together.

It was relatively easy to create a set of text files from the articles I’d saved to Evernote. It’s taking more time to collect a set of articles that I didn’t find interesting enough to save. I’ll make that easier in the future by automatically saving all the articles that pass through my feed reader, but for now I’m grabbing copies from CommonCrawl. This saves me the trouble of crawling dozens of different websites, but I still have to search the CommonCrawl index to pick out articles from everything else it holds for each site.

I created a list of all the sites I’d saved at least one article from, then downloaded the CommonCrawl index records for each site from the last two years. Next I filtered the records to include only pages that were likely to be articles (e.g., no ‘about’ or ‘contact’ pages). I took a random sample of up to 100 of the remaining records for each site, downloaded the corresponding WARC records, and extracted and saved each article’s text. I’ll make all the code available once I’ve polished it a little.
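The filter-and-sample step can be sketched like this. The records below are invented stand-ins for what the CommonCrawl CDX index returns (JSON lines with fields like "url", "filename", "offset", and "length"), and the URL patterns are placeholders, not the ones actually used in the project:

```python
# Sketch: filtering CommonCrawl index records down to likely articles,
# then sampling up to 100 per site. Records and patterns are invented.
import random
import re

records = [
    {"url": "https://example.com/2017/05/a-long-article-title"},
    {"url": "https://example.com/about"},
    {"url": "https://example.com/contact"},
    {"url": "https://example.com/2016/11/another-article"},
]

# Crude placeholder heuristic: skip obvious non-article pages.
NON_ARTICLE = re.compile(r"/(about|contact|tag|category|page)(/|$)")

articles = [r for r in records if not NON_ARTICLE.search(r["url"])]
sample = random.sample(articles, min(100, len(articles)))
print(len(sample))
```

Each sampled record’s filename/offset/length fields then point at the exact byte range of the WARC record to fetch, so only the needed responses get downloaded.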

The next step will be to explore the dataset a little before diving into topic analysis.


Summer of Data Science 2017

Goal: To launch* my learn’ed system for coping with the information firehose

I heard about the Summer of Data Science 2017 recently and decided to join in. I like learning by doing so I chose a personal project as the focus of my efforts to enhance my data science skills.

For the past forever I’ve been stopping and starting one side-project in particular. It’s a system that searches and filters many sources of information to provide me with articles/papers/web pages relevant to my interests. It will use NLP and machine learning to model my interests and to predict whether I’m likely to find a new article worthwhile. Like a recommender system but just for me, because I’m selfish. Something like Winds. The idea is to collect all the articles I read/skim/ignore via an RSS reader, and tag those I find interesting. And to build up a Zotero collection of papers of several degrees of quality and interest. Those tagged and untagged articles and papers will comprise my datasets. There is a lot more to this project, but that’s the core of it.

My first (mis)step was to begin building an RSS reader that could automatically gather data on my reading habits, which I could use to infer interest from my behaviour: whether I clicked through to the full article, how long I spent reading it, whether I shared it, and so on. Recently I decided that was not the best use of my time, as it would be much easier to start with explicitly tagged articles–I can start gathering those without building a new tool. So I’m doing that by saving interesting articles to Evernote. Today I have just under 900. I can use CommonCrawl to get all the articles I didn’t find interesting on the relevant sites (i.e., the articles that would have appeared in my RSS reader, but which I didn’t save).

There are many things I’ll need to do before I’m done, but all of those depend on having a dataset I can analyse. So my next step will be to turn those Evernote notes and other articles into a dataset suitable for consumption by NLP tools. Given the tools available for transforming text-based datasets from one format to another, I’m not going to spend much time choosing a particular format. I’ll start with a set of plain-text copies of each article and associated metadata, and take it from there.

I’ve been less consistent in gathering research papers. I’ve been saving the best papers I’ve stumbled across, but I could do much better by approaching it as a research project, i.e., do a literature review. That’s a huge task so I’ll focus on analysing web articles first.

*I was going to write "complete" but really, it'll always be changing and will probably never be complete. But ready for automated capture and analysis? Sure, I can make that happen.


Hello, World!

Hello, World!

For a programmer this is a mandatory thing. No apologies.