Mark Lapierre

Software, test, and automation engineer; cognitive scientist; data science aspirant; perennial student.

We're neither rock stars nor impostors

Recently, Rach Smith raised some important points about how we tend to talk about impostor syndrome:

it minimizes the impact that this experience has on people that really do suffer from it.

we’re labelling what should be considered positive personality traits - humility, an acceptance that we can’t be right all the time, a desire to know more, as a “syndrome” that we need to “deal with”, “get over” or “get past”.

If you haven’t read her post yet I highly recommend you do. The issue came up again during Rach’s chat with Dave on Developer on Fire.

I can’t truly say I’ve experienced impostor syndrome, although I suspect that’s mostly because I’ve often been in small teams where everyone was similarly skilled. For example, I was once one of two novice web developers in a product development team. We really didn’t know what we were doing. I did feel unqualified, but since there was no one more experienced to compare myself against I didn’t feel like an impostor. But I did suffer from low self-confidence and a huge pile of self-doubt. Fortunately, experience and education have helped me come to grips with the limits of my knowledge and ability. I’m sure that self-awareness has contributed to better performance independently of any increase in my skills.

It all got me thinking about my experience with how jobs are advertised and how interviews are conducted, about the pressure to elevate one’s technical skills, about the growing awareness of the importance of “soft” skills, and about the rock star culture that’s promoted in some parts of the industry.

Rach noted that even highly successful senior developers sometimes experience self-doubt and the awareness of gaps in their knowledge. This is something that is all too often missing from discussions about preparing for interviews, especially for highly sought-after positions. We’re always told to prepare extensively (good advice), and to project confidence (sure, projecting a lack of confidence is understandably unhelpful), but the highest quality advice also points out the importance of awareness of the limits of one’s skills and knowledge so that they can be appropriately managed. Much of the advice I remember from my early days suggested I should do my best to cover up my weaknesses. I don’t believe that did anything but lead to feelings of insecurity and inevitably falling apart when the limits of my knowledge were revealed. Later, I received much better advice: to be able to say “I don’t know,” and then to work through the problem aloud, asking questions to fill in the gaps until I do have enough understanding to give a reasonable answer. And isn’t that more or less how we work each day? If anyone actually had the supreme skills and confidence we’re naively advised to portray during interviews, I’m pretty sure they wouldn’t find the job challenging or interesting enough (and would likely inflict their arrogance and the consequences of their boredom on the rest of us).

Another topic too often missing from career advice, though fortunately less so these days, is the importance of soft skills. As Rach noted, “the most accomplished developers [have] constant awareness of the ‘gap’ in their knowledge and willingness to work towards closing it.” That sort of awareness is as important a soft skill as general social and communication skills. It’s a key part of metacognition. The people I’ve experienced most joy in working with are those who freely admit their limitations and strive daily towards eliminating them. That effort shows in their contributions at work that go above and beyond the explicit requirements of their role. Among the worst people to work with are those who do the minimum work required, without any awareness of the opportunities for improvement that pass them by every day. Even worse are those who perform at a similar level while believing that they are in fact contributing much more and at a much greater degree of competence¹. The latter type of person is unlikely to experience anything that might be called “impostor syndrome”, although if anyone were truly an impostor, it would be them.

Beyond a growing understanding of the importance of interpersonal soft skills, there are many other non-technical skills that make a solid team member. For example, the O*NET database shows active learning towards the top of a list of skills seen as important for a programmer². And yet typical hiring practices overwhelmingly reflect the prioritisation of immediate technical skills. I’m confident that’s a big part of the reason “rock star” developers are those seen as having the greatest skills rather than being most able to learn or improve. And yet the former doesn’t imply the latter, especially if those great skills lie in one highly specific domain; you can learn to do one thing really well without being able to generalise that skill, nor does it mean you possess other distinct but important skills. Other downsides of specialisation are a topic for another post.

Similarly, the poor attitudes and bad behaviours of some workers are accepted because of their technical skills, despite the negative impact they have on the people around them. I suspect this might be a subtle influence on feeling like an impostor; we provide a perverse incentive for people to behave in ways that no reasonable person wants to. Our industry favours those who promote themselves as the best coder, the most knowledgeable developer, the ideal technical candidate, and we (at least implicitly) discourage people from embracing their range of skills and their ability to improve.

1. The Dunning-Kruger effect in effect, so to speak.

2. Although communication skills are apparently the #1 requirement in computing-related job ads, other soft skills and transferable technical skills are far less frequently mentioned.

Summer of Data Science 2017 - Final Update

Ok, so it’s not summer any more. My defence is that I did this work during summer but I’m only writing about it now.

To recap, I’d been working on a smart filter; a system to predict articles I’d like based on articles I’d previously found interesting. I’m calling it my rss-thingy / smart filter / information assistant¹. I’m tempted to call it “theia”, short for “the information assistant” and a play on “the AI”, but it sounds too much like a Siri rip-off. Which it’s not.

Aaaanyway, I’d collected 660 interesting articles and 801 that I didn’t find interesting. That’s fewer than expected, but I had to get rid of some that were too short or weren’t articles (e.g., lists of links, or GitHub repositories). There was also a bit of manual work to make sure none of the ‘misses’ were actually ‘hits’. I.e., I didn’t want interesting articles to turn up as misses, so I skimmed through all the misses to make sure they weren’t coincidentally interesting (there were a few). The hits and misses then went into separate folders, ready to be loaded by scikit-learn.

I used scikit-learn to vectorise the documents as a tf-idf matrix, and then trained a linear support vector machine and a naive Bayes classifier. Both showed reasonable precision and recall upon my first attempt, but tests on new articles showed that the classifiers tended to categorise articles as misses, even when I did find them interesting. This is not particularly surprising; most articles I’m exposed to are not particularly interesting, and such simple models trained on a relatively small dataset are unlikely to be exceptionally accurate in identifying them. I spent a little time tuning the models without getting very far and decided to take a step sideways before going further.
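For readers who want to picture that setup, here is a minimal sketch, not the code from the experiment. It assumes scikit-learn is installed; the toy texts, labels, and variable names are invented for illustration, and the real corpus was loaded from folders rather than typed inline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented stand-ins for the real corpus: 1 = hit (interesting), 0 = miss.
train_texts = [
    "brain research reveals memory findings",
    "data science models predict health outcomes",
    "testing software with machine learning tools",
    "celebrity gossip and fashion week highlights",
    "football transfer rumours and match scores",
    "holiday shopping deals and discount coupons",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Each pipeline turns raw text into tf-idf features, then fits a classifier.
svm_clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
nb_clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
svm_clf.fit(train_texts, train_labels)
nb_clf.fit(train_texts, train_labels)

# Classify unseen articles as hit (1) or miss (0).
new_texts = ["brain research on memory", "fashion deals and celebrity gossip"]
svm_pred = svm_clf.predict(new_texts)
nb_pred = nb_clf.predict(new_texts)
print("SVM:", list(svm_pred))
print("NB: ", list(nb_pred))
```

On a real, imbalanced corpus it would be worth checking precision and recall per class (for example with `sklearn.metrics.classification_report`) on held-out articles, since a model that leans towards "miss" can still look accurate overall.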

Eventually I’ll want to group potentially interesting articles, so I wrote up a quick topic analysis of the articles I liked, comparing non-negative matrix factorization with latent Dirichlet allocation. They did a reasonable job of identifying common themes, including brain research, health research, science, technology, politics, testing, and, of course, data science.

You can see the code for this informal experiment on GitHub.

In my next experiment (now, not SoDS18!) I plan to refine the predictions by paying more attention to cleaning and pre-processing the data. And I need to brush up on tuning these models. I’ll also use the trained models to make ranked predictions rather than simple binary classifications. The dataset will be a little bigger now at around 800 interesting articles, and a few thousand not-so-interesting.
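One way ranked predictions could work, sketched under assumptions: a linear SVM exposes a signed distance from its decision boundary via `decision_function`, and sorting by that score gives an ordering instead of a yes/no label. The training texts and candidate articles below are invented, and this is only one possible approach to the plan described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training data: 1 = interesting, 0 = not interesting.
train_texts = [
    "brain research reveals memory findings",
    "data science models predict health outcomes",
    "testing software with machine learning tools",
    "celebrity gossip and fashion week highlights",
    "football transfer rumours and match scores",
    "holiday shopping deals and discount coupons",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(train_texts, train_labels)

candidates = [
    "celebrity fashion gossip",
    "brain research and data science",
    "unrelated words entirely",
]

# Higher scores mean the article sits further on the "interesting" side.
scores = model.decision_function(candidates)
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for text, score in ranked:
    print(f"{score:+.3f}  {text}")
```

The scores are not probabilities; they are only meaningful relative to each other, which is enough for ordering a reading list.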

1. Given all the trouble I have naming things, I'm really glad I haven't had to do any cache-invalidation yet.

Summer of Data Science 2017 - Update 1

My dataset/corpus is coming together.

It was relatively easy to create a set of text files from the articles I’d saved to Evernote. It’s taking more time to collect a set of articles that I didn’t find interesting enough to save. I’ll make that easier in the future by automatically saving all the articles that pass through my feed reader, but for now I’m grabbing copies from CommonCrawl. This saves me the trouble of crawling dozens of different websites, but I still have to search the CommonCrawl index to find articles among everything else in the index from each site.

I created a list of all the sites I’d saved at least one article from, then I downloaded the CommonCrawl index records for each site from the last two years. Next I filtered the records to include only pages that were likely to be articles (e.g., no ‘about’ or ‘contact’ pages, etc.). I took a random sample of up to 100 of the records remaining for each site and downloaded the WARC records, and then extracted and saved each article’s text. I’ll make all the code available once I’ve polished it a little.
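The filter-and-sample step can be sketched with the standard library alone. This is an illustration, not the code I ran: it assumes each index record is a dict with a `url` key, and the `NON_ARTICLE_HINTS` list and function names are hypothetical. Downloading and parsing the WARC records is left out.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical path fragments suggesting a page is not an article.
NON_ARTICLE_HINTS = ("/about", "/contact", "/tag/", "/category/", "/author/")

def looks_like_article(url):
    """Crude heuristic: reject site roots and known non-article paths."""
    path = urlparse(url).path.lower()
    if path in ("", "/"):
        return False
    return not any(hint in path for hint in NON_ARTICLE_HINTS)

def sample_per_site(records, max_per_site=100, seed=0):
    """Group likely-article records by site, then sample up to max_per_site."""
    by_site = defaultdict(list)
    for record in records:
        if looks_like_article(record["url"]):
            by_site[urlparse(record["url"]).netloc].append(record)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        site: rng.sample(recs, min(max_per_site, len(recs)))
        for site, recs in by_site.items()
    }

# Example with made-up records.
records = [{"url": f"https://example.com/posts/{i}"} for i in range(150)]
records += [
    {"url": "https://example.com/about"},
    {"url": "https://example.org/contact"},
    {"url": "https://example.org/posts/1"},
]
sampled = sample_per_site(records)
print({site: len(recs) for site, recs in sampled.items()})
```

Substring hints like these will misfire on some URLs (an article whose slug starts with "about", say), so a real filter would need per-site tuning.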

The next step will be to explore the dataset a little before diving into topic analysis.

Summer of Data Science 2017

Goal: To launch* my learn’ed system for coping with the information firehose

I heard about the Summer of Data Science 2017 recently and decided to join in. I like learning by doing so I chose a personal project as the focus of my efforts to enhance my data science skills.

For the past forever I’ve been stopping and starting one side-project in particular. It’s a system that searches and filters many sources of information to provide me with articles/papers/web pages relevant to my interests. It will use NLP and machine learning to model my interests and to predict whether I’m likely to find a new article worthwhile. Like a recommender system but just for me, because I’m selfish. Something like Winds. The idea is to collect all the articles I read/skim/ignore via an RSS reader, and tag those I find interesting. And to build up a Zotero collection of papers of several degrees of quality and interest. Those tagged and untagged articles and papers will comprise my datasets. There is a lot more to this project, but that’s the core of it.

My first (mis)step was to begin building an RSS reader that could automatically gather data on my reading habits that I could use to infer interest based on my behaviour; whether I clicked a link to the full article, how long I spent reading an article, whether I shared it, etc. Recently I decided that was not the best use of my time, as it would be much easier to start with explicitly tagged articles, which I can start gathering without creating a new tool. So I’m doing that by saving interesting articles to Evernote. Today I have just under 900. I can use CommonCrawl to get all the articles I didn’t find interesting on the relevant sites (i.e., the articles that would have appeared in my RSS reader, but which I didn’t save).

There are many things I’ll need to do before I’m done, but all of those depend on having a dataset I can analyse. So my next step will be to turn those Evernote notes and other articles into a dataset suitable for consumption by NLP tools. Given the tools available for transforming text-based datasets from one format to another, I’m not going to spend much time choosing a particular format. I’ll start with a set of plain-text copies of each article and associated metadata, and take it from there.

I’ve been less consistent in gathering research papers. I’ve been saving the best papers I’ve stumbled across, but I could do much better by approaching it as a research project, i.e., do a literature review. That’s a huge task so I’ll focus on analysing web articles first.

*I was going to write "complete" but really, it'll always be changing and will probably never be complete. But ready for automated capture and analysis? Sure, I can make that happen.

Hello, World!

For a programmer this is a mandatory thing. No apologies.
