Is your A/B testing effort just chasing statistical ghosts?

@matseinarsen, June 17th, 2012

I've always felt that the idea of repeated significance testing error and false positive rates is a bit of a pedantic academic exercise. And I'm not the only one: some A/B frameworks let you automatically stop or conclude at the moment of significance, and there's blessed little discussion of false positive rates online. For anyone running A/B tests there's also little incentive to control your false positives. Why make it harder for yourself to show successful changes, just to meet some standard no one cares about anyway?

It's not that easy, because it actually matters, and matters a lot if you care about your A/B experiments, not least about what you learn from them. Evan Miller has written a thorough article on the subject in How Not To Run An A/B Test, but it's a bit too technical to illustrate the effect vividly. To demonstrate how much it matters, I've run a simulation of how much impact you should expect repeated testing errors to have on your success rate.

Here’s how the simulation works:

  • It runs 1,000 experiments, each with 200,000 fake participants divided randomly between two experiment variants.
  • The conversion rate is 3% in both variants, so any "significant" difference is by definition a false positive.
  • Each individual "participant" is randomly assigned to a variant and to the "hit" or "miss" group based on the conversion rate.
  • After each participant, a g-test significance test is run to check whether the distribution differs between the two variants.
  • I then count every experiment that hit significance at the 90% and 95% levels at any point during its run.
  • Since the g-test doesn't behave well with low counts, I didn't check significance during the first 1,000 participants of each experiment.
  • You can download the script and alter the variables to fit your metrics.
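The original script isn't reproduced here, but the procedure above can be sketched in Python. This is a scaled-down re-implementation, not the downloadable script: 100 experiments of 20,000 participants, peeking every 100 participants rather than after every single one, so it runs in seconds. The thresholds are the chi-square critical values for one degree of freedom, against which the G statistic is compared.

```python
import math
import random

# Chi-square critical values for 1 degree of freedom (90% and 95%).
CRIT_90, CRIT_95 = 2.706, 3.841

def g_statistic(a_hit, a_miss, b_hit, b_miss):
    """G-test statistic for a 2x2 contingency table."""
    total = a_hit + a_miss + b_hit + b_miss
    rows = (a_hit + a_miss, b_hit + b_miss)   # per-variant totals
    cols = (a_hit + b_hit, a_miss + b_miss)   # hit/miss totals
    g = 0.0
    for obs, r, c in ((a_hit, 0, 0), (a_miss, 0, 1),
                      (b_hit, 1, 0), (b_miss, 1, 1)):
        expected = rows[r] * cols[c] / total
        if obs > 0:
            g += obs * math.log(obs / expected)
    return 2.0 * g

def run_experiment(n=20_000, rate=0.03, warmup=1_000, check_every=100):
    """Two identical variants: any significance hit is a false positive."""
    hits = {"A": 0, "B": 0}
    misses = {"A": 0, "B": 0}
    ever_90 = ever_95 = False
    for i in range(1, n + 1):
        v = random.choice("AB")
        if random.random() < rate:
            hits[v] += 1
        else:
            misses[v] += 1
        if i >= warmup and i % check_every == 0:   # repeated peeking
            g = g_statistic(hits["A"], misses["A"], hits["B"], misses["B"])
            ever_90 = ever_90 or g > CRIT_90
            ever_95 = ever_95 or g > CRIT_95
    final_g = g_statistic(hits["A"], misses["A"], hits["B"], misses["B"])
    return ever_90, ever_95, final_g > CRIT_90, final_g > CRIT_95

random.seed(2012)
results = [run_experiment() for _ in range(100)]
print("ever significant at 90%:", sum(r[0] for r in results), "/ 100")
print("ever significant at 95%:", sum(r[1] for r in results), "/ 100")
print("significant at the end, 90%:", sum(r[2] for r in results), "/ 100")
print("significant at the end, 95%:", sum(r[3] for r in results), "/ 100")
```

Even at this reduced peeking frequency, the "ever significant" counts come out far above the end-of-experiment counts, which stay near the nominal 10% and 5% error rates.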

So what's the outcome? Keep in mind that these are 1,000 controlled experiments where it's known that there is no difference between the variants.

  • 771 experiments out of 1,000 reached 90% significance at some point
  • 531 experiments out of 1,000 reached 95% significance at some point

This means that if you've run 1,000 experiments and didn't control for repeated testing error in any way, false positives alone might account for a rate of successful experiments of up to 25%. And you'll see a temporarily significant effect in around half of your experiments!

Fortunately, there's an easy fix: select your sample size or decision point in advance, and make your decision only then. These are the false positive rates when the decision is made only at the end of each experiment:

  • 100 experiments out of 1,000 were significant at 90%
  • 51 experiments out of 1,000 were significant at 95%

So you still get a false positive rate you shouldn't ignore, but nowhere near as serious as when you don't control for repeated testing. And this is what you should expect when running at these significance levels: a 95% level means a 5% false positive rate when there is no real difference, which is exactly what the simulation shows. At this point you can talk about real hypothesis testing.
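Picking that decision point in advance doesn't have to be guesswork. A standard way to fix the sample size up front is a two-proportion power calculation; this helper is a hypothetical sketch (not from the original post or script), assuming a 3% baseline conversion rate and a 10% relative lift you want to be able to detect:

```python
import math

def sample_size_per_variant(p_base, p_treat, z_alpha=1.96, z_beta=0.84):
    """Participants needed per variant for a two-proportion z-test.

    z_alpha=1.96 -> 5% two-sided false positive rate,
    z_beta=0.84  -> 80% power (chance of detecting a real effect).
    """
    p_bar = (p_base + p_treat) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_base * (1 - p_base)
                                      + p_treat * (1 - p_treat))) ** 2
    return math.ceil(numerator / (p_base - p_treat) ** 2)

# Detecting a lift from 3.0% to 3.3% conversion takes roughly
# 50,000+ participants per variant:
print(sample_size_per_variant(0.03, 0.033))
```

Run the calculation, commit to the number, and only look at significance once that many participants are in. Small effects on a 3% base rate need surprisingly large samples, which is exactly why peeking early is so tempting.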



Amazon recommendations

@matseinarsen, October 30th, 2011

It's almost 10 years old, but this is an excellent article from Greg Linden, Brent Smith and Jeremy York on how Amazon does product recommendations: Amazon.com Recommendations: Item-to-Item Collaborative Filtering.

Going to OSCON?

@matseinarsen, July 24th, 2011

Interested in discussing psychology and software development? I'm at OSCON in Portland, Oregon all this week, and I would be really interested to chat with others who share an interest in psychology.

I'm mainly at the conference to help with hiring, so come and ask for me at the stand in the Expo hall.

And if you're interested in any of the many positions we're looking to fill, do drop by as well, of course! Have a look at our available openings at the jobs portal. We're still trying to get hold of many, many experienced Perl developers, and we're also willing to teach Perl to highly experienced developers coming from other languages.

What makes a superstar developer?

@matseinarsen, June 22nd, 2011

A funny discussion is going on at the HBR Blogs: management-type blogger Bill Taylor suggests our culture wrongly celebrates superstars at the expense of well-functioning teams, and claims great people are overrated (via Igor Sutton). To illustrate his point, he uses software engineers as an example. Cue an outpouring of frustration: Bill is getting hammered in the comments section.

So what’s the problem?

First, try to look past the fact that Bill skipped 30 years of research and experience in software engineering and stomps into the field like a PHB cliché, seemingly assuming his opinion is as valid as any research on the subject. That alone probably triggered a defensive reaction in any software developer who accidentally stumbled into the Harvard Business Review blog section.

The main misunderstanding is the assumption that there is only one type of talent. Both Taylor and a fair number of the commenters make this mistake and apply their experience with basic bell-curve-measured skills to a type of talent that is ruled by other laws. Nassim Nicholas Taleb discusses this distinction in great detail in The Black Swan: the environments of "Mediocristan" and "Extremistan". The former you can understand using the bell curve and the Gaussian distribution; in the latter, differences are of an order of magnitude and are qualitative or disruptive.

For example, in most manual labour or production-line work, a practitioner can get good or even excellent, but mostly within a range that can be measured safely with standard deviations, and the difference between average and excellent performance lies not in its nature but in its output. Here, throwing more people at the problem can solve it: maybe your top salesman makes ten sales a week while your average one makes four. Well, throw in three average salesmen and you're just as good. That's not necessarily good economics or a good idea, but it can get you where you want to go.

In contrast, in the world of "Extremistan", a difference in skill can be of such magnitude that it makes a qualitative difference. In the comments section of Taylor's article, someone asks, "Would you want a Shakespeare or 100 Bill Taylors?", along with countless variations on the theme. And software engineering is that sort of talent. A "superstar developer" doesn't necessarily a programming Shakespeare make, but he or she can make something a less qualified individual never could, or do it so fast it makes the difference between staying in the competition or not. Or just connect the dots and save the day.

Throw in the special problem of software engineering that putting one more person on a task tends to double the time it takes to solve it, and the effect is even larger.

But then the Bill Taylors of the world take their experience with the "Mediocristan" type of talent and apply it to the very different world of software engineering. It comes out, as is well pointed out in the comments section, as commoditization of something that cannot be commoditized. Software development can't be meaningfully reduced to the number of lines written per day. Even business people who really, really want to think about it that way will still be wrong. This has been discussed and demonstrated countless times; the classic "The Mythical Man-Month" explains why you can't just think of your software engineers as burger flippers.

The vitriol the original blog post received comes from every software engineer's experience with clueless managers who approach development the Bill Taylor way. Today's successful companies are run differently, by the Zuckerbergs and Steve Jobses who are playing in the winner-takes-all world of the Internet. It doesn't take that many people, as long as you have the right ones: Facebook has 2,000 employees serving over 600 million users.

So what makes a superstar developer?

I believe in the old truism that the difference between developers can be of a 100x magnitude. I actually think the Net Negative Producing Programmer is a reality, so accordingly a top developer is infinitely better than the worst... However, unless you think superstars have their skills handed down to them by divine intervention, something must have brought them there. Here are some points I believe make up the superstar:

  • It's knowing the codebase well. The superstar is often the person who knows the codebase down to the last semicolon. It's not about being the smartest person in the room, but about knowing exactly where to hack in that little change, and that little change, and that little change, and that little change... before lunch.
  • It's domain knowledge. The superstar also knows the content of what he is coding really well. If you're making a chess bot, the developer who is also a chess grandmaster knows all the elements, the edge cases and the purpose of what he is trying to do. He will have a working prototype running while the developer without any chess background is still trying to understand all the implicit assumptions in the instructions he got from his Scrum Product Owner. I think this one is essential, and so often overlooked. Knowing the thing you work with, and working with what you find interesting, not only makes development a whole other ballgame, it also does wonders for motivation.
  • It’s situational. You can’t throw Linus Torvalds, Larry Wall or Bill Joy into your hack app shop and expect to have the next Angry Birds in a month. You need the right person and the right setting.
  • It's knowing programming well. The superstar developer doesn't have to be a superstar in the inner workings of his programming language, but he knows it well enough that it doesn't get in the way of reaching his ultimate goal.
  • It's practice, practice, practice. Yeah, exactly! No one is born into superstardom. This is written about a lot, and Malcolm Gladwell's thesis of the 10,000-hours-of-practice rule really applies to software engineering too. One complicating factor here is, as I wrote about in accelerating your Perl learning, that most developers are on a lifelong learning mission anyway. They (we...) always look to learn something new, so what sets the superstar apart from the average? It's not that easy to pick out, but in his writing about the 10,000 hours of practice, Gladwell touches upon the differences between training and learning. It's a large field of research and I hope to post more about it later.
  • It's a high level of intelligence, but not necessarily an extreme level. A certain level of numeracy and the ability to think in abstractions is necessary. Maybe at some point we'll find that super developers can hold more variables in working memory at the same time, or something along those lines, and that will turn out to explain the differences. But I have yet to see any research like that.

And it's motivation... and the right management... but these are things external to the superstar, things one can move around to find. The brain finds its ideal setting.

So there, that's what sets the superstars apart from the average. I'd love to hear others' take on it.

The sad thing is…

...that the reality is that a well-functioning team can usually match the lone superstar, and even has its own advantages. It's actually a good message: let's not just celebrate these few people, who are carried forward on the shoulders of those doing the daily grind of facilitating and cleaning up the mess; we can achieve great things just by working together. It's just presented so boneheadedly wrong.

Finally, Bill followed up with a clarification of sorts, basically arguing that IBM is made up of average people and look how long they have lasted, while Enron was full of superstars and look how they crashed. I wonder what IBM thinks of his underlying assumption there.

Can you detect user emotion with only mouse movements?

@matseinarsen, June 7th, 2011

Trying to learn more about how emotion affects e-commerce, I came across the book "eMotion: Estimation of User's Emotional State by Mouse Motions" by Wolfgang Maehr. Basically, Maehr found that you can correlate certain types of mouse movements with emotional states. Specifically, he found that mouse acceleration, deceleration, speed and uniformity could predict arousal, disgust/delight, and anger/contentedness, all in a sample of 39 participants.

But... how is this not available to me in a handy JavaScript library? I'm just dreaming of reading off the emotional state of website visitors per page. Or per blog post, for that matter...

If you know of anyone who has made any implementation of something like this, please please leave a comment!


Full research paper with numbers here: eMotion: Estimation of the User's Emotional State by Mouse Motions.


List of lists of cognitive biases

@matseinarsen, June 5th, 2011

I just want to share these very cool lists of cognitive biases. It's so useful to have an overview of them at hand, and obviously I'm not the only one who thinks so, as there are several useful collections out there:

For the uninitiated, cognitive biases are identified tendencies in human decision making, or as Wikipedia defines it, "a pattern of deviation in judgment that occurs in particular situations".


Programming well with others: Social Skills for Geeks

@matseinarsen, May 20th, 2011


Long time no posting, but this just had to go on here.

Six steps to excellence

@matseinarsen, August 29th, 2010

Tony Schwartz/Harvard Business Review has an interesting bullet point list of what is necessary to excel in any field: Six Keys to Being Excellent at Anything

It’s based on Anders Ericsson’s work in the field, and holds as well for computer programmers as practitioners in any other field.

See also: Accelerate your Perl learning

Psychology talk from YAPC::NA 2010 online

@matseinarsen, August 29th, 2010

The video recording of my talk from YAPC::NA on the Psychology of Perl is online. It has a very funny beginning when Tatsuhiko Miyagawa walks into the room to a standing ovation just as I start my talk, which looks really weird in the video. Still, it made for a fun start to the talk...

I have to admit I haven't watched the whole video myself, but word is that people liked it, which is motivating for putting together a larger, more detailed talk for a smaller, interested audience, rather than a quick overview for a generally less-than-interested one.

Psychology of Perl talk links

@matseinarsen, June 24th, 2010

Wow, I managed to sneak in a lightning talk about the Psychology of Programming, with a Perl twist, at the YAPC::NA 2010 conference. Very fun – it was my first ever conference talk, and I could certainly work a bit on the style, but it got some people thinking and talking, and that’s a great response.

Someone requested that I post the slides so he could get the URLs I referenced. I think there were too many copyrighted images in the slides for me to put them online, but I'll post the links for reference:

Working memory limitations: Oberauer & Risse (2010), Selection of objects and tasks in working memory, The Quarterly Journal of Experimental Psychology, vol 63 (4), 784-804.

Object Factory Pattern: Update on the Natural Programming Project

Data-driven programming: The Evidence Based Software Engineering database

Also, after my talk someone notified me about the interesting blog Psychology of Video Games

And finally: A million thanks to the people who gave me feedback on the talk!