
I think it is too early to judge. I've never done A/B testing, but I think the point is to do it continuously, not to run it for a week and then decide it is stupid.

Let's say Patrick were to continue this for 9 more weeks; so far we've seen 10% of the whole test. Let me illustrate an edge case: after 10 weeks, the new site has made 134 sales and the old one 125. 13/13 after 1 week is consistent with that, but 134/125 is more than a 7% increase.



Imagine a PHB-engineer conversation something like "What are you working on?" "The jQuery plugin we're using doesn't have full support for the latest Chrome nightlies, so I'm trying to write some JavaScript to achieve the same effect." "Why are you wasting time? I don't know all that much about jQuery, but you can use an iPad to do it."

We can all appreciate that that sort of conversation is pretty exasperating for the engineer, but if he were a nice guy, he'd want to try to explain at least enough engineering to the boss such that the boss understood why "Use an iPad" is not a compelling option there. But he might be disinclined to start that conversation at 2:40 in the morning because it would take a while.

It is 2:40 AM in Japan and improving your understanding of A/B testing would take a while. There exist many comprehensible beginner's guides to it on the Internet. If after reading them you still don't understand why you need two more pieces of data in your hypothetical and why it is not extraordinarily likely that the 7% increase you measure in it reflects an actual change in user behavior, I will be happy to explain it to you some day when it is not 2:40 AM.
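
For what it's worth, the two extra pieces of data are presumably the visitor counts behind each version; 134 vs. 125 sales tells you almost nothing without the denominators. Here is a minimal sketch of a standard two-proportion z-test on that hypothetical, with entirely made-up traffic figures, just to show the shape of the calculation:

    from math import sqrt
    from statistics import NormalDist

    # Hypothetical visitor counts: the two numbers the raw sale
    # tallies (134 vs. 125) don't include.
    visitors_a, sales_a = 10_000, 125   # old design (assumed traffic)
    visitors_b, sales_b = 10_000, 134   # new design (assumed traffic)

    rate_a = sales_a / visitors_a
    rate_b = sales_b / visitors_b

    # Pooled conversion rate under the null hypothesis of "no real difference".
    pooled = (sales_a + sales_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    print(f"z = {z:.2f}, p = {p_value:.2f}")
    # With 10,000 visitors per arm this prints a p-value of roughly 0.57,
    # so the apparent 7% lift is entirely consistent with chance.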


Some background:

A/B testing is founded on statistics. You take Option A and Option B and see which one achieves Goal C more often.

But you can't just look at the percentage difference and decide that Option B must be better! Look, it achieves Goal C at a higher rate! But that could be due to chance, so A/B tests employ tests of statistical significance to determine whether the results are _probably_ chance or _probably_ reflect a genuine causal increase in Goal C.

For example, if you flip a coin four times and get heads three of those times, without a statistical significance test you might conclude that heads is 3x as likely to appear as tails. We know that's wrong, though: each side of a fair coin has a 50% chance of landing face-up on each flip.

The flaw in this experiment is that we tried to extrapolate a result from a very small set of data. A statistical significance test would take these results and say "we have only a low confidence level (well short of the usual 95%) that heads is genuinely more likely to come up, rather than just coming up more often by chance".

If we flipped the coin 10,000 times instead, you'd get something pretty close to 50% heads and 50% tails, and your significance test would return a high confidence level that those numbers are accurate.
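
To put numbers on those two scenarios, here is a minimal sketch of the exact binomial calculation, using only the Python standard library; the 51% figure in the second check is made up, to show how a much smaller deviation becomes detectable once you have 10,000 flips:

    from math import comb

    def prob_at_least(k, n):
        """Exact chance of seeing k or more heads in n flips of a fair coin."""
        c = comb(n, k)                    # ways to get exactly k heads
        total = 0
        for i in range(k, n + 1):
            total += c
            c = c * (n - i) // (i + 1)    # comb(n, i + 1) from comb(n, i)
        return total / 2 ** n

    # 3 heads out of 4 flips: a fair coin does this about 31% of the time,
    # so the result says nothing about the coin being biased.
    print(prob_at_least(3, 4))            # 0.3125

    # 5,100 heads out of 10,000 flips is only a 51% rate, yet a fair coin
    # produces a deviation that large only about 2% of the time.
    print(prob_at_least(5100, 10000))     # roughly 0.023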

Short story long, you need lots of datapoints to determine whether an A/B test result is chance or an actual difference, and the smaller the difference between how Option A and Option B perform, the more datapoints you need to be confident they're actually different. Patrick's numbers are so close together that he'd need far more than 300 sales to reach the gold-standard 95% confidence level that there's actually a difference.
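
As a rough illustration of the sample sizes involved, here is a sketch of the standard sample-size formula for comparing two conversion rates; the 2% baseline and 10% relative lift are made-up figures, since we don't know Patrick's actual traffic or conversion rate:

    from statistics import NormalDist

    def visitors_per_arm(p1, p2, alpha=0.05, power=0.8):
        """Approximate visitors needed in each arm to detect a shift in
        conversion rate from p1 to p2 (two-sided test at level alpha)."""
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_power = NormalDist().inv_cdf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

    # Hypothetical: 2% baseline conversion rate, 10% relative lift.
    n = visitors_per_arm(0.02, 0.022)
    print(round(n))              # about 80,000 visitors per arm
    print(round(n * 0.02))       # on the order of 1,600 sales per arm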


A test of statistical significance can answer that question easily - I think most optimizers use it.


As well as the issue that such low numbers will only reveal a very strong effect at this point, there's also the assumption that the redesign will act immediately.

Some products don't have a "search, find and purchase immediately" sales pattern, especially once you move out of the B2C market.

For some businesses, sales can look more like "Visit half a dozen different sites. Go away for a week and think. Visit the best sites again. Go away for a few days and come to a decision. Visit the final option, browse and purchase".

Tracking these multiple visits can be non-trivial or impossible, since different people and different browsers may be visiting the site at the different stages. It also means long lead times before the effects of design changes show up.


With only 13 purchases on each side of the test, a test of statistical significance is only going to pick out very strong effects: ones that change conversion rates by well over 10%, which is already a huge change by A/B-testing standards.

For more subtle effects, you need more observations, plain and simple; it doesn't matter how you do the math if you don't have the data to reach the significance levels you care about.
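
To see how little detection power that buys, here is a small Monte Carlo sketch; the traffic and conversion figures are hypothetical, chosen so that each arm averages about 13 sales:

    import random
    from math import sqrt
    from statistics import NormalDist

    def significant(sales_a, n_a, sales_b, n_b, alpha=0.05):
        """Two-sided two-proportion z-test at significance level alpha."""
        pooled = (sales_a + sales_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        if se == 0:
            return False
        z = (sales_b / n_b - sales_a / n_a) / se
        return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

    # Hypothetical: 1,000 visitors per arm, a 1.3% baseline conversion rate
    # (about 13 sales per arm) and a genuine 10% lift on the B side.
    random.seed(0)
    trials = 2000
    detected = sum(
        significant(
            sum(random.random() < 0.013 for _ in range(1000)), 1000,
            sum(random.random() < 0.0143 for _ in range(1000)), 1000,
        )
        for _ in range(trials)
    )
    print(f"detected the lift in {detected / trials:.0%} of simulated tests")
    # Comes out in the single digits: a real 10% improvement is almost
    # never distinguishable from noise at this scale.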



