Big Data at Khan Academy

t1m · on Oct 11, 2013

It's interesting work, but it's not really 'big data'.

"Every day, we collect around 8 million data points on exercise and video interactions, and a few million more around community discussion, computer science programs, and registrations. Not to mention the raw web request logs, and some client-side events we send to MixPanel."

OK - 8 million records per day. Let's double that for the argument's sake.

Even if they were fairly fat records (1 KB), that's only 16 GB / day. That makes it around 2 months / TB.

I can easily put together a machine with 20TB of storage and run a traditional free relational DB (or even a single free node of Greenplum) and store more than 3 years of this data.

Then bang against it with SQL. Transactions are free.
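The estimate above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the record count and record size are the figures assumed in this comment (not measured values), decimal units are assumed (1 KB = 1,000 bytes), and the function names are made up for illustration.

```python
# Back-of-envelope storage estimate using the figures assumed in the comment.
RECORDS_PER_DAY = 16_000_000   # 8 million per day, doubled for the argument's sake
BYTES_PER_RECORD = 1_000       # "fairly fat" 1 KB records (decimal units assumed)
DISK_BYTES = 20 * 10**12       # the hypothetical 20 TB machine


def bytes_per_day(records=RECORDS_PER_DAY, size=BYTES_PER_RECORD):
    """Raw bytes collected per day, ignoring indexes and compression."""
    return records * size


def days_to_fill(capacity_bytes, daily_bytes):
    """How many days of data fit into the given capacity."""
    return capacity_bytes / daily_bytes


daily = bytes_per_day()
print(f"{daily / 10**9:.0f} GB/day")
print(f"{days_to_fill(10**12, daily):.1f} days per TB")
print(f"{days_to_fill(DISK_BYTES, daily) / 365:.2f} years on 20 TB")
```

Index overhead, replication, and the raw web logs would eat into that margin, so treat the result as an upper bound on retention rather than a sizing plan.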

yeukhon · on Oct 11, 2013

Big data doesn't mean you need 100TB per month. It simply means you have so much data that you cannot just read through all of it and analyze it without more durable methods of computation. And 8 million per day is a lot.

The real question is, out of those records they have collected, how much useful data can they extract, and what exactly can they extract out of that data set besides just who visited from where, etc.

nl · on Oct 11, 2013

Interesting.

There's a whole emerging field called "learning analytics", which at the moment appears to be more a theoretically good idea than anything with practical outcomes (sadly, much in education is like this - something will emerge in the technology field, and then 6 months later there will be an XXX-in-education movement) - although Khan Academy is in a good position to get that data and use it.

But for those of you who have kids who do Kumon Math (or similar) it's pretty easy to see how analytics could speed up the Kumon process (of selecting questions that exercise very specific skills).

For those interested there is an upcoming "Big Data in Education" Coursera course[1] that I'm planning on doing. It will be my first Coursera experience, so I'm not quite sure what to expect. I'm in the fortunate position of having access to a fairly significant amount of educational usage data, so I'm hoping it will be useful.

[1] https://www.coursera.org/course/bigdata-edu

ZeroGravitas · on Oct 11, 2013

Only 6 months later? I'd pegged it around 3-5 years.

Your Kumon math suggestion is exactly what Khan Academy is doing as their first big usage of this stuff. See the links I posted elsewhere for more info from them.

I think the big difference is that Khan Academy is building their whole approach around the analytics, rather than vice versa. There's plenty of "free" info you can collect from web or online courses or question banks, but I've not seen any real concept for feeding that back into the course construction.

For example, at one "learning analytics" thing I attended it was discovered that the really big indicator of whether a student was going to pass or fail was whether they did well on exams. This fact though was buried so deep behind a super simple traffic light dashboard that no user would have ever been able to figure it out and do something useful with that info, like for example change the course to have more and earlier testing, which is easier than ever with modern technology.

For whatever reason, just testing students' knowledge isn't considered quite as exciting as trying to predict their success by applying dizzyingly complex math to the trail they leave on the web.

grovulent · on Oct 11, 2013

The quality of a course really depends on who is running it and how enthusiastic they are. So I would recommend keeping an open mind about Coursera until you've taken a few different courses.

Also depends on the students that take it. Sometimes you get incredibly helpful people taking the course - that seem to know everything and beyond - and don't mind taking the time giving really detailed explanations on the forums.

I'm doing the calculus course atm - and enjoying it a lot.

alexdean · on Oct 11, 2013

Isn't this a flawed approach? It seems like Khan Academy is trying to re-construct a record of behaviours across their business by stitching together:

1. Parsing web logs for web page views and API accesses

2. Exporting "some client-side events" from MixPanel

3. Mining their transactional databases for state changes

On #1 - web caching and client-side events have long invalidated web log based analytics approaches. How is Khan different?

On #3 - this is reverse engineering your user behaviours by mining state changes in your transactional systems. This is typically a ton of work, it breaks when you change your data models, and your operational systems aren't designed to reveal user behaviours anyway.

Has Khan Academy explored alternative approaches? Typically: defining with the analyst team a set of events you want to monitor, making sure all of your systems (client-side, mobile, server-side, whatever) emit immutable streams of these events, and then collecting, storing, enriching, analyzing at your leisure.
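As a rough illustration of the event-stream approach described above, here is a minimal Python sketch. Everything in it is hypothetical: the `Event` fields, the event names, and the in-memory list standing in for a real collector are assumptions for the example, not anything Khan Academy is known to use.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Event:
    name: str        # hypothetical event name agreed on with the analyst team
    user_id: str
    source: str      # which system emitted it, e.g. "web", "mobile", "server"
    properties: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def emit(event, sink):
    """Append one event to the sink as a JSON line; nothing is updated or deleted."""
    sink.append(json.dumps(asdict(event), sort_keys=True))


# A plain list stands in for the real collector (a queue, a log file, etc.).
log = []
emit(Event("exercise_answered", "u1", "web",
           {"exercise": "fractions_1", "correct": True}), log)
emit(Event("video_watched", "u1", "mobile",
           {"video": "intro_to_fractions", "seconds": 42}), log)

# "Analyzing at your leisure": count events by name from the stored stream.
counts = {}
for line in log:
    record = json.loads(line)
    counts[record["name"]] = counts.get(record["name"], 0) + 1
print(counts)
```

Note that `frozen=True` only blocks attribute reassignment; the `properties` dict itself is still mutable, so a stricter version would copy or freeze it at construction time. The point of the sketch is that every system emits the same append-only record shape, so analysis no longer depends on the transactional data model.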

noelwelsh · on Oct 11, 2013

This was a nice read but I'm much more interested to know what they do with the data. From hanging around "big data" people the emphasis still seems to be on storage and simple SQL-esque querying. For most people this is a solved problem, and it's time to go beyond storage and see what value we can get from data. I believe in most cases this requires a different skill set and different mindset. Most people think in binary terms, but statistical models deal with shades of grey -- nothing is ever certain -- and even simple models like linear regression are difficult for the untrained to understand.

ZeroGravitas · on Oct 11, 2013

A couple of older posts about what they're doing:

How Khan Academy is using Machine Learning to Assess Student Mastery

http://david-hu.com/2011/11/02/how-khan-academy-is-using-mac...

Khan Academy: Machine Learning → Measurable Learning

http://derandomized.com/post/51729670543/khan-academy-machin...

I think the latest Dashboard design that they've just (soft-?) launched is based on this work. When you first start it asks you about 10 questions and based on your performance on that short "Maths pre-test" guesses what you know and what you don't across the whole of Maths.

scorpion032 · on Oct 11, 2013

Do you know anybody else that uses Google App Engine?