Show HN: Statwing is statistical analysis, simplified

[+] taliesinb|13 years ago|reply

I'm glad to see other people working on this problem. We (Wolfram|Alpha) are doing this too, starting out by making the 'easy cases' nearly automatic.

Here's a blog-post describing our effort: http://blog.wolframalpha.com/2012/02/09/launching-a-democrat...

You can also play around with our examples without having to sign up to Wolfram|Alpha Pro. My favorite is an automatic analysis of the Titanic data that nicely illustrates that while the motto "women and children first" applied, being rich certainly helped: http://www.wolframalpha.com/input/?i=+&examplefile=1&...

We cover other kinds of simple analysis and visualization too, like heat maps, Venn diagrams, graphs, and so on. As always, feedback welcome.

[+] glaugh|13 years ago|reply

Agreed, glad to be a part of the community of folks trying to democratize data analysis. It feels like an important problem to work on, and we're passionate about it (as I'm sure you are, too).

Thanks for chiming in.

[+] anonDataUser|13 years ago|reply

This is great. Do you have any solution for highly sensitive data?

[+] aphyr|13 years ago|reply

The walkthrough took me through finding a correlation between voting preference and neuroticism. Great! But it's also worth noting that this dataset shows larger effect sizes at similar CIs for the correlation between [preference and age] and [age and neuroticism]. This, folks, is why ANOVA is important.

That aside, the product was clear, fast, and intuitive. Well-chosen visualizations and a clean emphasis on the important moments for basic covariate analysis. Well done.

[+] glaugh|13 years ago|reply

Nice, well pointed out.

Unfortunately, we don't have regressions yet. But just to make sure these findings were still valid, we tossed this data into a program that does have them, and the effect remained.

It's no substitute for regressions/ANOVA, but for now here's what Statwing can do: you can add a filter that excludes datapoints below, say, 40 years old. Looking only at folks older than 40, there's no relationship between [neuroticism and age], but the relationship between [neuroticism and preference] remains.

But, point taken. We'll count this as a vote for prioritizing regression. Thanks!
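
The filter check described above is a form of stratification: hold the suspected confounder roughly fixed and see whether the relationship survives. Here is a minimal Python sketch of the idea. The numbers are made up for illustration (they are not the demo's survey data), and `group`, `x`, and `y` are hypothetical variable names. In this toy data the group drives both `x` and `y`.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Synthetic rows of (group, x, y). Being in the "old" group shifts both
# x and y upward, but inside each group x and y are unrelated.
rows = [
    ("young", 0, 0), ("young", 1, 0), ("young", 0, 1), ("young", 1, 1),
    ("old", 5, 5), ("old", 6, 5), ("old", 5, 6), ("old", 6, 6),
]

def correlation_for(rows, group=None):
    """Correlation of x and y, optionally filtered to a single group."""
    kept = [r for r in rows if group is None or r[0] == group]
    return pearson([r[1] for r in kept], [r[2] for r in kept])

overall = correlation_for(rows)             # pooled data, no filter
within_old = correlation_for(rows, "old")   # filter: only the "old" group
within_young = correlation_for(rows, "young")

print(f"overall r = {overall:.3f}")
print(f"old only r = {within_old:.3f}, young only r = {within_young:.3f}")
```

In this toy data the pooled correlation is strong while the correlation inside each group is zero, which is the pattern you would expect if the third variable explained everything. The thread's example went the other way: the neuroticism/preference relationship remained after filtering on age.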

[+] recardona|13 years ago|reply

I do a lot of stat. analysis and I was impressed with the clarity of the analyses. However, I ran through the Obama v. Romney tutorial and was surprised to see that the software was averaging survey items (Likert-scale data). I thought that this was not allowed since it is troublesome to interpret the output (how do you interpret 8.36 Neuroticism?)

Aside from that, I can see this filling a need for those who are aware of the importance of statistical significance but do not have the time to look up the appropriate analysis function in R/SAS/SPSS/...

[+] mgurlitz|13 years ago|reply

You're right, they shouldn't be making that average. Specifically, Likert data is ordinal, meaning 14 is less than 15 and greater than 13, but the gap between 15 and 14 may differ from the gap between 14 and 13.

For example, let's say people measure neuroticism exponentially, and an increase of one point means 10x perceived neuroticism. Because mean(log(x)) != log(mean(x)), the average won't be representative.

Everything else looks OK though: count, median, percentiles, a histogram.
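
To make the mean(log(x)) != log(mean(x)) point concrete, here is a small Python sketch. The five responses are invented, and the "one point means 10x" scale is the hypothetical from the comment above, not a claim about how people actually rate neuroticism.

```python
import math
from statistics import mean, median

# Hypothetical 1-5 Likert responses from five people (not data from the demo).
responses = [1, 2, 2, 4, 5]

# Assumption from the comment above: each extra point means 10x more
# perceived neuroticism, so the underlying quantity is 10 ** score.
perceived = [10 ** r for r in responses]

# Averaging the labels vs. averaging the underlying quantity and mapping back.
mean_of_scores = mean(responses)
score_of_mean = math.log10(mean(perceived))

# The median only depends on order, so it survives the transformation.
median_of_scores = median(responses)
score_of_median = math.log10(median(perceived))

print(f"mean of scores:           {mean_of_scores:.2f}")
print(f"score of mean perception: {score_of_mean:.2f}")
print(f"median of scores:         {median_of_scores}")
print(f"score of median:          {score_of_median}")
```

The two means disagree by more than a full scale point, while the two medians agree. That is why order-based summaries (median, percentiles) are safe for ordinal data even when the spacing between points is unknown.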

[+] glaugh|13 years ago|reply

Thanks for the comment.

Agreed, Likerts are definitely an area of controversy.

A couple references from Wikipedia about Likert-as-continuous:

Pro: http://www.ncbi.nlm.nih.gov/pubmed/20146096

Con: http://xa.yimg.com/kq/groups/18751725/128169439/name/1Likert...

Our stance is generally a pragmatic one. It's very common practice to analyze data this way, so we enable it. But if you want to analyze your Likerts as ordinal data, you can change the variable type to Ranks (ie Ordinal), where we handle the result as you'd expect (nonparametric test, no averages). Note that this feature is disabled in the demo.

Thanks!

edit: formatting

[+] talbina|13 years ago|reply

There was a company that applied to YC that wanted to do the "Google Docs for Statistics" but was rejected. They wrote about it in a blog post but I can't find it. They ended up not launching.

It will be worth it to connect with these people to see if there is anything that can be learned from them.

[+] glaugh|13 years ago|reply

Definitely let us know if you think of their name or dig up their blog post. Sounds interesting.

[+] jenius|13 years ago|reply

Looks really great overall - props! One small design thing in there that bothered me was how the gradients reverse in the buttons on hover - this should never happen. Just lighten or darken the color on hover (move the gradient up with background position and add a transition is a good trick), then consider reversing the gradient on active (or just adding an inset shadow).

Everything else in the design looks great and this is totally nitpicky, but hope it helps!

[+] lejohnq|13 years ago|reply

I also work on Statwing so thanks very much for the comment.

Now that I look at it more, the front page buttons do look weird compared to all of our other buttons. We've become numb to it after looking at it so often. Most of our buttons do the design thing that you described, so we'll change that shortly! Thanks for the feedback.

[+] Bill_Dimm|13 years ago|reply

Very nice. One tip: Don't require an email address to provide feedback and you'll get more feedback.

A bug that I found in the tour for "Politics and the Big 5":

The instruction bubble says: To run a different analysis, remove "Neuroticism" from the white box by clicking the X to the right of the variable name. But, "neuroticism" is not one of the variables I was using. It seems that something was hard-coded when it shouldn't have been.

[+] glaugh|13 years ago|reply

Ah, thanks a bunch. Appreciate both the bug and the feedback tip. Have a good one, thanks for checking out Statwing.

[+] kylemaxwell|13 years ago|reply

This looks great and I look forward to running some analyses of the same test data between Statwing and Wolfram|Alpha Pro in a mini-bakeoff.

EDIT: Can you talk about your business model any? Sort of a freemium service, or maybe charging for a future API, or something along those lines? Please don't say "ads".

[+] glaugh|13 years ago|reply

Fortunately, people are pretty used to paying for this kind of a product. So we'll do freemium based on number/size of datasets uploaded and some as-yet-unreleased advanced features. Probably throw in some academic discounts for good measure.

Thanks for the question. Cheers!

[+] grantjgordon|13 years ago|reply

Very nice. Who's the target audience for this? Students? Curious enthusiasts? Analysts within companies?

[+] glaugh|13 years ago|reply

We think of our target audience in concentric circles. We'll likely have users from each circle at any given time, but we'll prioritize our product and marketing towards the inner circles then move outwards:

Circle 1. A few specific analysts in a few specific companies we're associated with. They analyze survey data, they use only basic functionality of the fancy tools, and they want a simpler solution.

Circle 2. People analyzing surveys generally. It's a straightforward application where existing tools are way too complicated.

Circle 3. The rest of the 50% of stats tool users that never use more than the core functionality of existing tools (that number is from our research).

Circle 4. People who analyze at work. In particular, Excel power-user analysts and marketing folks for whom the go-to tool for analysis is the pivot table. We want to ease them into the world of more powerful, statistical analysis. We do a lot of usability testing with these folks and we're excited about their reactions so far. But they're not in a lot of pain, so they're not a great initial audience for us.

Grand vision stuff: Tools like SPSS and the like were built in the 80s, and Excel pivot tables were built in the 90s. They've been updated but not overhauled, and there's a gaping hole between them in terms of ease of use and power. As small, rich datasets become ubiquitous, are people in 2020 really going to be using tools from 1990? We hope not.

[+] kirillzubovsky|13 years ago|reply

A statistics application that runs in the cloud and looks good too? Yes please! Looking forward to playing around with the data to see what's possible. Where are you guys planning to take this software?

[+] tel|13 years ago|reply

I'm worried about how quickly you can do tests with this interface. I feel my fingers urging for hypothesis hunting---do you have multiple comparison corrections in place?

[+] glaugh|13 years ago|reply

Totally valid. We do multiple comparison protection on ANOVA post hoc tests, but not across all analyses.

Ultimately we'll need to address this. Hopefully doing so (automatically) will differentiate Statwing from other stats packages, where one is quite free to shoot oneself in the foot (and one often does).

We'll count this as a vote for the prioritization of that feature.

Thanks for the comment, really appreciate it.
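
For readers unfamiliar with the concern: running many tests at a 5% threshold makes a false positive somewhere quite likely. One standard remedy is the Holm-Bonferroni step-down procedure, sketched below in Python. This is only an illustration of the general technique with made-up p-values. It is not a description of what Statwing does or plans to do, and `holm_bonferroni` is a hypothetical helper name.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared to alpha/m, the next to
        # alpha/(m-1), and so on.
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return rejected

# Chance of at least one false positive across 20 independent tests at 5%.
familywise_error = 1 - (1 - 0.05) ** 20

# Made-up p-values from five hypothetical comparisons.
p_values = [0.001, 0.04, 0.03, 0.2, 0.012]
naive = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)

print(f"family-wise error for 20 tests: {familywise_error:.2f}")
print("naive:    ", naive)
print("corrected:", corrected)
```

With these inputs, four of five comparisons pass the naive 5% cutoff but only two survive the correction.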

[+] hashpipers234|13 years ago|reply

I can do everything they can do in MATLAB with your data in less time and with less hassle. My only price is an X-Men comic book and a 6-pack of Coke.

[+] hokua|13 years ago|reply

Similar to what Swivel was trying to do. Great idea, nice execution, but really how will you monetize this? There is no real market for consumer grade "intuitive" statistical software. While this will appeal to casual data analyzers, these users aren't ready to spend much money on tools. And those doing data analysis for a living prefer their power tools: R, SAS, Matlab, NumPy, etc.

[+] glaugh|13 years ago|reply

Agreed that if you spend most of your day most days doing analysis of large datasets you probably need a power tool.

But there's a whole class of overlooked folks who need to do statistical analysis on smaller datasets on more of a weekly or several-consecutive-intense-days-per-month basis. These folks, who split time between Excel and stats tools, make up a surprisingly high proportion of the user base of stats products. And they tell us they're willing to pay for something that makes their analysis and communication more efficient.

Thanks for the comments, and for the kind words RE the idea and execution.

edit: And to be fair to your point, we're sort of comparing apples and oranges insomuch as you're looking at what we have now (not nearly enough) and we're looking at our roadmap for what we'll have in six months, a year, etc.

[+] jaylevitt|13 years ago|reply

What if they hire some stats bloggers and become the next Nate Silver - only with analyses that WE can all interact with and learn from?

What if this could improve statistical literacy?

[+] doleson|13 years ago|reply

Are there any plans to add in any realtime feeds? Like, say, weather data and the Dow Jones close, to see any correlations?

[+] fywacro|13 years ago|reply

Depends on what you mean by realtime. In most cases (stats-wise, anyway) analyzing a non-stored infinite-length data stream is a very different challenge from analyzing a stored, finite-length data set.

Streaming algorithms do exist for many basic statistical measures. But in many other cases, the best streaming algos aren't cheap or accurate enough to be useful.

Bucketing can sometimes substitute for a bona fide streaming algorithm. But again, there's plenty of cases where bucketing won't work well enough to make it useful.
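
As one example of a streaming algorithm for a basic measure, Welford's method keeps a running mean and variance in constant memory, one value at a time. The Python sketch below is illustrative only: `RunningStats` is a hypothetical class name, and the sample stream is made up.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one new observation into the running totals."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean times the deviation from the new one.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two values arrive."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Made-up stream; in practice values would arrive one at a time from a feed.
stream = [2, 4, 4, 4, 5, 5, 7, 9]
stats = RunningStats()
for value in stream:
    stats.push(value)

print(f"n={stats.n} mean={stats.mean:.3f} variance={stats.variance:.3f}")
```

Nothing is stored besides three numbers, which is what makes this workable on an unbounded stream. Measures such as exact medians have no equally cheap update rule, which is the harder case the comment alludes to.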

I haven't really looked at Statwing yet--the premise is really tantalizing, though. Gotta find an excuse to throw a spreadsheet in there and see what comes back.

[+] georgek|13 years ago|reply

+1. I really like the intuitive interface and the speed with which I can conduct analysis. It would be great to have a library of feeds for each user that is automatically curated / updated. This library could include both public datasets (for free) but also proprietary feeds specific to my industry or even my company that are only accessible by me (which I would pay for).

[+] jqueryin|13 years ago|reply

While I appreciate the graphs, I'd also like to see the numbers if I hover over wording that says "Very clearly significant". What confidence interval are we talking about? 95%?

If I were you, I'd hide this information from the average user but make it available in a tooltip to those of us who care.

[+] glaugh|13 years ago|reply

Good call RE the tooltip.

Just for reference, everything's at 95% confidence. We do mention that in the Advanced output but it's perhaps a bit too hidden.

[+] leeny|13 years ago|reply

The optional upgrade survey appears to be broken. After I submit the survey, I get redirected back to the login page with my username in the query string. After I click "login", I get the alert telling me I can take a survey to upgrade. Rinse. Repeat.

[+] dlf|13 years ago|reply

I absolutely love this. I'm learning to code (slowly) and have an infatuation with data visualization. I've imagined what something like this might look like, and I think you guys absolutely nailed it. Well done!

[+] dlf|13 years ago|reply

P.S. I shared this with the Maxwell School alumni group on LinkedIn, so hopefully that drives some traffic your way! I think that my fellow MPA alums will dig it.

[+] duaneb|13 years ago|reply

Very cool. Why should I use this instead of R/gnuplot?

[+] lejohnq|13 years ago|reply

Thanks!

We are trying to make Statwing automatically display the right analyses for the portions of your data you are most interested in.

If we can accomplish that, then hopefully we've helped make you faster at understanding the relationships in your data. Maybe that is enough so you don't need to break out R for basic analyses. Otherwise I would also use R. I made some graphs in R that I wouldn't be able to do in Statwing right now, but if we can output the right things based on your data then hopefully you could save some time with us.

[+] leeny|13 years ago|reply

I'd like to throw in another vote for prioritizing regressions and specifically adding logistic regressions to the mix. Thanks!

[+] mcarvin|13 years ago|reply

demo is very cool. love anything that can make pattern recognition in large datasets this much easier.

[+] danso|13 years ago|reply

1. Great looking product. I clicked through a little bit and liked the general polish, but didn't have time to explore everything.

2. For people whose jobs involve statistical analysis, how much need is there for something like this? The more analysis I do, though, the more I realize that the hard part is collecting the data and programmatically "piping" it from package to package...And from professionals I know in various numbers-based industries, their biggest blind spot seems to be that ability to gather data that doesn't come in a CSV/Excel sheet for them.

* edit: in addition to the challenges presented above, the challenge of cleaning data so that a package like Statwing can do a proper analysis

[+] glaugh|13 years ago|reply

Thanks, really appreciate #1.

Agreed that quite often the hardest part is getting the data together (particularly on the web). But from our perspective, it's still true that conducting the actual analysis and visualizing the data should be a lot easier than it is. And that's particularly true if you're in our initial audience, the roughly 50% of SPSS/R/Minitab/etc. users who never use anything past the basic functionality of those programs.

I guess a simpler answer is that we think there's a need because this is a product that I badly wanted when I was an analyst/consultant, splitting time between Excel and the basic functionality of SPSS.

edit: Also, and this isn't very helpful, but we talk to a ton of people about their data analysis needs, and we hear a good chunk of them talk a lot about the pain of using highly technical solutions for relatively simple problems like analyzing a survey.

[+] swalsh|13 years ago|reply

"For people whose jobs involve statistical analysis, how much need is there for something like this?"

I don't know who the creators are targeting, but perhaps it's not experienced people? One of the great parts of the internet is the way it seems to lower the barrier of entry for just about everything. If this app can help people learn how to do simple statistical analysis, there just might be a load of value there. I didn't really click around, but something they might think about doing is allowing users to hyperlink directly to an "analysis session" so a blogger can not only link to a graph, but the data itself. I can imagine a scenario where a blogger writes a post about maybe housing prices, and draws some unreasonable conclusion. Then a reader goes, and adds in inflation data which changes the story. He then replies in the comments with a link to his new "analysis session" spurring a new conversation.

It's probably taking the tool in a different direction, but I really like what's here.

50 comments