top | item 9760606

Building Analytics at 500px

173 points| emre | 10 years ago |medium.com | reply

42 comments

order
[+] snotrockets|10 years ago|reply
A free (as in speech and beer) alternative to Periscopre.io would be http://redash.io. You’ll have to host it yourself, but it supports more than just SQL, and is centered around the idea of sharing queries, not reports, which helps shifts the discussion from “how did you get those numbers” to “what do those numbers mean”.
[+] arikfr|10 years ago|reply
re:dash author here if anyone has any questions.

Our company (EverythingMe) had a very similar journey to the one Samson describes (Splunk, MySQL, logs, ... -> Redshift). At the time Persicope wasn't there (or we didn't find it), so we built re:dash instead.

[+] dvirsky|10 years ago|reply
It also has a nice forking model that lets you take other people's work and extend it for your needs, which is super useful if like me you're not part of the analytics team and don't remember the structure of the analytics tables by heart. You just take a report that resembles what you want, and fork it.
[+] garthz|10 years ago|reply
Airflow by airbnb is a nice alternative to luigi. We've been using it for our ETL and it's been working greatly so far. http://airbnb.io/projects/airflow/
[+] mycelium|10 years ago|reply
Can you write a little more about your evaluation of the two?

We've been using the python celery task queue library to coordinate python ETL-esque jobs, but as our setup gets more heterogenous language and job wise we're looking for something built more explicitly for the purpose. Luigi and Airflow are both contenders.

[+] akoumjian|10 years ago|reply
One of my main takeaways from this article is that there is no such thing as easy analytics. In particular, you have to put though and energy into designing your analytics data stores and your ETL process.

It's a bit of a relief to see this after wondering why I've been unable to find an analytics solution that felt complete.

[+] georgewfraser|10 years ago|reply
For anyone setting up a similar system, our product https://fivetran.com automates the most annoying part of this: getting the data into Redshift! We support MySQL, Postgres, Salesforce, and lots of other data sources.
[+] elliott34|10 years ago|reply
yeaaaaaahh but is that really the most annoying? dimensional modeling, getting all the joins right, scripting all the transforms (the T in ETL) is pretty annoying. Getting data into redshift is like, just the beginning. Not to mention, skips the transform part, at least.
[+] Dramatize|10 years ago|reply
Sorry if I'm getting this mixed up.

Snowplow is shown on your pricing page. Doesn't Snowplow already connect directly with Redshift? Or was Fivetran built before Snowplow finished their new setup?

[+] manigandham|10 years ago|reply
Side note - you have caching disabled on the html, css and images of the site, any reason why?
[+] kiyoto|10 years ago|reply
How do you deal with events with varying schema?
[+] jboggan|10 years ago|reply
Great post.

I didn't have this to read several months ago so I did end up writing my own ETL solution. It has been fun though and now we're into the analytics phase and using Chart.io - did you compare that to Periscope? It's a tad pricey but the charts are pretty easy to generate using their query builder if you don't want to get dirty with SQL (or have team members who don't have SQL skills yet).

I do think that rolling your own ETL can be rewarding though - especially if you are wrapping each attempt at an ETL process in a class and storing long term data about the monthly/daily/hourly/irregular processes for internal analysis, forecasting, bug-reporting, and providing fodder for visuals to sell the rest of the team on what you're doing.

[+] uberneo|10 years ago|reply
Little bit more about luigi would be handy .. I believe same can be achieved using Pentaho Kettle as that has a complete workflow structure and its Opensource as well .. however any more examples or use cases of luigi?
[+] omouse|10 years ago|reply
This seems to echo something I said in another thread; the hackers are further away from their users:

Even for technical people, having access to analytics is helpful. Being informed about product metrics helps developers see the ‘why’ behind their work — a key ingredient to high performance

You mean that the technical people who actually implement the software are so alienated from their users that they have to wait for a royal decree to spur them to work?

I wonder how few of their ideas flow back up to the top.

I like the article; it's nice to see one use case of ETL. Too often people directly hit the production database to get reports instead of ETLing into a read-only db that won't affect site performance.

[+] chrisacree|10 years ago|reply
Is there a good guide to getting started using logfiles as the basis for analytics? Best practices on using client vs server-side signals (and if both, not double-counting), etc. I'm interested in learning more about it, but a google search doesn't turn up anything relevant. Snowplow seems to be using a setup like this, but not sure how they'd fit in with what the author was discussing.

I'm using Django, but I imagine this kind of system would be mostly platform-agnostic.

[+] burnout1540|10 years ago|reply
How much of this could be accomplished using something like Mixpanel? Honestly curious.

I know a true ETL solution like this will be more powerful and flexible, but how do you decide when it's worth the investment? Are there key use-cases or reports that only this type of solution can provide? Is the main issue with external analytics (like Mixpanel) that you need to know what data to collect before you can report on it?

[+] jdwyah|10 years ago|reply
MixPanel is nice, but its support for multiple sources of data is not very good. You can slice and dice events, but if you want to join that to your user database, things get awkward/not-possible pretty quickly.

The 2nd thing is that MixPanel is write-once, read-forever. aka if you misspell something or change definitions you end up in a bad place. These ETL processes are much more capable of adapting and normalizing a schema over time.

[+] patrickmclaren|10 years ago|reply
I recently worked on a similar project (with Privy.com), where Mixpanel was one of the external providers we tracked with. Mixpanel handles (AFAIK) an arbitrary number of properties. It was a lot easier than say Google Analytics, where custom dimensions are limited.
[+] gesman|10 years ago|reply
Excellent writeup. I love Splunk for many reasons (built a custom security App for Splunk as a pretty successful fraud detection tool at an enterprise financial firm).

Did you ever looked into Splunk DB connect app?

It allows to use structured DB (such as MySQL and others) as a data source for Splunk. I haven't used it yet but interested in feedback.

[+] kristianp|10 years ago|reply
There's a grammar error in the first sentence. "an quick" should be "a quick".

Great, detailed article though.

[+] mrandrewandrade|10 years ago|reply
Great by post by Samson! I sometime wish I could tell stories and explain like he can
[+] uberneo|10 years ago|reply
Can Airflow be used to load data from a file in filesystem( not in Hive ) to MySql .. also instead of calling a Bashscript can we call a PHP script.. Any pointers will be helpful.
[+] mistercrunch|10 years ago|reply
A PythonOperator using the MySqlHook should do the trick. Oh and you can call a PHP script from BashOperator.
[+] lenguyenthedat|10 years ago|reply
Nice post! We went through a very similar process back in Zalora (http://www.zalora.com) building our own Data Warehouse and Analytics function for over 300 internal users at the time.

Back when Redshift was on its initial beta release (2012), there almost wasn't any ETL / charts tool available for us, so we ended up building most of the tools and libraries ourselves. A few of them were open source too:

- https://github.com/zalora/redsift A web-interface SQL tool for Redshift, letting the user to query and also export them to S3 (and send an email alert to user once done.)

- https://github.com/zalora/postgresql-user-manager for managing user privileges

- https://github.com/lenguyenthedat/aws-redshift-to-rds for copying Tables from Redshift to RDS (postgresql)

- https://github.com/zalora/kraken a bit similar to luigi or airflow.

Things has changed quite a lot since then, there are a lot of great solutions to our problems that are either free or very cheap and production-ready:

- redash.io (https://github.com/EverythingMe/redash) - web-interface SQL tool with visualization - FREE

- redshift_console (https://github.com/EverythingMe/redshift_console) - redshift ops tool - FREE

- flydata (https://www.flydata.com/resources/flydata-sync/sync-rds-mysq...) sync live data from MySQL to Redshift - Subscription base

- dreamfactory (https://aws.amazon.com/marketplace/pp/B00GXYDK18?sr=0-3&qid=...) for providing REST API interface to your database (supports redshift and other databases) - FREE

We didn't use Tableau in Zalora either (due to pricing and the number of users that we have in-house), and ended up building our own customized data dashboards with d3js and a few other different frameworks.

However, as long as you are ok with the price, Tableau is pretty good. It's being use widely in my current company (http://commercialize.tv) :

- You can minimize processing from Tableau server by just create another data mart layer from Redshift with your ETL tools / scripts, having Tableau connecting directly to it.

- Visualizations / Charts creation process is pretty much straightforward. The end-result will look exceptionally comparing to other solutions that we have tried.

- They also have a really good and active community.

[+] misthop|10 years ago|reply
I thought this would be about analytics visualization dashboard with little real estate (500px). Had no idea that 500px was a company name.
[+] Splendor|10 years ago|reply
They're an interesting company IMO. They've done a good job of 'gamifying' photo sharing. I have at least one relative that's addicted to it and pays for an annual subscription.

I'm not sure how well they're doing business-wise. I did notice that they recently pivoted one aspect of their business. They used to facilitate users selling prints of their works; now they've dropped that service and are building a stock photo service instead.

[+] palmeida|10 years ago|reply
For anyone looking for a visualisation solution reasonably priced were you don't need to write SQL http://www.viurdata.com We will be launching support for Amazon Redshift very very soon.
[+] taylozac85|10 years ago|reply
Nice post.

Looker (http://looker.com) is an alternative to Periscope that a ton of venture-backed tech companies are using (mostly in conjunction with Redshift). Here are two posts from Buffer and SeatGeek on their stacks using Redshift, Luigi and Looker. Buffer: https://overflow.bufferapp.com/2014/10/31/buffers-new-data-a.... SeatGeek: http://chairnerd.seatgeek.com/building-out-the-seatgeek-data...

Looker does not necessitate each member of the team to know SQL to explore the data or create reports. It also has a text-based modeling language that is a thin abstraction layer of SQL. It makes SQL modular and reusable making it far more efficient to support a wide range of analysis. It's more expensive than Periscope, but it's way more powerful.

[+] jonbishop|10 years ago|reply
Hey Zack, it’s probably a good idea to disclose that you work at Looker if you’re going to talk about them on HN. https://www.linkedin.com/profile/view?id=145490726

I run marketing for Periscope. Like Chartio, you guys have a great product. We routinely send people your way when they have more business user needs and aren’t a good fit for Periscope.

For those evaluating tools like Periscope/Looker/Chartio:

Whereas Looker focuses on business users, we’ve built Periscope around serving SQL analysts better than anyone else.

If you don't have someone who knows SQL on your team, Looker (or Chartio) is probably a better fit. (though I should mention that some customers have learned SQL through our customer support :)

We’re focused on the SQL analysts who are building dashboards and running reports, day in, day out. With that said, every one of our customers has employees who don’t know SQL, but need to dive into their data. So we’ve built a ton of features to make this easy as well.

If you want to find out more, visit https://www.periscope.io. If Periscope looks like a good fit, sign up and we'll get you set up with a free trial.