yuanchuan's comments

yuanchuan | 10 years ago | on: Ask HN: How to handle 50GB of transaction data each day? (200GB during peak)

I once worked on a similar project. Each day, about 5TB of data came in.

If your data is event data (e.g. user activity, clicks, etc.), it is non-volatile data that should be preserved as-is, and you will want to enrich it later for analysis.

You can store these flat files in S3, use EMR (Hive, Spark) to process them, and store the results in Redshift. If your files are character-delimited, you can easily create a table definition with Hive/Spark and query them as if they were in an RDBMS. You can process your files in EMR using spot instances, and it can cost less than a dollar per hour.
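To illustrate the table-definition step, here is a minimal sketch that builds a HiveQL `CREATE EXTERNAL TABLE` statement for delimited text files. The helper name, table name, columns, and the `s3://example-bucket/events/` path are all hypothetical and only for illustration; the real schema depends on your files.

```python
def hive_external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL statement that maps delimited text files to a table.

    `columns` is a list of (name, hive_type) pairs. The files stay where
    they are; Hive only records the schema and the location.
    """
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket, for illustration only.
ddl = hive_external_table_ddl(
    "click_events",
    [("event_time", "STRING"), ("user_id", "BIGINT"), ("url", "STRING")],
    "s3://example-bucket/events/",
)
print(ddl)
```

Once a statement like this has been run in Hive, the files under that location can be queried with ordinary SQL (e.g. `SELECT url, COUNT(*) FROM click_events GROUP BY url`) without loading them anywhere first.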

yuanchuan | 10 years ago | on: Crystal, iOS ad blocker, to accept money to let ads through

Correct me if I'm wrong. I watched the Safari Content Blocker video that was presented at WWDC 2015, and it mentioned that the list of content to be filtered is compiled to bit code instead of being read as a JSON file, which makes it more efficient and less draining on the CPU. Since it is compiled down to bit code, 32-bit devices will not be compatible with 64-bit, and that's why only the newer iPhones and iPads are compatible. It is not that the iPhone 5 is not powerful enough; its CPU architecture simply doesn't support it.

yuanchuan | 10 years ago | on: Launching a product in just 3652 days

Can totally relate to this. I have written, scrapped, and re-written the code a few times over the past 4 years (1461 days). I am almost there!

Great advice and now I need to get things started again.

yuanchuan | 11 years ago | on: Command-line tools can be faster than your Hadoop cluster

On-premise cluster.

Cloud solutions are totally out due to the nature of the data. Not everything can be done in the cloud.

If you have such a huge amount of data, the total time it takes to transfer it there and compute is not competitive with an on-premise solution, unless all your data already lives in the cloud.
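As a rough back-of-the-envelope sketch of the transfer cost, the function below estimates upload time for a given data volume. The 5TB volume and the 1 Gbit/s link are assumed figures for illustration, not measurements, and `efficiency` is a hypothetical fudge factor for protocol overhead and contention.

```python
def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate the hours needed to move `size_bytes` over a network link.

    `efficiency` is the assumed fraction of the nominal link speed that is
    actually achieved (1.0 means the link is fully used).
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


TB = 10**12  # decimal terabyte

# Assumed figures: 5TB per day over a dedicated 1 Gbit/s uplink.
hours = transfer_hours(5 * TB, 10**9)
print(f"{hours:.1f} hours")  # prints "11.1 hours"
```

Under these assumptions, almost half of each day goes to the transfer alone before any computation starts, and a link that only reaches half its nominal speed doubles that.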

yuanchuan | 11 years ago | on: Command-line tools can be faster than your Hadoop cluster

It is the buzz surrounding Hadoop that makes people misunderstand its uses and capabilities. I have met non-technical analysts who want RDBMS performance on Hadoop. They expect queries on hundreds of GB of data to finish in seconds to minutes.

I always throw this analogy at people who misunderstand Hadoop: would you crack an egg with a stone or a spoon?

Hadoop and RDBMS only have a thin overlapping region in the Venn diagram that describes their capabilities and use cases.

Ultimately, it is cost vs efficiency. Hadoop can solve all data problems. Likewise for RDBMS. This is an engineering tradeoff that people have to make.
