mkaufmann | 6 years ago | on: Umbra: an ACID-compliant database built for in-memory analytics speed
mkaufmann's comments
mkaufmann | 7 years ago | on: Alibaba acquires Berlin-based data Artisans for $103M
One part of the acquisition is the integration of Alibaba's in-house modifications, called Blink. It will be interesting to see what is behind that. To me it looks like this could be a very healthy collaboration.
mkaufmann | 10 years ago | on: Why Engineers Can’t Stop Los Angeles' Enormous Methane Leak
Regarding your correction: the percentage is now too low (0.002%). I think you used the factor value of ~0.002 directly as a percentage. So you probably meant 0.2%, which would be close to my number.
Using your numbers:
30,000 / 18,200,000 = 0.00164835... ≈ 0.16% ≈ 0.2%
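For the record, the correction can be double-checked in two lines of Python, using only the numbers from this comment:

```python
# Leak emissions vs. total emissions, both in metric tons CO2e per day.
leak = 30_000
total = 18_200_000
percent = leak / total * 100
print(round(percent, 2))  # 0.16, i.e. ~0.2%, not 0.002%
```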
mkaufmann | 10 years ago | on: Why Engineers Can’t Stop Los Angeles' Enormous Methane Leak
The leak is emitting 1,247 metric tons of CO2e per hour, which is 29,928 metric tons of CO2e per day. So it's 0.17% instead of 9% of nationwide emissions.
mkaufmann | 10 years ago | on: Why Engineers Can’t Stop Los Angeles' Enormous Methane Leak
EDIT: The old version had used the wrong conversion factors, now corrected
The central number is the exhaust mass of 110,000 pounds per hour. So how much is this really? It is about 50 metric tons of methane per hour. To compare it with other greenhouse gas emissions, we can calculate the CO2 equivalent by multiplying the pounds figure by 0.01133[1], giving a rate of 1,247 metric tons CO2e per hour.
Using the EPA online tool[2] we can relate this to the total emissions of California or the US. The total methane emissions measured in CO2e for California in 2014 were 9,546,270 metric tons CO2e. Converted to a rate per hour, this gives 1,089 metric tons CO2e per hour.
So while the well is leaking, it is releasing 114% of California's normal methane emissions.
Compared to all greenhouse gas emissions, the well causes an increase of 10% in California and 0.3% at the US national level, relative to the emissions from large facilities.
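As a sanity check, here is the arithmetic in a few lines of Python; the 0.01133 conversion factor and the EPA totals are taken from the text above, and 8,760 hours per year is the standard conversion:

```python
# Leak: 110,000 lbs of methane per hour, converted to metric tons CO2e
# with the EPA factor 0.01133 (lbs CH4 -> t CO2e).
leak_tco2e_per_hour = 110_000 * 0.01133          # ~1,246 t CO2e/h
# California's 2014 methane emissions: 9,546,270 t CO2e per year.
ca_tco2e_per_hour = 9_546_270 / (365 * 24)       # ~1,090 t CO2e/h
ratio = leak_tco2e_per_hour / ca_tco2e_per_hour  # ~1.14, i.e. ~114%
```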
[1] http://www3.epa.gov/gasstar/tools/calculations.html
[2] http://ghgdata.epa.gov/ghgp/main.do
[3] http://www.laweekly.com/news/what-went-wrong-at-porter-ranch...
mkaufmann | 10 years ago | on: NoDB: Efficient Query Execution on Raw Data Files
The whole premise of the paper is that data can be analyzed in situ. That means loading it from its original place, in its original format, without any prior transformation. This is in contrast to the traditional approach of database systems, where the data has to be loaded into the database first.
This paper describes how unprocessed, unindexed data can be used efficiently to answer queries with a database system. The novelty of the approach is mostly that they build an index on the fly that can be reused later, and that they examine the idea of directly using the raw files. As for the efficient loading of CSV files, which is also mentioned in the NoDB paper, I think those details are better described in a later paper from TU München[1] that examines this aspect in more depth.
HDT for RDF, or Parquet[2] and ORCFiles[3] in the Big Data space, are (binary) formats where the data is already processed and stored more efficiently than in plain old text CSV files. Creating these files is already comparable to loading the data into a database; the only difference is that the storage format is open and can be used by many systems. So it's a completely different setting.
Still, it's an interesting thought to make databases aware of the indexed information in those file formats beyond CSV, so that they too can be used directly without loading.
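To illustrate the core idea, here is a minimal sketch of a positional map in Python. This is my own toy illustration of the concept, not the paper's implementation: a first scan over the raw file records byte offsets, so later queries can seek directly to a row instead of re-parsing the file from the start.

```python
import io

def build_positional_map(f):
    """First pass over the raw file: record the byte offset of every row."""
    offsets = []
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)
    return offsets

def fetch_row(f, offsets, i):
    """Later queries seek straight to row i instead of re-scanning."""
    f.seek(offsets[i])
    return f.readline().rstrip("\n").split(",")

# Stand-in for a raw CSV file on disk.
raw = io.StringIO("id,name\n1,alpha\n2,beta\n")
offsets = build_positional_map(raw)
row = fetch_row(raw, offsets, 2)  # -> ['2', 'beta']
```

The real NoDB positional map is adaptive and tracks offsets down to the attribute level, but the access pattern is the same.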
[1] http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf
[2] https://github.com/Parquet/parquet-format
[3] https://cwiki.apache.org/confluence/display/Hive/LanguageMan...
mkaufmann | 10 years ago | on: 15 Biggest Ships Create More Pollution Than All Cars in the World (2013)
When talking about pollutants it is important to keep the following points in mind:
- Is the pollutant affecting health?
- Is the pollutant affecting climate change?
- Is the amount of pollution locally concentrated or widely distributed?
Cargo ships typically have very high emissions of nitrogen oxides (NOx) and sulphur dioxide. When emitted by cars or factories on the mainland, these often contribute strongly to harmful smog, especially in megacities or cities with poor ventilation. They can also be generally bad for the ecosystem, including over water, by causing acid rain etc.
The main pollutants from cargo ships have a very strong short-term effect but are often out of the air within a few weeks. Because of that they don't rise into the atmosphere and don't directly contribute to long-term climate change. Thus I think the comparison in the article is very dangerous. When considering pollutants that affect climate change, cars are more dominant.
So depending on which effects are discussed, reducing pollution from cars can still be beneficial. As an additional thought, the pollution from cargo ships is spread out over a very large geographic area, while the exhaust of cars is much more concentrated around cities. So when considering one's own quality of living, cars have a much bigger impact.
I think an article that manages to discuss the subject better is this one from The Guardian: http://www.theguardian.com/environment/2009/apr/09/shipping-...
Excerpts:
- Shipping is responsible for 18-30% of all the world's nitrogen oxide (NOx) pollution and 9% of the global sulphur oxide (SOx) pollution.
- Shipping is responsible for 3.5% to 4% of all climate change emissions
mkaufmann | 10 years ago | on: High Quality Video Encoding at Scale
Another problem is that you have to encode the movie once for each codec profile times the number of different bitrates per profile. The article mentions four profiles (VC1, H.264/AVC Baseline, H.264/AVC Main and HEVC) and bitrates ranging from 100 kbps to 16 Mbps. Assuming there are 20 different bitrates per codec, you already get 4*20 = 80 encoded copies per source. But of course this can be solved by parallelism.
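The combinatorics can be sketched in a few lines of Python; the four profiles are from the article, while the 20-rung ladder spaced geometrically between 100 kbps and 16 Mbps is my own assumption for illustration:

```python
from itertools import product

profiles = ["VC1", "H.264/AVC Baseline", "H.264/AVC Main", "HEVC"]
n_rungs = 20  # assumed number of bitrates per profile
ratio = (16_000 / 100) ** (1 / (n_rungs - 1))
bitrates_kbps = [round(100 * ratio ** i) for i in range(n_rungs)]

# One independent encode job per (profile, bitrate) pair.
jobs = list(product(profiles, bitrates_kbps))
print(len(jobs))  # 80
```

Since every job is independent, the full matrix can be farmed out to as many workers as are available.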
mkaufmann | 10 years ago | on: Does Google crawl dynamic content?
Trying to find any of the other search strings from the article for the different loading variants does not return any results. So no variant of JavaScript-injected content is currently working on Bing.
mkaufmann | 11 years ago | on: Show HN: ACM SIGMOD Programming Contest – Optimistic Concurrency Control
mkaufmann | 11 years ago | on: How to get into an admin account on a Windows computer
The settings can be changed with bcdedit:
bcdedit /set {default} recoveryenabled No
bcdedit /set {default} bootstatuspolicy ignoreallfailures
Additionally, booting from USB etc. should be disabled in the BIOS/UEFI options, and access to those settings should be password-protected. Furthermore, because the person has physical access, the computer should be locked away so that the hard drive can't be accessed. All cables should also be secured so that no sniffer can be plugged in between; this especially includes the USB ports on the monitor, if those are enabled.
mkaufmann | 11 years ago | on: Germany plans highway test track for self-driving cars
Also, having a real highway allows testing scenarios like a street partially blocked due to construction, testing at night, in rain, in snow, etc. Google's cars currently don't work under such conditions, so there are still many problems to solve.
[1] http://wwwlehre.dhbw-stuttgart.de/~reichard/content/person/d... (Sorry no english news)
mkaufmann | 11 years ago | on: Columnarization in Rust
Using your original benchmark I then show that the row layout brings a large performance win in the benchmark. My numbers show this performance win not just in throughput (bytes per sec.) but also in "goodput" (values per sec.). Check my previous comment. I just noticed, though, that I sometimes forgot the k in the reported numbers; where it is missing you have to multiply the number by 1000. I can't edit it anymore.
I guess we can agree to disagree and should continue the discussion in another form ;)
PS: I just updated my github information, I am now at the TU Munich
mkaufmann | 11 years ago | on: Columnarization in Rust
1. In my comment I acknowledged that there are space savings for types that include padding, so I don't understand why your answer implies that I didn't understand this point. Regarding my comment about not copying data: it was based on the benchmark[1] linked in the blog post. The parts you measure do not copy any of the user data! They only convert your internal vectors of user types into a list of vectors of u8 type. So what you are doing is essentially moving the data. But when using move semantics the size of the user data no longer matters, so there won't be a difference between your column layout and using the more complex types directly inside the vectors, if measured the same way as in this benchmark. In my opinion your benchmark is flawed and does not support the argument you make in your blog post.
2. Ok, yes, that was bad wording on my side. If you provide adaptors for complex types with pointers, like Vec, then you can of course also serialize those.
3. I guess this is just about the same argument as point 2) and thus redundant.
I made the effort to write a serialization framework for you which does not do "columnarization" but instead simulates a row layout, added an adaptor for Vec, and ran it with the same benchmark; I also reran the original columnar benchmark on my machine. Both benchmarks were compiled exactly the same way. You can find my code here[2]. Here are the results:
==============================
columnarization <uint> 12.1 GB/s, 742k values/s
columnarization <(uint, (uint, uint))> 5.3 GB/s, 107k values/s
columnarization <vec<uint>> 2.1 GB/s, 54k values/s
columnarization <Option<uint>> 1.58 GB/s, 163k values/s
==============================
row (simple_serialize) <uint> 12.1 GB/s, 730k values/s
row (simple_serialize) <(uint, (uint, uint))> 8.83 GB/s, 164k values/s
row (simple_serialize) <vec<uint>> 2.13 GB/s, 56k values/s
row (simple_serialize) <Option<uint>> 8.82 GB/s, 263k values/s
==============================
You can see that columnarization does not have a performance benefit in your benchmark, and it is even significantly slower for the Option<uint> type and for pairs.
In your blog post you never mention that columnarization only has the potential to bring performance benefits for de-/serialization of types with large padding overheads. I think that discussion would have helped the blog post. It would be much better if it either omitted the currently wrong performance argument and just focused on the nicely typed API, or used a proper benchmark that supports your argument, which from my understanding is only possible in a very limited set of use-case scenarios.
Here is a list of other valid arguments you could have made instead:
* The format saves space at the cost of performance.
* It is better than repr(packed), as it also works on platforms that don't support unaligned access.
[1] https://github.com/frankmcsherry/columnar/blob/master/exampl... [2] https://github.com/mkaufmann/simple_serialize
mkaufmann | 11 years ago | on: Columnarization in Rust
> "... columnarization, a technique from the database community for laying out structured records in a format that is more convenient for serialization than the records themselves."
Column stores, in comparison to row stores, don't offer any serialization benefit per se. The main benefits are the following; I will use a record (A,B,C,D,E) as an example, with all fields of type u32 (4 bytes):
* If you only use some fields, you have to load less data from memory/disk into the CPU cache, and your working set is more likely to fit into cache. For example, when filtering only the records where A=22 and B=45, for x records you only have to load x*(sizeof(A)+sizeof(B)) = 8x bytes instead of x*record_size = 20x bytes. This can make a very significant difference.
* When using compression to reduce the size of the data, columns can often be compressed better because they only contain data of the same type and nature and thus probably share similarities. With such a small record consisting only of integers it probably won't make a difference. But if, e.g., some fields are country abbreviations or textual descriptions and others are ids, one could easily imagine that there are gains.
Coming back to the point about serialization: using the technique described in the blog post, there won't[1] be a performance difference between column storage and row storage (e.g. using a struct). The method described in the blog post just wraps the data array of the original vector in a Vec<u8> without even moving the memory, so it is independent of the data type stored in the vectors. Of course it only works for data types that do not contain references; otherwise we could get illegal memory accesses after deserialization (this should be guaranteed by the Rust type system because only Copy types are allowed).
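The zero-copy wrapping described above can be illustrated with Python's standard library as a stand-in for the Rust code (my own sketch; `array`/`memoryview` here play the roles of Vec<T>/Vec<u8>):

```python
import array

# Four 8-byte signed integers, analogous to a Vec<i64>.
vals = array.array("q", [1, 2, 3, 4])
# Reinterpret the same buffer as raw bytes -- no data is copied.
raw = memoryview(vals).cast("B")
assert len(raw) == 4 * 8   # 32 bytes, exactly the payload size
assert raw.obj is vals     # both views share one underlying buffer
```

Because no bytes move, the cost is independent of the element type, which is exactly why such a benchmark cannot distinguish row from column layout.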
The only thing this benchmark is testing is how fast a vector can be initialized.
[1] There can be a space improvement from keeping the data in a column layout compared to a row layout with normal structs. Structs normally pad their total size up to a multiple of the alignment of the largest field. A struct containing an i64 and an i8 would contain 7 bytes of padding; in a column layout this overhead is avoided. Still, there would be no improvement in this serialization scheme, as it does not actually copy any data.
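The padding overhead described in [1] can be observed directly with ctypes (my own sketch; the sizes hold on a typical 64-bit platform):

```python
import ctypes

class Row(ctypes.Structure):
    # An i64 plus an i8: the struct is padded to 16 bytes so that
    # consecutive Rows in an array keep the i64 field 8-byte aligned.
    _fields_ = [("big", ctypes.c_int64), ("small", ctypes.c_int8)]

row_bytes = ctypes.sizeof(Row)  # 16: 9 bytes of data, 7 of padding
# Column layout: one array per field, so no padding at all.
col_bytes = ctypes.sizeof(ctypes.c_int64) + ctypes.sizeof(ctypes.c_int8)  # 9
```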
mkaufmann | 11 years ago | on: How shaving 0.001s from a function saved $400/mo on Amazon EC2
My initial assumption is that the 2.2M pages per ~18h are the main workload. This is also supported by the chart at the bottom: outside of the 18h timespan there is hardly any base load. The blog post additionally gives the following facts: 18 c1.medium instances and ~60% utilization after the optimization (taken from the chart).
Now this allows us to calculate the time per page. First, the total workload per day is num_machines * cpu_time_per_machine = 18 machines * (18h * 0.6) = 194h of processing per day.
At the page level this is then 194h / 2.2M = 317ms per page.
This feels really slow, and it should even be multiplied by two to get the time per CPU core (the machines have two cores)! I would guess that the underlying architecture is probably either node.js or Ruby. Based on these performance characteristics, the minimum cost for this kind of analysis is $25 per day. For customers this means that, on average, the value per 1k analyzed pages should be at least $1.13. I think this is only possible with very selective and targeted scraping, given that this only includes extracting raw text/fragments from the web pages and does not include further processing.
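The back-of-the-envelope calculation can be checked in a few lines of Python (all inputs taken from the comment above):

```python
machines, hours_active, utilization = 18, 18, 0.6
pages = 2_200_000

cpu_hours_per_day = machines * hours_active * utilization  # 194.4 h
ms_per_page = cpu_hours_per_day * 3_600_000 / pages        # ~318 ms
ms_per_core = ms_per_page * 2                              # two cores per c1.medium
```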
mkaufmann | 11 years ago | on: Higgs JavaScript Virtual Machine
[1] http://pointersgonewild.wordpress.com/2014/11/14/the-fastest...
[2] http://databasearchitects.blogspot.de/2014/09/experiments-hu...
EDIT: Many conferences also allow publishing papers without a deep comparison to existing research in an industrial session. This allows demonstrating interesting implementation variants or system choices.
EDIT 2: The review criticism in older blog posts ("Conference reviewers criticized us for not discussing compilation times, and raised the issue that perhaps basic block versioning could drastically increase compilation times.") is also very valid for a JIT compiler. Again, discussing it does not mean that you have to be faster than all existing systems. Paper acceptance is always a bit of a random process, but at least in this case the review comments are valid from my point of view, and her PhD advisor should probably have caught these problems while proofreading before submission. I really hope that the paper will eventually be accepted!
I especially like the super fast CSV scanning!