mkaufmann's comments

mkaufmann | 7 years ago | on: Alibaba acquires Berlin-based data Artisans for $103M

I've been following data Artisans for a while and love their tech! It seemed that not being from Silicon Valley made it harder for them to gain traction.

One part of the acquisition is the integration of Alibaba's in-house modifications, called Blink. It will be interesting to see what is behind that. To me it looks like this could be a very healthy collaboration.

mkaufmann | 10 years ago | on: Why Engineers Can’t Stop Los Angeles' Enormous Methane Leak

Damn these numbers! :D

Regarding your correction: the percent value is now too low (0.002%). I think you used the factor value of ~0.002 as a percentage. You probably meant 0.2%, which would be close to my number.

Using your numbers:

30000 / 18200000 = 0.00164835… ≈ 0.16% ≈ 0.2%
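The factor-vs-percent mix-up is easy to check in a couple of lines (a quick sketch using the numbers from this thread; the variable names are mine):

```python
# The leak's daily CO2e output versus the nation-wide figure used above.
leak_per_day = 30_000        # metric tons CO2e per day (the leaking well)
national_per_day = 18_200_000  # metric tons CO2e per day (comparison figure)

factor = leak_per_day / national_per_day  # ~0.00165, a plain ratio
percent = factor * 100                    # ~0.16%, i.e. "about 0.2%", not "0.002%"

print(f"factor:  {factor:.5f}")
print(f"percent: {percent:.2f}%")
```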

mkaufmann | 10 years ago | on: Why Engineers Can’t Stop Los Angeles' Enormous Methane Leak

I think you are mixing the numbers up. I could not find a source that said anything close to the 1.7 million metric tons per day emitted by the leaking well that you used for your calculation.

The leak is emitting 1247 metric tons of CO2e per hour, which is 29,928 metric tons of CO2e per day. So it's 0.17% instead of 9% of nation-wide emissions.

mkaufmann | 10 years ago | on: Why Engineers Can’t Stop Los Angeles' Enormous Methane Leak

On the topic of what went wrong, I think this LA Weekly article[3] is a much better source. The main reason the well can't be shut down is that the safety valve was removed about 40 years ago: "He pointed out that the valve was old at that time and leaking. It also was not easy to find a new part, so the company opted not to replace it." Certainly a bad decision, and it should be checked whether regulations need to be changed to avoid similar problems in the future.

EDIT: The old version used the wrong conversion factors, now corrected.

The central number is the exhaust mass of 110,000 pounds of methane per hour. So how much is this really? It is about 50 metric tons of methane per hour. To compare it with other greenhouse gas emissions, we can calculate the CO2 equivalent by multiplying the pound figure by 0.01133[1], giving a rate of 1247 metric tons CO2e per hour.

Using the EPA online tool[2] we can relate this to the total emissions of California or the US. The total emission of methane measured in CO2e for California in 2014 was 9,546,270 metric tons CO2e. Converted to a rate per hour, this gives 1089 metric tons CO2e per hour.

So while the well is leaking, it is releasing 114% of California's normal methane emissions.

Compared to all greenhouse gas emissions, the well is causing an increase of 10% in California and 0.3% at the US national level, compared to the emissions from large facilities.
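The whole conversion chain can be checked in a few lines (a sketch using the numbers above; note that the 0.01133 factor[1] is applied to the pound figure, since it folds in both the pound-to-metric-ton conversion and methane's global warming potential):

```python
# Re-running the calculation chain from the comment above.
lb_per_hour = 110_000            # methane leak rate, pounds per hour
co2e_per_lb = 0.01133            # metric tons CO2e per pound of methane (EPA factor [1])

leak_co2e_per_hour = lb_per_hour * co2e_per_lb        # ~1246 t CO2e/h (the ~1247 above)

ca_methane_per_year = 9_546_270                       # t CO2e, California methane, 2014 [2]
ca_methane_per_hour = ca_methane_per_year / (365 * 24)  # ~1090 t CO2e/h

print(f"leak:       {leak_co2e_per_hour:.0f} t CO2e/h")
print(f"California: {ca_methane_per_hour:.0f} t CO2e/h")
print(f"ratio:      {leak_co2e_per_hour / ca_methane_per_hour:.0%}")  # ~114%
```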

[1] http://www3.epa.gov/gasstar/tools/calculations.html

[2] http://ghgdata.epa.gov/ghgp/main.do

[3] http://www.laweekly.com/news/what-went-wrong-at-porter-ranch...

mkaufmann | 10 years ago | on: NoDB: Efficient Query Execution on Raw Data Files

It does not really compare; I guess that's also why it's not mentioned.

The whole premise of the paper is that data can be analyzed in situ. That means loading it from its original place in its original format, without any prior transformations. This is in contrast to the traditional approach of database systems, where the data first has to be loaded into the database.

This paper describes how unprocessed, unindexed data can be used efficiently to answer queries with a database system. The novelty of the approach lies mostly in building an index on the fly that can be reused later, and in examining the idea of directly using the raw files. As for the efficient loading of CSV files, which is also mentioned in the NoDB paper, I think those details are better described in a later paper from TU München[1] that examines this aspect in more detail.
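To illustrate the on-the-fly index idea, here is a toy sketch of a "positional map" over a raw CSV file (my simplification for illustration, not the paper's actual implementation): the first scan remembers the byte offset of every row, so later queries can seek straight to the rows they need instead of re-parsing the whole file.

```python
import io

# A raw CSV "file" queried in situ, without loading it into a database first.
raw = io.BytesIO(b"id,city\n1,Berlin\n2,Munich\n3,Zurich\n")

positions = []                 # positional map: byte offset of each data row
raw.readline()                 # skip the header line
while True:
    offset = raw.tell()
    line = raw.readline()
    if not line:
        break
    positions.append(offset)

# A later query for row 2 seeks directly to it instead of scanning:
raw.seek(positions[2])
print(raw.readline().decode().strip())   # -> "3,Zurich"
```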

HDT for RDF, or Parquet[2] and ORCFiles[3] in the Big Data space, are (binary) formats where the data has already been processed and stored in a more efficient form than plain old text CSV files. Creating these files is already comparable to loading the data into a database; the only difference is that the storage format is open and can be used by many systems. So it's a completely different setting.

Still, it's an interesting thought to make databases aware of the indexed information in those file formats besides CSV, so that they can also be used directly without loading.

[1] http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf

[2] https://github.com/Parquet/parquet-format

[3] https://cwiki.apache.org/confluence/display/Hive/LanguageMan...

mkaufmann | 10 years ago | on: 15 Biggest Ships Create More Pollution Than All Cars in the World (2013)

The article lacks a description of which specific form of pollution is compared.

When talking about pollutants it is important to keep the following points in mind:

- Does the pollutant affect health?

- Does the pollutant affect climate change?

- Is the pollution locally concentrated or widely distributed?

Cargo ships typically have very high emissions of nitrogen oxides (NOx) and sulphur dioxide. When emitted by cars or factories on the mainland, these strongly contribute to harmful smog, especially in megacities or cities with poor ventilation. They can also be generally harmful to ecosystems, including on the water, by causing acid rain etc.

The main pollutants from cargo ships have a very strong short-term effect but are often out of the air within a few weeks. Because of that, they don't rise into the atmosphere and don't directly contribute to long-term climate change. Thus I think the comparison in the article is very dangerous. When considering pollutants that affect climate change, cars are more dominant.

So depending on which effects are discussed, reducing pollution from cars can still be beneficial. As an additional thought: the pollution from cargo ships is spread out over a very large geographic area, while the exhaust of cars is much more concentrated around cities. So when considering one's own quality of living, cars have a much bigger impact.

I think an article that manages to discuss the subject better is this one from The Guardian: http://www.theguardian.com/environment/2009/apr/09/shipping-...

Excerpts:

- Shipping is responsible for 18-30% of all the world's nitrogen oxide (NOx) pollution and 9% of the global sulphur oxide (SOx) pollution.

- Shipping is responsible for 3.5% to 4% of all climate change emissions

mkaufmann | 10 years ago | on: High Quality Video Encoding at Scale

For x264 that is true; HEVC, which is also mentioned, is much slower. For a 4k source, transcoding can take more than a second per frame. For a normal movie, this quickly results in encoding times of more than a day.

Another problem is that you have to encode the movie for each codec profile times the number of different bitrates per profile. The article mentions four profiles (VC1, H.264/AVC Baseline, H.264/AVC Main and HEVC) and bitrates ranging from 100 kbps to 16 Mbps. Assuming 20 different bitrates per profile, you already get 4*20 = 80 encoded copies per source. But of course this can be solved with parallelism.
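For a rough sense of scale, here is the back-of-the-envelope version (the 20 bitrates per profile, a 2-hour movie at 24 fps, and ~1 s/frame for a 4k HEVC encode are my assumptions, not figures from the article):

```python
# The encoding matrix: one encode per (profile, bitrate) pair.
profiles = 4                    # VC1, AVC Baseline, AVC Main, HEVC
bitrates_per_profile = 20       # assumed

copies = profiles * bitrates_per_profile
print(f"{copies} encoded copies per source")           # -> 80

# A single 4k HEVC encode at ~1 second per frame:
frames = 2 * 60 * 60 * 24       # 2 h movie at 24 fps = 172,800 frames
hours_per_encode = frames * 1.0 / 3600
print(f"~{hours_per_encode:.0f} h per HEVC encode")    # -> ~48 h, i.e. about two days
```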

mkaufmann | 10 years ago | on: Does Google crawl dynamic content?

I just tried this with the page. The page is indexed: when I search for the term "Update this was posted to Google on Friday the 17th of July, 2015. Monday, the 20th", the page is shown.

Trying to find any of the other search strings from the article for the different loading variants does not return any results. So no variant of JavaScript-injected content is currently working on Bing.

mkaufmann | 11 years ago

I like to use -Wpedantic for my own projects to keep the code clean, and it has served me well. It's not included in -Wall or -Wextra.

mkaufmann | 11 years ago | on: Show HN: ACM SIGMOD Programming Contest – Optimistic Concurrency Control

I would be interested in why the Go implementation is so much slower than the Java implementation. The algorithms behind both implementations seem to be exactly the same, and both are garbage-collected languages. There probably is a silly performance mistake somewhere. I tested with the small dataset from the task page.

mkaufmann | 11 years ago | on: How to get into an admin account on a Windows computer

Yes, because the recovery menu won't be accessible then (which is needed to replace the sticky-keys executable).

The settings can be changed with bcdedit:

    bcdedit /set {default} recoveryenabled No 

    bcdedit /set {default} bootstatuspolicy ignoreallfailures

Additionally, booting from USB etc. should be disabled in the BIOS/UEFI options, and access to those settings should also be password-protected.

Furthermore, because the person has physical access, the computer should be locked away so that the hard drive can't be accessed. All cables should also be secured so that no sniffer can be plugged in between. This especially includes the USB ports on the monitor, if those are enabled.

mkaufmann | 11 years ago | on: Germany plans highway test track for self-driving cars

Driving on a German highway with autonomous cars was already done as early as 1996 by Mercedes[1]. But just as with the Google cars, they used very special hardware that would be too expensive for production use. The current challenge is to make autonomous driving happen with fewer and cheaper sensors.

Also, having a real highway allows testing scenarios like a street partially blocked by construction, testing at night, with rain, with snow, etc. Google's cars currently do not work under such conditions, so there are still many problems to solve.

[1] http://wwwlehre.dhbw-stuttgart.de/~reichard/content/person/d... (sorry, no English source)

mkaufmann | 11 years ago | on: Columnarization in Rust

The changes make all the difference. Your argument was that columnarization, and thus storing each native type of a data record in separate columns, brings a performance benefit. By removing the specialisation of Pair and Option, which store the data in separate columns, I switched back to a classical layout that stores the data of one record in one place, like a row store. So your code simulates a column store and mine a row store.

Using your original benchmark, I then show that the row layout brings a large performance win. My numbers show this win not just in throughput (bytes per sec.) but also in "goodput" (values per sec.); check my previous comment. I just noticed, though, that I sometimes forgot the k in the reported numbers; where it is missing you have to multiply the number by 1000. I can't edit it anymore.

I guess we can agree to disagree and should continue the discussion in another form ;)

PS: I just updated my github information, I am now at the TU Munich

mkaufmann | 11 years ago | on: Columnarization in Rust

Addressing your points:

1. In my comment I acknowledged that there are space savings for types that include padding, so I don't understand why your answer implies that I didn't understand this point. Regarding my comment about not copying data: it was based on the benchmark[1] linked in the blog post. The parts you measure do not copy any of the user data! They only convert your internal vectors of user types to a list of vectors of u8. So what you are doing is essentially moving the data. But when using move semantics, the size of the user data does not matter any more, so there won't be a difference between your column layout and using the more complex types directly inside the vectors, if measured the same way as in this benchmark. In my opinion your benchmark is flawed and does not support the argument you make in your blog post.

2. OK, yes, that was bad wording on my side. If you provide adaptors for complex types with pointers, like Vec, then you can of course also serialize those.

3. I guess this is just about the same argument as point 2) and thus redundant.

I took the effort to write a serialization framework that does not do "columnarization" but simulates a row layout, added an adaptor for Vec, and ran it with the same benchmark; I also reran the original columnar benchmark on my machine. Both benchmarks were compiled exactly the same way. You can find my code here[2]. Here are the results:

==============================

columnarization <uint> 12.1 GB/s, 742k values/s

columnarization <(uint, (uint, uint))> 5.3 GB/s, 107k values/s

columnarization <vec<uint>> 2.1 GB/s, 54k values/s

columnarization <Option<uint>> 1.58 GB/s, 163k values/s

==============================

row (simple_serialize) <uint> 12.1 GB/s, 730k values/s

row (simple_serialize) <(uint, (uint, uint))> 8.83 GB/s, 164 values/s

row (simple_serialize) <vec<uint>> 2.13 GB/s, 56 values/s

row (simple_serialize) <Option<uint>> 8.82 GB/s, 263k values/s

==============================

You can see that columnarization does not have a performance benefit in your benchmark; it is even significantly slower for the Option<uint> type and for pairs.

In your blog post you never mention that columnarization only has the potential to bring performance benefits to de-/serialization when using types with large padding overheads. I think this discussion would probably have helped the blog post. It would be much better if it either omitted the currently wrong performance argument and just focused on the nicely typed API, or used a proper benchmark that supports your argument, which from my understanding is only possible in a very limited set of use cases.

Here is a list of other valid arguments you could have made instead:

* Format saves space at the cost of performance.

* Better than repr(packed) as it will also work on platforms that don't support unaligned access

[1] https://github.com/frankmcsherry/columnar/blob/master/exampl...

[2] https://github.com/mkaufmann/simple_serialize

mkaufmann | 11 years ago | on: Columnarization in Rust

While I like the construction of the column store and the corresponding API, the claims of the author don't really make sense:

> "... columnarization, a technique from the database community for laying out structured records in a format that is more convenient for serialization than the records themselves."

Column stores, in comparison to row stores, don't offer any serialization benefit per se. The main benefits are the following; I will use a record (A,B,C,D,E) as an example, with all fields of type u32 (4 bytes):

* If you only use some fields, you have to load less data from memory/disk into the CPU cache, and your working set is more likely to fit into cache. For example, when filtering only the records where A=22 and B=45, for x records you only have to load x*(sizeof(A)+sizeof(B)) = x*8 bytes instead of x*record_size = x*20 bytes. This can make a very significant difference.

* When using compression to reduce the size of the data, columns can often be compressed better, because they only contain data of the same type and nature and thus probably share similarities. With such a small record consisting only of integers it probably won't make a difference. But if e.g. some fields are country abbreviations, textual descriptions, or ids, one can easily imagine that there are gains.

Coming back to the point about serialization: using the technique described in the blog post, there won't[1] be a performance difference between column storage and row storage (e.g. using a struct). The method just lets the data array of the original vector be wrapped in a Vec<u8> without even moving the memory, so it is independent of the data type stored in the vectors. Of course it will only work for data types that do not contain references, otherwise we could get illegal memory accesses after deserialization (which should be guaranteed by the Rust type system, because only Copy types are allowed).

The only thing this benchmark is testing is how fast a vector can be initialized.

[1] There can be a space improvement from keeping the data in a column layout compared to a row layout when using normal structs. Structs normally align their total size to the alignment of the largest field. A struct containing an i64 and an i8 would contain 7 bytes of padding. In a column layout this overhead is avoided. Still, there would not be an improvement in this serialization scheme, as it does not actually copy any data.
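The padding overhead is easy to demonstrate (a sketch using Python's ctypes; exact sizes depend on the platform ABI, but the values below hold on typical 64-bit systems):

```python
import ctypes

# A row-layout struct with an i64 and an i8 field, as in the footnote above.
class Row(ctypes.Structure):
    _fields_ = [("big", ctypes.c_int64),    # 8 bytes
                ("small", ctypes.c_int8)]   # 1 byte + 7 bytes trailing padding

# The struct's size is rounded up to the alignment of its largest field:
print(ctypes.sizeof(Row))                   # 16, not 9

# Stored as two separate columns, the same data needs only 9 bytes per row:
print(ctypes.sizeof(ctypes.c_int64) + ctypes.sizeof(ctypes.c_int8))  # 9
```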

mkaufmann | 11 years ago | on: How shaving 0.001s from a function saved $400/mo on Amazon EC2

Just for fun I tried to estimate the performance characteristics of this service.

My initial assumption is that the 2.2M pages per ~18h are the main workload. This is also supported by the chart at the bottom; outside of the 18h timespan there is hardly any base load. The blog additionally gives the following facts: 18 c1.medium instances and ~60% utilization after the optimization (taken from the chart).

Now this allows us to calculate the time per page. First, the total workload per day is num_machines * cpu_time_per_machine = 18 * (18h * 0.6) = ~194h of processing per day.

At the page level, this is then 194h / 2.2M = ~317ms per page.

This feels really slow, and it should even be multiplied by two to get the time per CPU core (the machines have two cores)! I would guess that the underlying architecture is probably either node.js or Ruby. Based on these performance characteristics, the minimum cost for this kind of analysis is $25 per day. For customers this means that on average the value per 1k analyzed pages should be at least $1.13. I think this is only possible with very selective and targeted scraping, given that this only includes extracting raw text/fragments from the web pages and does not include further processing.
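The estimate can be reproduced in a few lines (inputs taken from the comment above; "CPU-hours" here means machine-hours of busy time, before the per-core doubling):

```python
# Back-of-the-envelope throughput estimate for the scraping service.
machines = 18            # c1.medium instances
hours_per_day = 18       # length of the daily workload window
utilization = 0.6        # ~60% utilization after the optimization
pages = 2_200_000        # pages analyzed per day

cpu_hours = machines * hours_per_day * utilization     # ~194 h of processing per day
ms_per_page = cpu_hours * 3600 * 1000 / pages          # ~318 ms per page

print(f"{cpu_hours:.0f} CPU-hours/day, ~{ms_per_page:.0f} ms per page")
```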

mkaufmann | 11 years ago | on: Higgs JavaScript Virtual Machine

Concerning your critique of the, in your opinion, overly harsh scientific review process: according to her blog, the paper was rejected because "Reviewers at conferences [...] have been very skeptical, and pressed us multiple times to produce some kind of comparison of basic block versioning against tracing."[1] I think this is a very valid concern. When you describe a new scientific approach (Basic Block Versioning), you should compare it to the state of the art; otherwise it is very hard for the reader to judge the merit of the new approach. However, I agree with your sentiment that a new approach should not be judged by performance numbers alone (there is actually a nice article going deeper into this topic on Database Architects[2]). But there should at least be a more thorough theoretical discussion than only 3-4 sentences. Benchmarks comparing against existing approaches can help to show pathological cases, which might indicate weaknesses of the approach, or to empirically demonstrate its feasibility.

[1] http://pointersgonewild.wordpress.com/2014/11/14/the-fastest...

[2] http://databasearchitects.blogspot.de/2014/09/experiments-hu...

EDIT: Many conferences also allow publishing papers without a deep comparison to existing research in an industrial session. This allows demonstrating interesting implementation variants or system choices.

EDIT 2: The review criticism in older blog posts, "Conference reviewers criticized us for not discussing compilation times, and raised the issue that perhaps basic block versioning could drastically increase compilation times.", is also very valid for a JIT compiler. Again, discussion does not mean that you have to be faster than all existing systems. Paper acceptance is always a bit of a random process, but at least in this case the review comments are valid from my point of view, and her PhD advisor should probably have caught these problems in proofreading before submission. I really hope that the paper will finally be accepted!
