hundredwatt's comments

hundredwatt | 8 months ago | on: Extending That XOR Trick to Billions of Rows

> To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode

The false decodes can be detected. During peeling, deleting a false decode inserts a new element with the opposite sign of count. Later, you decode this second false element and end up with the same element in both the A / B and B / A result sets (as long as decode completes without encountering a cycle).

So, after decode, check for any elements present in both A / B and B / A result sets and remove them.
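For concreteness, here's a minimal IBF sketch with that post-decode cleanup step. The hash construction, checksum width, and class shape are illustrative assumptions, not code from the article:

```python
import hashlib

K = 3  # index hash functions per element

def _idx(x, i, m):
    # i-th cell index for element x over m cells (illustrative construction)
    return int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % m

def _chk(x):
    # per-element checksum, xor-ed into the cell alongside the id
    return int(hashlib.sha256(f"chk:{x}".encode()).hexdigest(), 16) & 0xFFFFFFFF

class IBF:
    def __init__(self, m):
        self.m = m
        self.count = [0] * m
        self.id_sum = [0] * m
        self.chk_sum = [0] * m

    def _update(self, x, sign):
        for i in range(K):
            j = _idx(x, i, self.m)
            self.count[j] += sign
            self.id_sum[j] ^= x
            self.chk_sum[j] ^= _chk(x)

    def insert(self, x):  # element from set A
        self._update(x, +1)

    def delete(self, x):  # element from set B
        self._update(x, -1)

    def decode(self):
        a_minus_b, b_minus_a = set(), set()
        while True:
            # a "pure" cell holds exactly one element and a matching checksum
            j = next((j for j in range(self.m)
                      if self.count[j] in (1, -1)
                      and self.chk_sum[j] == _chk(self.id_sum[j])), None)
            if j is None:
                break
            x, sign = self.id_sum[j], self.count[j]
            (a_minus_b if sign == 1 else b_minus_a).add(x)
            self._update(x, -sign)  # peel x out of its other cells
        # the cleanup step described above: a false decode eventually
        # reappears with the opposite count sign, so it lands in both
        # result sets; the intersection is exactly the false decodes
        false_decodes = a_minus_b & b_minus_a
        return a_minus_b - false_decodes, b_minus_a - false_decodes
```

The cleanup is cheap because both result sets are already materialized by the time peeling finishes.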

--

Beyond that, you can also use the cell position for additional checksum bits in the decode process, without increasing the data structure's bit size: if we attempt to decode an element x from the cell at position m, then one of the index hash functions h_i(x) should return m.
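As a sketch, that consistency check is just (the function and parameter names here are hypothetical):

```python
def position_is_consistent(x, m, index_hashes, num_cells):
    # Accept a decode of element x out of cell m only if at least one of
    # the index hash functions actually maps x to m; otherwise the
    # seemingly "pure" cell is a false decode.
    return any(h(x) % num_cells == m for h in index_hashes)
```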

There's even a paper about a variant of IBFs that has no checksum field at all: https://arxiv.org/abs/2211.03683. It uses the cell position among other techniques.

hundredwatt | 8 months ago | on: Extending That XOR Trick to Billions of Rows

The graph constructed by using bloom filter-style hash functions supports a decoding process called "peeling" where you:

1. Find a batch with 1 missing element

2. Delete that element from its other assigned partitions

3. Repeat, as the modified batches may now be recoverable

This iterative process (surprisingly!) succeeds with very high probability as long as the number of partitions is at least 1.22x the number of missing elements, with k=3 hash functions.
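The three steps above can be sketched as a peeling loop over partitions, independent of the cell layout (names and the input representation are hypothetical):

```python
from collections import defaultdict

def peel(assignments):
    # assignments: element -> list of the k partition indices it hashes to.
    # Returns the set of elements recovered by iterative peeling.
    members = defaultdict(set)
    for x, parts in assignments.items():
        for p in parts:
            members[p].add(x)
    recovered = set()
    queue = [p for p, s in members.items() if len(s) == 1]  # step 1
    while queue:
        p = queue.pop()
        if len(members[p]) != 1:
            continue
        (x,) = members[p]
        recovered.add(x)
        for q in assignments[x]:        # step 2: delete x everywhere
            members[q].discard(x)
            if len(members[q]) == 1:    # step 3: newly recoverable batch
                queue.append(q)
    return recovered
```

If the loop ends with elements still unrecovered (a cycle in the underlying hypergraph), the failure is detectable and you can retry with different hashes or more partitions.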

hundredwatt | 8 months ago | on: Extending That XOR Trick to Billions of Rows

You don't lose absolute guarantees, but the probabilistic nature means the process may fail (in a guaranteed detectable way) in which case you can try again with a larger parameter.

The "bloom filter" name is misleading in this regard.

hundredwatt | 9 months ago | on: That XOR Trick (2020)

A neat trick to make the accumulator both collision-resistant and self-diagnosing.

  For every normalized link id x:
      y = (x << k) | h(x)   # append a k-bit hash to the id
      acc ^= y
If acc is zero, all links are reciprocal (same guarantee as before).

If acc is non-zero, split it back into (x', h'):

* Re-compute h(x').

* If it equals h', exactly one link is unpaired and x' tells you which one (barring an astronomically unlikely collision). Otherwise, there are >= 2 problems.

This has collision-resistance like the parent comment and adds the ability to pinpoint a single offending link without a second pass or a hash table.
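A runnable version of the scheme, with an illustrative 16-bit multiplicative hash standing in for h (the hash choice and bit width are assumptions):

```python
K = 16  # checksum bits appended to each id

def h(x):
    # k-bit hash of the id; a simple multiplicative hash as a stand-in
    return (x * 2654435761) % (1 << K)

def tag(x):
    # y = (x << k) | h(x) from the pseudocode above
    return (x << K) | h(x)

def diagnose(ids):
    # XOR the tagged ids; ids appearing an even number of times cancel.
    acc = 0
    for x in ids:
        acc ^= tag(x)
    if acc == 0:
        return "all paired"
    x, chk = acc >> K, acc & ((1 << K) - 1)
    if h(x) == chk:
        return f"single unpaired id: {x}"  # barring a 2^-k collision
    return "two or more problems"
```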

hundredwatt | 11 months ago | on: A faster way to copy SQLite databases between computers

The recently released sqlite_rsync utility uses a version of the rsync algorithm optimized to work on the internal structure of a SQLite database. It compares the internal data pages efficiently, then only syncs changed or missing pages.

Nice tricks in the article, but you can more easily use the built-in utility now :)

I blogged about how it works in detail here: https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...

hundredwatt | 1 year ago | on: Ask HN: What are you working on? (February 2025)

It’s meant to run in or alongside the pipeline continuously.

Not planning to open source, working on a commercial offering but haven’t launched anything publicly yet.

Would love to hear any more thoughts on the concepts here; my email is in my bio.

hundredwatt | 1 year ago | on: Ask HN: What are you working on? (February 2025)

I'm building a new tool for end-to-end data validation and reconciliation in ELT pipelines, especially for teams replicating data from relational databases (Postgres, MySQL, SQL Server, Oracle) to data warehouses or data lakes.

Most existing solutions only validate at the destination (dbt tests, Great Expectations), rely on aggregate comparisons (row counts, checksums), or generate too much noise (alert fatigue from observability tools). My tool:

* Validates every row and column directly between source and destination

* Handles live source changes without false positives

* Eliminates noise by distinguishing in-flight changes from real discrepancies

* Detects even the smallest data mismatches without relying on thresholds

* Performs efficiently with an IO-bound, bandwidth-efficient algorithm

If you're dealing with data integrity issues in ELT workflows, I'd love to hear about your challenges!

hundredwatt | 6 years ago | on: Ask HN: What does your BI stack look like?

+1 for Metabase

For our team, using an ELT architecture (as opposed to ETL) [1] for managing our data warehouse has greatly reduced the complexity of our data processes. Instead of creating ETLs for every table we want to load into the data warehouse, we create the minimum necessary setup to copy the table into our data warehouse. Then, we write transforms, which are simply SQL statements, to generate wide-column tables that our non-technical users can use to explore data without worrying about joins or having to learn esoteric naming conventions.

Custom EL Scripts -> Redshift -> Transform Statements -> Redshift -> Metabase supports the data needs of all our departments with no dedicated data team members.

[1] https://www.dataliftoff.com/elt-with-amazon-redshift-an-over...

hundredwatt | 9 years ago | on: Ask HN: The habit adopted in 2016 that had the greatest impact on your health?

I had the same problem. In addition to regular exercise, try doing these stretches once or twice a day: https://www.youtube.com/watch?v=FdNS95hpL-o (they are fun too! Perhaps get a group together at your office to stretch once a day)

These exercises work to move your body back toward perfect posture, undoing the damage caused by sitting, typing, etc.

I stopped using these in November due to travel. My back and shoulder pain returned. It took about 7 days of consistent stretching for the pain to go away again.

hundredwatt | 14 years ago | on: Hacked: commit to rails master on GitHub

I threw together a quick 'n dirty Rails generator that will generate the code for white/black listing all model attributes with attr_accessible/attr_protected.

Here's the file: https://gist.github.com/1975167, just add to lib/generators in your Rails 3 app, then do rails g mass_assignment_security -h

Hopefully others find this helpful

hundredwatt | 14 years ago | on: Ask HN: Freelancer? Seeking freelancer? (December 2011)

SEEKING FREELANCER - Remote - GaggleAMP.com

GaggleAMP is hiring part-time software developers and UX designers to help us extend our social amplification platform. On the frontend, we use jQuery and HTML5/CSS3 via HAML templates. Our web application's backend stack is Ruby on Rails 3 with MySQL and Redis.

We'll consider hackers with any experience level, intern and up. If interested, send an email with a brief bio and one or more links to past work to jason AT gaggleamp DOT com.

hundredwatt | 14 years ago | on: Ask HN: Who is Hiring? (December 2011)

Remote - GaggleAMP.com

GaggleAMP is hiring part-time software developers and UX designers to help us extend our social amplification platform. On the frontend, we use jQuery and HTML5/CSS3 via HAML templates. Our web application's backend stack is Ruby on Rails 3 with MySQL and Redis.

We'll consider hackers with any experience level, intern and up. If interested, send an email with a brief bio and one or more links to past work to jason AT gaggleamp DOT com.

hundredwatt | 14 years ago | on: Ask HN: Freelancer? Seeking freelancers? (October 2011)

SEEKING FREELANCER web design/UX, remote

GaggleAMP is looking for freelance web designers to work on a per project basis.

Projects will range from adding dynamic elements to landing pages to creating the UI for new application features.

Ability to code HTML/CSS and JavaScript is a huge plus.

Send portfolio to jason at gaggleamp dot com if interested

hundredwatt | 14 years ago | on: Ask HN: Should I work for a startup who's sole purpose is to be flipped?

If it is a smart team working on an interesting/challenging problem and you think you would enjoy it, then the fact that they want to flip the business should only be a secondary consideration.

Just remember to set your expectations correctly about cash/equity incentives: http://www.bothsidesofthetable.com/2010/09/06/how-to-discuss..., http://www.bothsidesofthetable.com/2009/11/04/is-it-time-for...

hundredwatt | 14 years ago | on: One of Google’s Self-Driving Cars Gets into an Accident

<i>"While Jalopnik believes that the fender-bender is proof that self-driving cars may not be in the best interests of society, we have a different take. There were 10.2 million traffic accidents in 2008, which results in 39,000 deaths. That’s 17.9 people per 100,000 licensed drivers. If Google self-driving cars can beat those statistics, they could actually prevent more accidents than they create. We also waste millions of hours commuting and driving through traffic; imagine if you had that time to be productive instead."</i>

Also, computers can't get drunk...

I'd love to have these brought to the masses. But no matter what the statistical benefits are (safety and time savings), I can't picture a time in the near future when ordinary people will accept automated vehicles. As this incident nearly showed, one death caused by a self-driving vehicle will overshadow hundreds of thousands of deaths from human-driven vehicles. Even once proved feasible, I can't see something like this being adopted in a matter of years, or even decades.

Anyone know of any historical examples of a similarly disruptive, but untrusted technology radically changing people's lives? How long did it take for adoption?

hundredwatt | 14 years ago | on: How our SaaS startup got 1000+ signups in just 7 days, without getting Crunched.

Don't think of the $15 theme as permanently defining the look and feel of your site. It just needs to be good enough, when you're starting out, to make you look legitimate. If your startup ends up generating any significant amount of revenue, you'll be able to go back and revisit the design after a while.

If you're a programmer and not a designer, just think of your design as another optimization point.

Whenever I worry about the design/branding of my current startup, I visit the historical timeline of amazon's logo as a reminder that we'll be able to make it better eventually, but there's more important things to work on now: http://www.kokogiak.com/gedankengang/2004/07/amazoncom-logo-...
