
Parse, Don’t Validate

642 points | undreren | 6 years ago | lexi-lambda.github.io

230 comments

[+] Izkata|6 years ago|reply
This is describing what I've known as "Typed Hungarian notation", and have seen a few times before, though I can't seem to find now.

The original intent of Hungarian notation was to encode metadata into the variable name - for example, "validData = validate(rawData); showValidData(validData)". The idea was that the prefixes would make it just look wrong to the programmer if someone accidentally used mismatched names, such as "showValidData(rawData)", indicating a likely bug.

This notation mutated into the far more simplistic and arguably less useful "prefix with the datatype", such as iFoo indicating Foo is an integer. This mutation became known as Systems Hungarian, while the original became Apps Hungarian.

The suggestion in this post is taking Apps Hungarian, and instead of relying on variable names, encoding it into the type system itself.

The first time I recall seeing something like this suggested was actually in Java, with examples in that blog post involving class attributes:

  public String Name;
  public String Address;
And creating classes to represent these values, preventing them from being used the wrong way:

  public NameType Name;
  public AddressType Address;
...which is why I remember this as "Typed Hungarian" - making the language's type system handle the work that a person would normally have to think about in Apps Hungarian.
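The same "Typed Hungarian" idea can be sketched in TypeScript using branded types. This is an illustrative sketch (the names and brands are made up for the example), but it shows how the mixup that Apps Hungarian only flags visually becomes a compile-time error:

```typescript
// The brand exists only at the type level, so there is no runtime
// overhead, but mixing up a Name and an Address is now a type error.
type Name = string & { readonly __brand: "Name" };
type Address = string & { readonly __brand: "Address" };

const mkName = (s: string): Name => s as Name;
const mkAddress = (s: string): Address => s as Address;

function showName(name: Name): string {
  return `Name: ${name}`;
}

const name = mkName("Ada Lovelace");
const address = mkAddress("12 Analytical Ln");

console.log(showName(name)); // fine
// showName(address);        // compile-time error: Address is not a Name
```

At runtime these are still plain strings; the compiler is doing the job the prefixes used to do.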
[+] radicalbyte|6 years ago|reply
The Java example you've provided is known as Domain Modeling; it's well described (and honestly greatly expanded on) in Domain-Driven Design by Eric Evans.

It's applied heavily in the enterprise world simply because it's so powerful. In general as the domain logic becomes more complex the benefits of doing this increase.

Actually, not encountering this style in a code-base which is solving a complex problem is a massive warning sign for me. It usually means that concepts are poorly defined and that the logic is scattered randomly all over the code-base.

Apps Hungarian is just a logical style: name functions, variables, types so that they're meaningful within your domain. The result of which is code which is very easy to understand for someone who understands the domain. This doesn't mean long names - if you're doing anything math intensive then using the short names which conform to the norms of the field is perfect. For a business process it probably isn't :)

[+] mason55|6 years ago|reply
Another good example of this is having separate classes for something like unsafe strings vs. safe strings in a web app. The functions which interact with the outside world accept unsafe strings and emit safe strings to the rest of the application. Then the rest of the application only works with safe strings.

Anything that accepts a safe string can make an assumption that it doesn't need to do any validation (or "parsing" in the context of the OP), which lets you centralize validation logic. And since you can't turn an unsafe string into a safe string without sending it through the validator, it prevents unsafe strings from leaking into the rest of the app by accident.

This concept can be used for pretty much anything where you are doing data validation or transformation.
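A minimal TypeScript sketch of this unsafe/safe split (the class, the toy sanitizer, and the names are illustrative, not from any real framework):

```typescript
// Raw input can only become a SafeString by going through the one
// sanitizer, so validation logic stays centralized in one place.
class SafeString {
  // Private constructor: the only way to obtain a SafeString is `sanitize`.
  private constructor(readonly value: string) {}

  static sanitize(raw: string): SafeString {
    // Toy sanitizer for the example: escape HTML metacharacters.
    const escaped = raw
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;");
    return new SafeString(escaped);
  }
}

// Internal code only accepts SafeString, so a raw string can't leak in.
function render(s: SafeString): string {
  return `<p>${s.value}</p>`;
}

const safe = SafeString.sanitize("<script>alert(1)</script>");
console.log(render(safe)); // <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Because the constructor is private, `render` never needs to re-validate: possessing a `SafeString` is the proof.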

[+] zanny|6 years ago|reply
Rust is extremely good at this with newtype syntax and the ability to apply impls and derives to masking types.

Your JSON data might be a string, but having a Json type that guarantees it's valid JSON is way better. Same with stringified Base64, a constrained numeric value for a tunable, etc. Using From impls on these types lets the compiler catch almost all invalid type usages at compile time and give you eloquent feedback on it.

[+] lmm|6 years ago|reply
Your comment gets the history hilariously backwards.

Actually using the type system has been standard practice in ML-family languages since at least the '70s. Simonyi described Hungarian notation as a way of, effectively, emulating a type system in less-powerful languages; describing Hungarian notation as "prefixing with the type" was not a mistake (as Spolsky claims) but an accurate description of how the technique was intended to be used.

The "Systems Hungarian" mistake came about because C programmers misunderstood the term "type" to refer to something far more limited than it does.

The technique in the article is not "encoding Apps Hungarian into the type system". It's doing the very thing that inspired ("Apps") Hungarian in the first place!

[+] darkkindness|6 years ago|reply
Yes, yes yes! Encode invariants in your data, don't enforce invariants on your data. I particularly think this point needs to be stressed more, because practically speaking, not every language has the typechecking faculties of Haskell:

> Sometimes, making an illegal state truly unrepresentable is just plain impractical given the tools Haskell provides, such as ensuring an integer is in a particular range. In that case, use an abstract newtype with a smart constructor to “fake” a parser from a validator.

For instance, few type systems will let you encode prime numbers as datatypes, so you'd have to do something in your language like

    newtype Prime = Prime Integer

    mkPrime :: Integer -> Maybe Prime
    mkPrime p | isPrime p = Just (Prime p)
              | otherwise = Nothing
which is parsing!
[+] schoen|6 years ago|reply
I learned a lot from this article and hope to revisit it.

I had one quarrel:

> Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output. By its very nature, a parser is a partial function—some values in the domain do not correspond to any value in the range—so all parsers must have some notion of failure.

I think it's reasonable to include some total functions as parsers as well, because some grammars -- structures that are recognized by the parser and that specify languages -- generate all strings as the corresponding language. In that case, there are no failure cases because every string is a valid input.

I came up with some more-trivial examples, but I think a less-trivial example is the split() function in many languages. Its Haskell type signature is String -> [String], and it is a total function. It handles the case where the input is a nonempty string with no delimiters (the output is a list containing a single token equal to the input), as well as the case where input is an empty string (the output is an empty list), and the case where the input consists only of delimiters (in many implementations, the output is a list containing n+1 empty strings).

Another trivial-ish case is where the grammar allows an arbitrary trailing string following the object of interest (similar to C's atoi). In that case, a parser will always return the empty object if there is no interesting or relevant data in the input. (For example, atoi returns 0 if the input string doesn't begin with an integer.) This could be viewed as a kind of error, but there may be applications in which this behavior is useful.

[+] matheusmoreira|6 years ago|reply
Language-theoretic security was the first thing that came to mind when I read the title. Was pleasantly surprised to see it referenced at the end of the article.

http://langsec.org

The idea is to formally specify the structure of inputs and reject invalid data instead of trying to process it.

[+] iddan|6 years ago|reply
Dear awesome Haskell writers: you are writing great articles, but I can't understand the code examples, as Haskell is too far from the programming languages I know. Can you provide examples in TypeScript / Rust / Swift / Reason or another C-like language? If not, I'm curious why? Love, Iddan
[+] QuinnWilton|6 years ago|reply
I really enjoyed this article!

I think that leveraging the type system to enforce invariants about your data model is one of the most effective things you can do as an engineer.

I gave a talk on this topic at my workplace, using Elixir and Haskell as examples. It's a little haphazard, since it wasn't written for public consumption, but someone might find it interesting: https://github.com/QuinnWilton/programming-with-types-talk

[+] verttii|6 years ago|reply
Really good stuff, is that talk uploaded anywhere?
[+] z3t4|6 years ago|reply
Not only was this the first article about Haskell that I could actually understand, it was also the first article where type annotations make sense: they actually help you think about edge cases, instead of just annoying you.

My personal experience is that type annotations (not to be confused with type systems) make the code harder to read, for very little benefit. I would like a type checker where, instead of you having to babysit the type checker, the type checker babysits you. Instead of the type checker crying over trivial issues and needing your comfort, the type checker should comfort me when I'm sad.

While manually writing human-friendly error messages helps, you still have to discover edge cases yourself. It would be nice to have a tool, not necessarily baked into the type system, that automatically detects edge cases, for example detecting when a variable has the possibility of becoming undefined/null. Maybe achieved via fuzzing.

A colleague once asked me "How do I get rid of the damn nulls?"

Nulls are basically a lazy error/exception. Error handling is so laborious we often "forget" to do it, and this is very confusing for beginners. There are basically two camps: one that thinks errors are exceptions, and one that thinks errors should be first-class citizens. But there are different domains, of course. I don't want my CPU to maybe return an error message when I ask it to multiply two numbers; I want that to always work. I don't care if the result gets shifted (that would be a job for the type checker to detect, or better: automatically use mult64 instead of mult32), because handling errors at that level of detail would be too annoying and bad for performance. When writing network services, however, it is crucial that I get notified if a request failed.

[+] tejon|6 years ago|reply
For what it's worth: it's _very_ rare for a type annotation to be required in Haskell. It's just considered best practice (supported by compiler warning options) to have them on all top-level declarations, for two reasons:

- it's a good extra layer of _human-readable_ documentation about basic behavior and requirements (as in this article); and

- it lets the compiler give better error messages.

The compiler is _always_ going to do its own inference, whether you give it a signature or not. If it infers a valid type _which isn't the one you thought you wrote_, and you didn't specify what the type _should_ be, that function/statement will compile -- you won't get an error until later, when you attempt to consume it and the types don't match. This can be harder to trace, especially if there's a bunch of polymorphic code in the middle where any type will pass. Supplying a manual annotation lets the compiler immediately say "hey, these don't match."

[+] hcarvalhoalves|6 years ago|reply
I would like to see an example of something less trivial than encoding `not empty` in the type. At some point the difference between what the author calls "parsing" and "validating" gets blurry, or the type system can't encode it.
[+] danharaj|6 years ago|reply
You can always use abstract data types to indicate in the type system that you know something that can't directly be expressed. The point is not that "parsing" and "validating" are distinct concepts, but that they're two different points of view on the same problem.
[+] beders|6 years ago|reply
It's pretty cute. Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+" or "Strings that are proper nouns". ;)

I'm not sure I understand the benefits of this compared to a traditional parser generator.

"Use a data structure that makes illegal states unrepresentable." is not effort well spent. (And given Gödel's incompleteness theorems, unachievable in the general case.)

A proper parser generator using a grammar will give you the structural guarantees you need, a lexer will reject invalid terms. And a transformation step gives you the target structure. Eventually you will have to validate though at runtime.

[+] lexi-lambda|6 years ago|reply
> Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+"

Sure there are. :) Technically speaking, anyway. Here’s a type for “strings that end with a dot”:

  -- | A string that ends in the character '.',
  -- represented as a string without its trailing dot.
  newtype StringEndingInDot = StringEndingInDot String

  fromString :: String -> Maybe StringEndingInDot
  fromString str = case reverse str of
    '.':cs -> Just $ StringEndingInDot (reverse cs)
    _      -> Nothing

  toString :: StringEndingInDot -> String
  toString (StringEndingInDot str) = str ++ "."
And here’s a type for “strings that match [a-z]+”:

  data Letter = A | B | C | D | ... | X | Y | Z
  type StringAZ = NonEmpty Letter
Now, admittedly, I would never use either of these types in real code, which is probably your actual point. :) But that’s a situation where there is a pragmatic “escape hatch” of sorts, since you can create an abstract datatype to represent these sorts of things in the type system without having to genuinely prove them:

  module StringEndingInDot(StringEndingInDot, fromString, toString) where
    newtype StringEndingInDot = StringEndingInDot { toString :: String }

    fromString :: String -> Maybe StringEndingInDot
    fromString str = case reverse str of
      '.':_ -> Just $ StringEndingInDot str
      _     -> Nothing
You may rightly complain that’s a lot of boilerplate just to represent this one property, and it doesn’t compose. That’s where the “Ghosts of Departed Proofs” paper (https://kataskeue.com/gdp.pdf) cited in the conclusion can come in. It provides a technique like the above that’s a little more advanced, but brings back composition.
[+] lukebitts|6 years ago|reply
> It's pretty cute. Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+" or "Strings that are proper nouns". ;)

That’s not true, you can wrap these in types that can only be constructed from valid data, and from that point on you can be sure everything is correct.

[+] cinnamonheart|6 years ago|reply
Idris:

  data EndsWith : Type where
    EndsWithPeriod : (s: String) -> { auto prf : ("." `Strings.isSuffixOf` s = True) } -> EndsWith

  s : EndsWith
  s = EndsWithPeriod "."

  -- Value of type False = True cannot be found
  -- s2 : EndsWith
  -- s2 = EndsWithPeriod ""
The noun one might be possible -- I'm honestly less familiar with what a proper noun is offhand. (E: prefix=>suffix)
[+] papln|6 years ago|reply
> "Use a data structure that makes illegal states unrepresentable." is not effort spent well.

Am I to presume you do all your programming in assembly (or opcodes?), since you never need to distinguish integers from strings in your source code?

> Gödels incompleteness theorems

is never relevant in the real world.

In the general case, all real computers are broken.

[+] wvenable|6 years ago|reply
Ignoring all the convoluted type examples here, the simplest case is that you create the type that matches the parsed value. For example, you can create an EmailAddress class that can only contain valid email addresses. Instead of String, you use EmailAddress whenever you need to ensure the valid parsed value.
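A rough TypeScript sketch of that EmailAddress idea (the regex is deliberately simplistic and the names are illustrative):

```typescript
// EmailAddress can only be constructed through `parse`, so any value
// of this type is known to have passed the check exactly once.
class EmailAddress {
  private constructor(readonly value: string) {}

  // Toy pattern, just for illustration; real validation is subtler.
  static parse(s: string): EmailAddress | undefined {
    return /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(s)
      ? new EmailAddress(s)
      : undefined;
  }
}

function sendWelcomeMail(to: EmailAddress): string {
  // No re-validation needed here: the type carries the guarantee.
  return `sending to ${to.value}`;
}

const addr = EmailAddress.parse("ada@example.com");
if (addr !== undefined) {
  console.log(sendWelcomeMail(addr)); // sending to ada@example.com
}
console.log(EmailAddress.parse("not-an-email")); // undefined
```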
[+] Xophmeister|6 years ago|reply
Proper nouns are finite, albeit quite numerous, so these _could_ be enumerated as a sum type.
[+] magical_h4x|6 years ago|reply
Nice article! It got me thinking about an issue I've noticed in dynamically typed languages (namely JavaScript) where it's very easy to end up doing lots of (potentially redundant) validation all over the place because it's much more difficult / unwieldy to pass that information along.
[+] dudul|6 years ago|reply
Everybody notices this. That's why most dynamically typed languages, after reaching good adoption, try to shoehorn in type safety to fix it.
[+] ricardobeat|6 years ago|reply
In this case not even TypeScript can rescue you. You might naively implement:

    const head = (input: A[]) => input[0]
and it will happily infer a return type of A. Then you'll fall flat on your face the first time you encounter an empty array:

    head([]) // -> undefined
To make it correct you need to explicitly define a return type of (input: A[]) => A | undefined, just like the Maybe. It's obviously impossible for TS to guarantee that an array will not be empty at runtime, but I wish this specific case triggered some help from the type checker.
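The explicit annotation described above might look like this sketch, which forces strict-mode callers to handle the empty case:

```typescript
// Annotating the return type as A | undefined makes the partiality
// explicit instead of letting TS silently infer A.
const head = <A>(input: A[]): A | undefined => input[0];

const xs: number[] = [];
const first = head(xs);
// Under strictNullChecks, using `first` as a number without this
// guard is a compile-time error.
if (first !== undefined) {
  console.log(first + 1);
} else {
  console.log("empty array"); // this branch runs for xs = []
}
```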
[+] BurningFrog|6 years ago|reply
To me, unit tests replace almost everything I got from type safety. And that's just a side effect, even!
[+] mamcx|6 years ago|reply
Yep. For example you call:

    config.load_from_file(....
And you can't know whether it checks if the file exists first or not. From pure habit you add checks... (What? Looking at the code of the library? That's crazy!)
[+] undreren|6 years ago|reply
Yeah, the author mentions that too.

For me, the funny thing about this article is that I kind of had the core insight just yesterday while hacking together a cli framework, but like the author, I couldn't quite explain what I learned.

[+] mehrdadn|6 years ago|reply
Am I right to suspect the only reason "there can’t be any performance overhead" is that this is being done in a lazy language like Haskell? Meaning the statement won't hold in >99% of practical cases? Or did I misunderstand something?
[+] spuz|6 years ago|reply
No, this doesn't have anything to do with Haskell's lazy evaluation (at least not in the NonEmpty list example presented). The general idea is that if you are going to perform validation checks at some point in your dynamic language, you won't lose any performance performing those checks up-front in your static language.
[+] larusso|6 years ago|reply
I understand the concept of wrapping values in specific types which gives you certain guarantees at compile time. And I really like this concept and will play around with it some more, because the empty-something issue is something I struggle with myself. But what really irks me is the usage of throw in the exceptional case. My go-to type in these situations would be Either rather than a throw. But this would create nearly the same issues at the call site as a Maybe would. Now one could argue that this is an exceptional case the user or program can't possibly handle. So how do you handle this then? My main usage is in Rust, and here the panic! macro seems to be used as often as unsafe raw pointers :)
[+] giomasce|6 years ago|reply
Using this programming style requires a rather powerful type system, if you want to go past the simple examples the author shows and encode more complex validity conditions. I am still learning these things, but AFAIU the extreme edition of all this is the Calculus of Constructions used in Coq and Lean, where you can encode theorems and algorithm specifications as very complicated type; proving a theorem corresponds to showing an element of the theorem itself (seen as a type), and if you find an element of a type encoding a certain specification, Coq can work out for you a program that implements that specification.

This is what I understood, at least. Things are quite complicated!

[+] stoops|6 years ago|reply
A pragmatic interpretation of this is to:

* subclass or compose String whenever possible. (Id, Name, Address, AddressLine1)

* subclass or compose Number/Matrix types whenever possible (e.g. Users, Packages, Widgets)

* Use immutable collections (e.g. Google Guava library in Java)

I have built very powerful software with small teams using these principles.

At scale, your day to day programming becomes easy as the compiler is doing all the error checking.

It is also very slow to program this way. You write A LOT of boilerplate, and there are only so many code-generation libraries you can throw at it before it becomes a pain to set up your IDE.

But it is worthwhile for complex applications. We did not have bugs.

[+] chrisdone|6 years ago|reply
Ghosts of Departed Proofs provides a fairly trivial way to generate proofs from arbitrary code and track them in the type system. See this article for example: https://ocharles.org.uk/blog/posts/2019-08-09-who-authorized...

This post shows simple "positive" assertions, without any combinatorial logic associated with the proofs. But this is similar to programming by contract.

[+] QuinnWilton|6 years ago|reply
> Using this programming style requires a rather powerful type system, if you want to go past the simple examples the author shows and encode more complex validity conditions.

You need a strong type system if you want to encode complex constraints at compile time, but you can still encode a surprising number of conditions in weaker type systems. And if you can't enforce a constraint you care about at compile time, capturing that constraint in an ADT using sum types and runtime assertions can still provide a lot of value to anyone reading or maintaining your code.

This sort of style requires a lot more diligence and verbosity than the same code in Haskell would, but I think dynamically typed programs can make a lot more use of their type system than they usually do.

At the very least, humans can try to pick up the slack where a compiler is insufficient, and while designing your data model in Ruby as if it were in Haskell may not help the interpreter at all, it can give the programmer a lot more context about how the code is intended to work, and what assumptions it is making about its data.

[+] wvenable|6 years ago|reply
I think it's a pragmatic style even with languages with much less powerful type systems. You might have to do more of the work yourself -- manually creating types, for example -- but the principle is good.

I think it's advice you don't have to go all in on -- you get benefits even if you use it just a little.

[+] jillesvangurp|6 years ago|reply
Kotlin's smart cast is nice in this context.

One example is with nullable types. If you have

    val foo: Whatever? // might be null

then Whatever and Whatever? are two separate types:

    foo.someMethod() // compilation error because foo is not of type Whatever

    if (foo != null) {
        foo.someMethod() // works because foo was smart cast to Whatever after the null check
    }

In the same way, doing a switch on a type smart casts the variable to that type. Smart casting of nullable types also gets rid of a lot of clumsy things like Optional, Maybe, etc. that you'd need in other languages, where you'd have to call a method to extract the value, assign it to a new variable, etc.

[+] z3t4|6 years ago|reply
What is the equivalent in TypeScript?

    const head = (arr: Array<number>): number => arr[0];

happily returns undefined if you pass an empty array (even with strict null checks).
[+] lexi-lambda|6 years ago|reply
TypeScript is sadly very unsound by design, so doing this kind of thing in TypeScript is always going to be more “best effort” and less “exhaustive proof.” That’s not necessarily wrong or bad per se, as after all, Haskell has escape hatches, too (and so do dependently-typed languages, for that matter). However, there are philosophical differences between the way the languages’ type systems are designed.

When I’ve talked to people who develop and use TypeScript, the general impression I’ve gotten from them is that TS is actually as much about tooling as it is about correctness. TS enables things like autocomplete, jump to definition, and quick access to documentation, and it does that by leveraging the typing information to “see through” the dynamic dispatch that makes that sort of thing infeasible in dynamically-typed JavaScript. The TS type system does provide some correctness benefits, don’t get me wrong, but where ease of use and correctness are in conflict, usually ease of use is preferred.

This is true even with all the “strictness” options enabled, like strict null checking. For some examples of TypeScript unsoundness, see [1] and [2]. Flow is actually generally a lot better than TS in this regard (though it does still have some major soundness holes), but it looks like TS has basically “won,” for better or for worse. But again: TS won because its tooling was better, not because its ability to actually typecheck JS programs was better, which I think reinforces what I said in the previous paragraph on what TS is really about.

[1] https://twitter.com/lexi_lambda/status/1068704405124452352

[2] https://twitter.com/lexi_lambda/status/1068705142109810688

[+] scrollaway|6 years ago|reply
This SO answer is pretty clever: https://stackoverflow.com/a/56006703

type NonEmptyArray<T> = [T, ...T[]];

That is: A type of "NonEmptyArray" of T is an array of one element of type T, followed by any number of type-T elements.
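A short sketch of that type in use: with the tuple form, `head` becomes a total function, and a small "parser" bridges from plain arrays into the refined type (helper names here are illustrative):

```typescript
type NonEmptyArray<T> = [T, ...T[]];

// head is total here: the type guarantees index 0 exists,
// so no undefined case is needed.
const head = <T>(input: NonEmptyArray<T>): T => input[0];

const xs: NonEmptyArray<number> = [1, 2, 3];
console.log(head(xs)); // 1

// head([]); // compile-time error: [] is not a NonEmptyArray<number>

// A "parser" from a plain array into the refined type: the emptiness
// check happens once, at the boundary.
function toNonEmpty<T>(arr: T[]): NonEmptyArray<T> | undefined {
  return arr.length > 0 ? (arr as NonEmptyArray<T>) : undefined;
}
```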

[+] demurgos|6 years ago|reply
I recently [0] updated my Control Flow Graph representation in a Typescript parser for AVM1 bytecode because I had the exact same issue as in the article. I previously represented my graph as an array of blocks. But a graph always has at least one block (the entry point). The old representation forced me to check for the empty case despite knowing that valid graphs always have at least one block.

Here is the old representation:

    export interface Cfg {
      blocks: CfgBlock[];
    }
And here is the new one:

    export interface Cfg {
      head: CfgBlock;
      tail: CfgBlock[];
    }
Switching to this new representation then allowed me to simplify the code manipulating this graph. This representation is also simple enough that most schema validators can represent it.

You still have to note that with this representation you no longer have a single array. It's not an issue in my case but your situation may vary. You may try the tuple-based solution mentioned in a sibling comment if you need to have a single array.

[0] https://github.com/open-flash/avm1-types/commit/ec85efab8c9e...
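A small sketch of the head/tail representation from this comment, together with the "parser" between the two forms (`parseCfg` and `toBlocks` are illustrative helper names, not from the linked repo):

```typescript
interface CfgBlock {
  id: number;
}

// A graph always has at least one block: the entry point.
interface Cfg {
  head: CfgBlock;
  tail: CfgBlock[];
}

// Parse a possibly-empty block array into the refined representation;
// the emptiness check happens exactly once, here.
function parseCfg(blocks: CfgBlock[]): Cfg | undefined {
  return blocks.length > 0
    ? { head: blocks[0], tail: blocks.slice(1) }
    : undefined;
}

// Recover a single array for the cases where one is needed.
function toBlocks(cfg: Cfg): CfgBlock[] {
  return [cfg.head, ...cfg.tail];
}

const cfg = parseCfg([{ id: 0 }, { id: 1 }]);
if (cfg !== undefined) {
  console.log(toBlocks(cfg).length); // 2
}
```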

[+] anasbarg|6 years ago|reply
When we started writing the code for Heavenly-x (https://heavenlyx.com), the first thing we needed to write before anything else was the definition of a data structure that represents the requirements as closely as possible (a concrete syntax tree).

We’re building a tool with a custom DSL for building CRUD GraphQL APIs, with auth and data validation and transformation. So our architecture consists of three phases:

- Parser

- Setup

- Runtime

There’s no way the parser would succeed if the input is not valid. We’re writing our software in Scala and we are using parboiled2 for parsing the DSL into a concrete syntax tree, so if it succeeds then it’s valid and if it fails, it fails early and we don’t have to worry about validation in later phases. We wrote some custom validation logic that traverses the concrete syntax tree after the parser to validate for some requirements that we couldn’t encode in the concrete syntax tree, but it’s really a small portion of our codebase and it’s easy to maintain.

At the Setup and the Runtime phase we assume that the concrete syntax tree is valid.

At the Runtime phase we have a running GraphQL server so we have a parsing phase too but it’s handled by Sangria so we don’t have to worry about it.

We are also building a UI for those who don’t like using our language. It’s a React app where the state data structure looks exactly like our concrete syntax tree.

We’re launching soon. You can subscribe for early access here: https://heavenlyx.com/get-started

[+] jdonaldson|6 years ago|reply
Haxe has a nice mechanism to handle this called "abstract types". https://haxe.org/manual/types-abstract.html

The critical validation step happens when the abstract type is created, not when it is used or passed to other functions, similar to the example in TFA.

The added benefit is that you get a type that behaves pretty closely to a class, but adds no typing or runtime memory overhead. E.g. here's an example for an "unsafe" string : https://gist.github.com/jdonaldson/9a09287e540e7392857f

Another benefit is that abstract types can define relationships between multiple types in this way, making it possible to perform validation for functions that need to handle "Either<T>"-style types.

[+] cryptica|6 years ago|reply
It doesn't make sense to use JSON as the interchange format for a statically typed language. There is an impedance mismatch between the two: you are forced to infer types which are not actually there.

The reason why JSON is so popular is the same reason why dynamic typed languages became popular. The interfaces between components are simple and human-readable. A library written in a dynamically typed language can be effectively integrated into any system without the possibility of type mismatches.

If you have an API written with Protocol Buffers, every system which interacts with yours needs to agree with your type system; this creates tighter coupling.

[+] whateveracct|6 years ago|reply
Doesn't this hold true for all serialization ever? Replace JSON with "bytes" (aka any serialized data) and it still holds:

It doesn't make sense to use bytes as the interchange format for a statically typed language. There is an impedance mismatch between the two. You are forced to infer types which are not actually there.

[+] johnday|6 years ago|reply
> It doesn't make sense to use JSON as the interchange format for a statically typed language.

Two problems here. First, there are multiple statically typed languages with different type systems, and it is nontrivial to translate between them. It's not "having a type system" versus "not having one"; it's type system A vs. type system B vs. type system C vs. none.

Second, this is only true if you assume that no dynamically typed language will ever want to interoperate with your code, which is almost always incorrect to presume.