jzimmerman64's comments

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

> I can say that this is a crux for me because, if you convince me that the linear-time guarantee is this important and that langcc has it

The fact that langcc has the linear-time guarantee follows from the fact that it generates shift/reduce parsers that take steps according to an LR DFA. If you are not already convinced that this property is important, I am not sure that I can convince you, except to ask whether you would be comfortable running, e.g., a worst-case-O(n^3) parser for a real industrial language in production. (Or, for that matter, running a parser such as GLR which may fail with an ambiguity at parse time. langcc parsers, of course, will report potential ambiguity as an LR conflict at _parser generation time_; once langcc produces the LR DFA, its behavior at parse time is unambiguous.)
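As a toy illustration (this is not langcc's generated code, just a minimal hand-built sketch), here is why table-driven shift/reduce parsing over an LR automaton is linear-time: the parser for the grammar `E -> E '+' 'n' | 'n'` below does one table lookup per step, each shift consumes a token, and each reduce pops a grammar-bounded number of states, so the step count is O(n) with no backtracking.

```python
# Minimal table-driven LR parser for  E -> E '+' 'n' | 'n'
# (hand-built LR(0) table; a toy illustration, not langcc output).

ACTION = {
    (0, 'n'): ('shift', 2),
    (1, '+'): ('shift', 3),
    (1, '$'): ('accept', None),
    (2, '+'): ('reduce', ('E', 1)),   # E -> n
    (2, '$'): ('reduce', ('E', 1)),
    (3, 'n'): ('shift', 4),
    (4, '+'): ('reduce', ('E', 3)),   # E -> E + n
    (4, '$'): ('reduce', ('E', 3)),
}
GOTO = {(0, 'E'): 1}

def parse(tokens):
    stack = [0]
    i, steps = 0, 0
    toks = tokens + ['$']
    while True:
        steps += 1
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:
            raise SyntaxError(f"unexpected {toks[i]!r}")
        kind, arg = act
        if kind == 'shift':
            stack.append(arg); i += 1
        elif kind == 'reduce':
            lhs, n = arg
            del stack[-n:]                     # pop a bounded number of states
            stack.append(GOTO[(stack[-1], lhs)])
        else:  # accept
            return steps

# Every token is shifted exactly once and each reduce pops at most
# max-production-length states, so total steps grow linearly in input size.
print(parse(['n', '+', 'n', '+', 'n']))  # 9 steps for 5 tokens
```

Note that there is no lookahead search and no chart: the next action is fully determined by the current DFA state and input symbol, which is exactly the property that rules out super-linear blowup.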

There are also many other important theoretical innovations in the langcc paper, e.g., the LR NFA optimization procedures, the conflict tracing algorithm, and XLR.

> This is an interesting set of claims because all the arguments I've heard in the past in favor of handwritten parsers were about error messages, not perf. I know langcc has an idea for that as well, but you seem to be focusing on performance as the real breakthrough.

Yes, langcc also features a novel LR conflict tracing algorithm, which makes conflicts very easy to debug at parser generation time. I am not sure I would call performance the "breakthrough"; rather, linear-time performance is a _prerequisite_, and the theoretical advances of the langcc paper are what allow us to satisfy that prerequisite.

> Would you accept a generated parser for Ruby, C++, Haskell, Delphi, or Verilog that is empirically within 3x the parsing speed of the handwritten alternative as sufficient evidence that some existing work is already competitive on this front?

I have not studied Ruby, Delphi, or Verilog extensively, so could not comment on those languages at the moment. I know that Golang 1.17.8 is sufficiently difficult to parse as to be a good benchmark. I am not sure whether Haskell is sufficiently difficult to parse, but would at least be _interested_ in results for it. C++ parsing is undecidable, so that would seem to rule it out as a candidate for a fully automatically generated parser.

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

> everyone in the community who's brought this up has said something along the lines of "the days of handwritten parsers are over."

I am aware of this "consensus", but I believe it is largely wishful thinking. Prior to langcc, the community had yet to propose an alternative that is general and efficient enough (esp., linear-time) to replace hand-written recursive-descent parsers for industrial languages such as Golang. lex and yacc are not sufficient for this. It is only with the advances of the langcc paper (improved optimization procedures, CPS transformation, semantic attributes, etc.) that this has become possible.

> I believe I did in my last comment. The main thing being the duplication of productions. The only interpretation of this that makes sense to me as a claim about the paper (as opposed to a claim about CFGs in general) is that it somehow requires the user to write a production or variants thereof twice, which is false.

My comments were referring to the compilation of tree automata to CFGs, which seems to be necessary in order to use them with standard LR parsing techniques. The alternative seems to be to use the automata to filter sets of possible parses after-the-fact, which does not by itself lead to an efficient parsing algorithm.

> But why Golang, and why comparison to the handwritten parser?

Golang is a representative example of the formidable complexity, and near-ambiguity, of constructs that are commonly used in industrial languages - and which people often take as a justification that handwritten parsers are necessary. In order to be competitive with hand-written parsers (and thus support the "consensus" that hand-written parsers are no longer necessary), a generated parser should at least be guaranteed linear-time, and should be comparable in terms of concrete efficiency benchmarks.

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

The preprocessor would likely have to be a separate ad-hoc step, but it is possible that langcc could be used to generate a full C frontend if one is willing to run two passes (needed to deal with ambiguous cases such as "x * y", which can be either a multiplication expression or a variable declaration, depending on the contents of the symbol table).
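To make the ambiguity concrete, here is a toy sketch (not langcc code) of the decision the second pass has to make: the same token sequence "x * y;" is a pointer declaration if `x` names a type in the symbol table, and a multiplication expression otherwise.

```python
# Toy illustration of why parsing "x * y;" in C requires the symbol table.

def classify(stmt_tokens, type_names):
    a, star, b, semi = stmt_tokens
    assert star == '*' and semi == ';'
    if a in type_names:
        return ('decl', a, b)       # "x *y;" declares y as pointer-to-x
    return ('mul-expr', a, b)       # "x * y;" multiplies x by y

print(classify(['x', '*', 'y', ';'], type_names={'x'}))   # ('decl', 'x', 'y')
print(classify(['x', '*', 'y', ';'], type_names=set()))   # ('mul-expr', 'x', 'y')
```

Since `type_names` depends on earlier declarations (and, in C++, on template computations), no purely declarative BNF spec can resolve this at the grammar level.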

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

> Generating the tree is trivial.

If by "generating the tree" you mean generating definitions for the AST data structures, this is nontrivial enough that yacc does not do it.

> The lexer is part of the parser.

Again, yacc (a parser generator) does not include a lexer generator; that requires a second package such as lex.

> A pretty printer is not part of a compiler.

A pretty-printer for the IR or target language is an important part of a compiler, as one needs to render the output in a form that some other tool (e.g., an assembler) can accept. A pretty-printer for the source language also comes in handy.

> The rest you have to do by hand.

What you have to do by hand is to write a function from the source AST to the target AST (i.e., a compiler backend). langcc generates the whole frontend, not just the parser.

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

> back when it was considered impossibly impractical (the "dragon book" said so!)

Yes, with some optimizations (e.g., Pager's algorithm), LR(1) can be practical; in langcc we even implement LR(k), and this is also practical for many real-world grammars.

However, a key observation is that even LR(k) is not enough for many industrial language features. You may enjoy reading the companion technical report for langcc, which delves deep into the underlying parsing theory and proposes a number of new techniques:

https://arxiv.org/pdf/2209.08383.pdf

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

> Whereas langcc boasts automatically-generated parsers for the easy-to-parse Python and Go, my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others.

Python may be (comparatively) easy to parse, but Golang is not - it is _barely_ LR(1) with the extensions described in https://arxiv.org/pdf/2209.08383.pdf (notably, semantic attributes and the CPS transformation).

> my old employer has automatically-generated parsers for Ruby, C++, COBOL, and about 100 others

GLR parsing, as I note in the paper, is very different from pure LR(k) parsing. Are the automatically-generated parsers you mention guaranteed to run in linear time? Are they more efficient than hand-written recursive-descent for the same languages?

Not to mention that full C++ parsing, as I also note in the paper, is undecidable, due to simple ambiguities such as "x * y;" (whose parse depends on whether x names a value or a type, which may depend on template computations). So it should be impossible to fully automatically generate a parser for C++ from a declarative BNF spec, without some significant ad-hoc machinery on top.

> he's writing about GLR parsing and Earley parsing as if they're the same. Now that I think about it, I see some resemblance, but they're usually considered quite different.

I'm not calling them "the same"; I'm grouping them together into the same family, as they share some particular characteristics relevant to the paper (polynomial but not necessarily linear time, left-to-right, dynamic-programming-style).
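To illustrate what I mean by that family, here is a compact Earley recognizer (a toy; it assumes no epsilon productions). It fills a dynamic-programming chart of items per input position, which tolerates ambiguous grammars but gives polynomial (worst-case cubic) rather than guaranteed-linear behavior:

```python
# Compact Earley recognizer: chart-based, dynamic-programming style.
# Toy code for illustration; grammars with epsilon productions need
# extra care and are not handled here.

def earley_recognize(grammar, start, tokens):
    # grammar: dict mapping each nonterminal -> list of right-hand sides (tuples)
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]   # items: (lhs, rhs, dot, origin)
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        work = list(chart[i])
        while work:
            lhs, rhs, dot, origin = work.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                      # predict
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); work.append(item)
                elif i < n and tokens[i] == sym:        # scan
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                       # complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item); work.append(item)
    return any(l == start and d == len(r) and o == 0
               for l, r, d, o in chart[n])

# Ambiguous grammar E -> E '+' E | 'n': an LR generator would report a
# conflict at generation time; the chart simply carries all parses.
AMBIG = {'E': [('E', '+', 'E'), ('n',)]}
print(earley_recognize(AMBIG, 'E', ['n', '+', 'n', '+', 'n']))  # True
```

The chart can hold O(n^2) items and completion scans earlier chart rows, which is where the super-linear worst case comes from; GLR's graph-structured stack has the same flavor, even though the two algorithms are usually presented quite differently.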

> This has not been true since ANTLR 2, released 25 years ago. The LL(*) algorithm of ANTLR 3, described in a 2011 publication, is capable of parsing even some context-sensitive grammars.

Again, the implicit claim here is "generated from a declarative BNF spec, efficiently, with guaranteed linear time, parsing full industrial languages...". Of course one can parse many languages with backtracking. Do you have examples of full industrial languages parsed with ANTLR 3 that satisfy the relevant criteria?

> This is one of my favorite papers of the last 5 years, and this is totally inaccurate.

Could you say what, specifically, is inaccurate about it?

> In short, I think the author has the knowledge to claim that his solution beats the best that the 70's and 80's has to offer. I am not motivated to look at this any further because I've been given no reason to believe it's had a fair comparison to what the 2010's has to offer.

Again, I would ask you to exhibit concrete results here. Show me an existing tool that is already capable of generating a parser for Golang 1.17.8, from a declarative BNF spec, with guaranteed linear-time parsing, and that matches or exceeds the hand-written parser in speed.

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

It's not just a parser generator; it also generates the AST, traversals, lexer, and pretty-printer. And it can generate the AST for your IRs and target language, if you specify them in BNF as well. This is a pretty significant piece of a compiler, all told.
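To give a feel for what that buys you (this is an illustrative hand-written sketch, not langcc's actual generated code), the artifacts driven by a toy BNF spec `Expr ::= Expr '+' int | int` would include AST node definitions, generic traversals, and a pretty-printer:

```python
# Hand-written sketch of frontend artifacts for  Expr ::= Expr '+' int | int
# (illustrative only; langcc generates analogous code in C++).

from dataclasses import dataclass

@dataclass
class Lit:
    value: int

@dataclass
class Add:
    lhs: 'Expr'   # Expr ::= Lit | Add
    rhs: 'Expr'

def pretty(e) -> str:
    """Pretty-printer: render the AST back to source form."""
    if isinstance(e, Lit):
        return str(e.value)
    return f"({pretty(e.lhs)} + {pretty(e.rhs)})"

def count_nodes(e) -> int:
    """A generic traversal over the AST."""
    if isinstance(e, Lit):
        return 1
    return 1 + count_nodes(e.lhs) + count_nodes(e.rhs)

tree = Add(Lit(1), Add(Lit(2), Lit(3)))
print(pretty(tree))       # (1 + (2 + 3))
print(count_nodes(tree))  # 5
```

Writing and maintaining this boilerplate by hand, for every AST in every pass, is exactly the overhead that generating it from the BNF spec eliminates.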

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

Correct. The C/C++ grammar is not complete; it only encodes the subset of the language that was needed to bootstrap langcc itself.

As far as I am aware, however, the Python grammar parses the full industrial language Python 3.9.12.

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

As far as we are aware, the grammars included with langcc (go.lang and py.lang) parse the full industrial programming languages Golang 1.17.8 and Python 3.9.12. They are not restricted subsets (cc.lang, by contrast, is a restricted subset of C++).

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

The README is a work in progress, but some of these answers can be found in the documentation (https://arxiv.org/abs/2209.08385). In particular:

> How are the error messages?

langcc produces "confusing input pairs" for LR conflicts, which enable the user to pinpoint the cause of the error much more easily than the opaque shift/reduce errors in tools such as yacc/bison.

> Does it produce source spans for intermediate AST nodes?

Yes! You can find an example in https://github.com/jzimmerman/langcc/blob/main/examples/calc....

> Can it handle operator precedence without encoding the precedence into the grammar?

Yes - detailed in the documentation.

> Can it handle source-code-defined operator precedence?

Depends on what you mean by source-code-defined. The operations supported by langcc's precedence stanzas (detailed in the documentation) are quite comprehensive, and suffice to encode the operator precedence rules in Golang 1.17.8, for example.
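For readers unfamiliar with the general idea, here is a generic precedence-climbing sketch (this is _not_ langcc's mechanism; langcc uses declarative precedence stanzas described in the documentation). It shows how precedence and associativity can live in a table instead of being encoded as stratified grammar levels:

```python
# Generic precedence climbing: precedence lives in a table, not the grammar.
# Toy illustration; operator set and table are made up for the example.

PREC = {'+': (1, 'left'), '-': (1, 'left'), '*': (2, 'left'), '^': (3, 'right')}

def parse_expr(tokens, min_prec=1):
    # tokens: flat list, consumed in place; atoms are ints
    lhs = tokens.pop(0)
    while tokens and tokens[0] in PREC and PREC[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, assoc = PREC[op]
        # left-assoc: require strictly higher precedence on the right
        rhs = parse_expr(tokens, prec + 1 if assoc == 'left' else prec)
        lhs = (op, lhs, rhs)
    return lhs

print(parse_expr([1, '+', 2, '*', 3]))   # ('+', 1, ('*', 2, 3))
print(parse_expr([2, '^', 3, '^', 4]))   # ('^', 2, ('^', 3, 4))
```

Changing a precedence level or associativity is then a one-line table edit rather than a restructuring of the grammar.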

jzimmerman64 | 3 years ago | on: Langcc: A Next-Generation Compiler Compiler

> I suggest the developer also look into auto-generated visitor / traversal functions as well

In fact, langcc already has visitor/traversal functions, though they are not so well documented yet. I have just added an example (expr_simpl in https://github.com/jzimmerman/langcc/blob/main/examples/calc...).

> Another suggestion is to make sure the grammar preserves source information (file, line, column span)

This is already present as well (the calc example illustrates this functionality).

> implement re-printing which preserves comments and whitespace

I considered doing this, but it's surprisingly difficult to get the specification right, especially in whitespace-sensitive languages like Python.
