The Parser that Cracked the MediaWiki Code

[+] neilk|15 years ago|reply

This isn't the first alternative parser for MediaWiki content -- there are 28 rows in this table. (I just added Sweble's and my own project...)

http://www.mediawiki.org/wiki/Alternative_parsers#Known_impl...

Most of these are special purpose hacks. Kiwi and Sweble are the most serious projects I'm aware of, that have tried to generate a full parse.

However, few of these projects are useful for upgrading Wikipedia itself. Even the general parsers like Sweble are effectively special-purpose, since we have a lot of PHP that hooks into the parser and warps its behaviour in "interesting" ways. The average parser geek usually wants to write to a cleaner spec in, well, any language other than PHP. ;)

Currently the Wikimedia Foundation is just starting a MediaWiki.next project. Parsing is just one of the things we are going to change in major ways -- fixing this will make it much easier to do WYSIWYG editing or to publish content in ways that aren't just HTML pages.

(Obviously we will be looking at Sweble carefully.)

If this sounds like a fun project to you, please get in touch! Or check out the "Future" portal on MediaWiki.org.

http://www.mediawiki.org/wiki/Future

[+] knowtheory|15 years ago|reply

Hey Neilk!

Did you ever turn up anything regarding this? http://news.ycombinator.com/item?id=2216249

btw, neat js parser, i'll have to check it out. :)

[+] sigil|15 years ago|reply

It's great to see people tackling this problem, but I wouldn't declare victory for sweble just yet ("The Parser That Cracked..."). There are other promising MediaWiki parser efforts out there.

For one, sweble is a Java parser, and I'm not sure this makes it a good drop-in replacement for the current MediaWiki PHP code. The DBPedia Project also has what looks like a decent AST-based Java parser [1]. I would be interested in a comparison between sweble and DBPedia's WikiParser.

I stumbled across a very nice MediaWiki scanner and parser in C a while ago [2]. It uses ragel [3] for the scanner; the parser is not a completely generic AST builder, but is rather specific to the problem of converting MediaWiki markup to some other wiki markup. It does do quite a bit of the parser work already though.

Presumably a PHP extension around a C or C++ scanner/parser could someday replace the current MediaWiki parsing code.

[1] http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=...

[2] http://git.wincent.com/wikitext.git

[3] http://www.complang.org/ragel/

[+] ZoFreX|15 years ago|reply

Given the complexity of Wikipedia's deployment compared to a typical MediaWiki installation, it really wouldn't be much effort to hook into a parser in say, Java rather than PHP, and would be well worth doing if it had significant benefits.

Of course, a PHP parser would still have to be maintained in parallel as not everyone would be able to do the Java option.

[+] bjonathan|15 years ago|reply

Site down, here is a mirror: https://www.readability.com/articles/r9i55x6e

cache version: http://webcache.googleusercontent.com/search?q=cache:8xjwEj-...

[+] rwolf|15 years ago|reply

Your readability link redirects me to readabilities home page.

[+] ropers|15 years ago|reply

Or try Coral Cache:

http://dirkriehle.com.nyud.net/2011/05/01/the-parser-that-cr...

"It worked for me." ;-)

[+] car|15 years ago|reply

http://www.sweble.org is the actual Wikitext parser project homepage. Please go there until dirkriehle.com is back up.

[+] sunir|15 years ago|reply

This is a breakthrough and a welcome one. From a end user point of view, it has a couple major implications.

First, I believe this reveals the complexity of the parser, which implies a complex syntax, which implies a complex user interface as felt by end users. A more complex the user interface may make it harder it is to attract new editors, although it's unclear (to me) if that is a fact.

Second, having an AST representation is awesome. It makes it possible to even think about building a path towards WYSIWYG or some other form of rich text editing. It was not really possible to build a WYSIWYG editor around the wiki syntax.

If you have an AST, you can also store the page as the AST since you can regenerate the wiki syntax from the AST for people who need text-based editors.

[+] tokenadult|15 years ago|reply

A more complex the user interface may make it harder it is to attract new editors

There may be friction against gaining new editors from the user interface of the MediaWiki software, but I think the greatest barrier to participation by new editors is the hostile, drama-filled environment on many controversial topics on Wikipedia. My evidence for that is the decline in "unsustainable fashion"

http://strategy.wikimedia.org/wiki/Story_of_Wikimedia_Editor...

in the number of Wikipedian administrators, who presumably for the most part are people who know how to use Wikimedia software. Too many of best contributors (people who look up facts in reliable sources and edit articles for better readability) on Wikipedia feel attacked and that their time is wasted. I know a lot of dedicated hobbyists who quietly work on their hobby-related subjects putting together great articles, but on any subject that is controversial, and for which looking up reliable sources takes some effort, Wikipedia is becoming a war zone and is not improving in quality.

http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_P...

[+] _delirium|15 years ago|reply

First, I believe this reveals the complexity of the parser, which implies a complex syntax, which implies a complex user interface as felt by end users.

That's the case to some extent, but the opposite is also the case to some extent. Some of the difficulty of parsing is because "ease of human use" has been a much higher priority than "ease of parsing" when discussing syntax, which leads to some constructs that aren't easy to parse with typical CFG-type parsing approaches. It's also designed to be very lenient to ordering and common errors, much like a modern non-strict HTML parser, which makes hand-writing the syntax more friendly and forgiving, but with a tradeoff that the parser has to be more complex, because it doesn't have the luxury of just returning a parse error.

[+] mdaniel|15 years ago|reply

From reading the article, and especially the interesting comments thereon, it seems this problem is half a bogus "language" specification and half that the unwashed masses are inputting any damn thing they like and Wikipedia accepts it.

I suppose this is one of the knobs that must be tuned to balance between reproducible I/O and turning away meaningful contributions from the community.

[+] Semiapies|15 years ago|reply

I hadn't realized that there were any parsing issues around MediaWiki's markup. 5000 lines of PHP? Eek.

[+] sigil|15 years ago|reply

It's worse. The MediaWiki PHP code doesn't implement a proper scanner and parser, it's a bunch of regexes around which the code has grown more or less organically. Silent compensation for mismatched starting and ending tokens abounds, and causes problems for all consumers of the markup, in the same way that lenient HTML parsers have. The difference is that Wikipedia, as the sole channel for editing markup, could have easily rejected syntax errors with helpful messages instead of silently compensating.

If it was anything else, I'd say "who cares," but this is "the world's knowledge" -- we absolutely should care about the format it's stored in. I'm glad to see people tackling this problem.

[+] pornel|15 years ago|reply

AST of an example page is the interesting bit:

http://sweble.org/crystalball/result?query=ASDF&format=t...

[+] car|15 years ago|reply

Site is down due to harddisk problems, but the actually referenced Sweble Wikipedia Parser project site is at http://www.sweble.org.

31 comments