The semantic web is now widely adopted

[+] openrisk|1 year ago|reply

The semantic web standards are sorely lacking (for decades now) a killer application. Not in a theoretical universe of decentralized philosopher-computer-scientists but in the dumbed down, swipe-the-next-30sec-video, adtech oligopolized digital landscape of walled gardens. Providing better search metadata is hardly that killer app. Not in 2024.

The lack of adoption has, imho, two components.

1. bad luck: the Web got worse, a lot worse. There hasn't been a Wikipedia-like event for many decades. This was not pre-ordained. Bad stuff happens to societies when they don't pay attention. In a parallel universe where the good Web won, the semantic path would have been much more traveled and developed.

2. incompleteness of vision: if you dig to their nuclear core, semantic apps offer things like SPARQL queries and reasoners. Great, these functionalities are both unique and have definite utility but there is a reason (pun) that the excellent Protege project [1] is not the new spreadsheet. The calculus of cognitive cost versus tangible benefit to the average user is not favorable. One thing that is missing are abstractions that will help bridge that divide.

Still, if we aspire to a better Web, the semantic web direction (if not current state) is our friend. The original visionaries of the semantic web where not out of their mind, they just did not account for the complex socio-economics of digital technology adoption.

[1] https://protege.stanford.edu/

[+] recursivedoubts|1 year ago|reply

The semantic web has been, in my opinion, a category error. Semantics means meaning and computers/automated systems don't really do meaning very well and certainly don't do intention very well.

Mapping the incredible success of The Web onto automated systems hasn't worked because the defining and unique characteristic of The Web is REST and, in particular, the uniform interface of REST. This uniform interface is wasted on non-intentional beings like software (that I'm aware of):

https://intercoolerjs.org/2016/05/08/hatoeas-is-for-humans.h...

Maybe this all changes when AI takes over, but AI seems to do fine without us defining ontologies, etc.

It just hasn't worked out the way that people expected, and that's OK.

[+] debarshri|1 year ago|reply

At TU delft, I was supposed to do my PhD in semantic web especially in the shipping logistics. It was funded by port of Rotterdam 10 years ago. Idea was to theorize and build various concepts around discrete data sharing, data discovery, classification, building ontology, query optimizations, automation and similar usecases. I decided not to pursue phd a month into it.

I believe in semantic web. The biggest problem is that, due to lack of tooling and ease of use, it take alot of effort and time to see value in building something like that across various parties etc. You dont see the value right away.

[+] ricardo81|1 year ago|reply

There's another element, trusting the data.

Often that may require some web scale data, like Pagerank but also any other authority/trust metric where you can say "this data is probably quality data".

A rather basic example, published/last modified dates. It's well known in SEO circles at least in the recent past that changing them is useful to rank in Google, because Google prefers fresh content. Unless you're Google or have a less than trivial way of measuring page changes, the data may be less than trustworthy.

[+] h4ck_th3_pl4n3t|1 year ago|reply

Say what you want, but Macromedia Dreamweaver came pretty close to being "that killer app". Microsoft attempted the same with Frontpage, but abandoned it pretty quickly as they always do.

I think that Web Browsers need to change what they are. They need to be able to understand content, correlate it, and distribute it. If a Browser sees itself not as a consuming app, but as a _contributing_ and _seeding_ app, it could influence the semantic web pretty quickly, and make it much more awesome.

Beaker Browser came pretty close to that idea (but it was abandoned, too).

Humans won't give a damn about hand-written semantic code, so you need to make the tools better that produce that code.

[+] jancsika|1 year ago|reply

> There hasn't been a Wikipedia-like event for many decades.

Off the top of head...

OpenStreetMap was in 2004. Mastodon and the associated spec-thingy was around 2016. One/two decades is not the same as many decades.

Oh, and what about asm.js? Sure, archive.org is many decades old. But suddenly I'm using it to play every retro game under the sun on my browser. And we can try out a lot of FOSS software in the browser without installing things. Didn't someone post a blog to explain X11 where the examples were running a javascript implementation of the X window system?

Seems to me the entire web-o-sphere leveled up over the past decade. I mean, it's so good in fact that I can run an LLM clientside in the browser. (Granted, it's probably trained in part on your public musing that the web is worse.)

And all this while still rendering Berkshire Hathaway website correctly for many decades. How many times would the Gnome devs have broken it by now? How many upgrades would Apple have forced an "iweb" upgrade in that time?

Edit: typo

[+] rakoo|1 year ago|reply

Over on lobste.rs, someone cited another article retracing the history of the Semantic Web: https://twobithistory.org/2018/05/27/semantic-web.html

An interesting read in itself, and also points to Cory Doctorow giving seven reasons why the Semantic Web will never work: https://people.well.com/user/doctorow/metacrap.htm. They are all good reasons and are unfortunately still valid (although one of his observations towards the end of the text has turned out to be comically wrong, I'll let you read what it is)

Your comment and the two above links point to the same conclusion: again and again, Worse is Better (https://en.wikipedia.org/wiki/Worse_is_better)

[+] austin-cheney|1 year ago|reply

A killer app is still not enough.

People can’t get HTML right for basic accessibility, so something like the semantic web would be super science that people will out of their way to intentionally ignore any profit upon so long as they can raise their laziness and class-action lawsuit liability.

[+] vasco|1 year ago|reply

> There hasn't been a Wikipedia-like event for many decades

I'll give you two examples: Internet Archive. Let's Encrypt.

[+] jl6|1 year ago|reply

Killer applications solve real problems. What is the biggest real problem on the web today? The noise flood. Can semantic web standards help with that? Maybe! Something about trust, integrity, and lineage, perhaps.

[+] echelon|1 year ago|reply

Search and ontologies weren't the only goals. Microformats enabled standardized data markup that lots of applications could consume and understand.

RSS and Atom were semantic web formats. They had a ton of applications built to publish and consume them, and people found the formats incredibly useful.

The idea was that if you ran into ingestible semantic content, your browser, a plugin, or another application could use that data in a specialized way. It worked because it was a standardized and portable data layer as opposed to a soup of meaningless HTML tags.

There were ideas for a distributed P2P social network built on the semantic web, standardized ways to write articles and blog posts, and much more.

If that had caught on, we might have saved ourselves a lot of trouble continually reinventing the wheel. And perhaps we would be in a world without walled gardens.

[+] WolfOliver|1 year ago|reply

Graph Based RAG systems look promising https://www.ontotext.com/knowledgehub/fundamentals/what-is-g...

[+] DrScientist|1 year ago|reply

I think the problem with any sort of ontology type approach is the problem isn't solved when you have defined the one ontology to rule them all after many years of wrangling between experts.

As what you have done is spend many years generating a shared understanding of what that ontology means between the experts. Once that's done you have the much harder task for pushing that shared understanding to the rest of the world.

ie the problem isn't defining a tag for a cat - it's having a global share vision of what a cat is.

I mean we can't even agree on what is a man or a women.

[+] glenstein|1 year ago|reply

I am not sure I understand the fixation on a "killer app" in the context of web standards. We are talking about things like, say, XML, or SVG or HTTP/2. They can have their rationale and their value simply by serving to enable organic growth of a web ecosystem. I think I agree most with your last sentence and should define success more in those terms, aspiring to a better web.

[+] brimwats|1 year ago|reply

I think pointing just to Wikipedia ignores the growing use adoption and massive impact of Wikidata. perhaps I'm biased because of my field but everything I see indicates the growing and not shrinking power of it, I would categorize it as different than Wikipedia's effects though.

[+] EGreg|1 year ago|reply

Why do we need web standards for the semantic web anymore when we have LLMs?

Just make LLMs more ubiquitous and train them on the Web. Rather than crawling or something. The LLMs are a lot more resilient.

[+] cyanydeez|1 year ago|reply

i think you're confused. the killer app is everyone following the same format, and such, capitalists can extract all that information and sell LLMs that no one wants in place of more deterministic search and data products.

[+] bigiain|1 year ago|reply

I laughed at this bit:

"Googlers, if you're reading this, JSON-LD could have the same level of public awareness as RSS if only you could release, and then shut down, some kind of app or service in this area. Please, for the good of the web: consider it."

[+] ubertaco|1 year ago|reply

Well, the immediate initial test failed for me: I thought, "why not apply this on one of my own sites, where I have a sort of journal of poetry I've written?"...and there's no category for "Poem", and the request to add Poem as a type [1] is at least 9 years old, links to an even older issue in an unreadable issue tracker without any resolution (and seemingly without much effort to resolve it), and then dies off without having accomplished anything.

[1] https://github.com/schemaorg/suggestions-questions-brainstor...

[+] npunt|1 year ago|reply

The argument about LLMs is wrong, not because of reasons stated but because semantic meaning shouldn't solely be defined by the publisher.

The real question is whether the average publisher is better than an LLM at accurately classifying their content. My guess is, when it comes to categorization and summarization, an LLM is going to handily win. An easy test is: are publishers experts on topics they talk about? The truth of the internet is no, they're not usually.

The entire world of SEO hacks, blogspam, etc exists because publishers were the only source of truth that the search engine used to determine meaning and quality, which has created all the sorts of misaligned incentives that we've lived with for the past 25 years. At best there are some things publishers can provide as guidance for an LLM, social card, etc, but it can't be the only truth of the content.

Perhaps we will only really reach the promise of 'the semantic web' when we've adequately overcome the principal-agent problem of who gets to define the meaning of things on the web. My sense is that requires classifiers that are controlled by users.

[+] tsimionescu|1 year ago|reply

If even the semantic web people are declaring victory based on a post title and a picture for better integration with Facebook, then it's clear that Semantic Web as it was envisioned is fully 100% dead and buried.

The concept of OWL and the other standards was to annotate the content of pages, that's where the real values lie. Each paragraph the author wrote should have had some metadata about its topic. At the very least, the article metadata was supposed to have included information about the categories of information included in the article.

Having a bit of info on the author, title (redundant, as HTML already has a tag for that), picture, and publication date is almost completely irrelevant for the kinds of things Web 3.0 was supposed to be.

[+] CaptArmchair|1 year ago|reply

I'm a bit surprised that the author doesn't mention key concepts such as linked data, RDF, federation and web querying. Or even the five stars of linked open data. [1] Sure, JSON-LD is part of it, but it's just a serialization format.

The really neat part is when you start considering universal ontologies and linking to resources published on other domains. This is where your data becomes interoperable and reusable. Even better, through linking you can contextualize and enrich your data. Since linked data is all about creating graphs, creating a link in your data, or publishing data under a specific domain are acts that involves concepts like trust, authority, authenticity and so on. All those murky social concepts that define what we consider more or less objective truths.

LLM's won't replace the semantic web, nor vice versa. They are complementary to each other. Linked data technologies allow humans to cooperate and evolve domain models with a salience and flexibility which wasn't previously possible behind the walls and moats of discrete digital servers or physical buildings. LLM's work because they are based on large sets of ground truths, but those sets are always limited which makes inferring new knowledge and asserting its truthiness independent from human intervention next to impossible. LLM's may help us to expand linked data graphs, and linked data graphs fashioned by humans may help improve LLM's.

Creating a juxtaposition between both? Well, that's basically comparing apples against pears. They are two different things.

[1] https://5stardata.info/en/

[+] hmottestad|1 year ago|reply

Metadata in PDFs is also typically based on semantic web standards.

https://www.meridiandiscovery.com/articles/pdf-forensic-anal...

Instead of using JSON-LD it uses RDF written as XML. Still uses the same concept of common vocabularies, but instead of schema.org it uses a collection of various vocabularies including Dublin Core.

[+] mg|1 year ago|reply

The author gives two reasons why AI won't replace the need for metadata:

1: LLMs "routinely get stuff wrong"

2: "pricy GPU time"

1: I make a lot of tests on how well LLMs get categorization and data extraction right or wrong for my Product Chart (https://www.productchart.com) project. And they get pretty hard stuff right 99% of the time already. This will only improve.

2: Loading the frontpage of Reddit takes hundreds of http requests, parses megabytes of text, image and JavaScript code. In the past, this would have been seen as an impossible task to just show some links to articles. In the near future, nobody will see passing a text through an LLM as a noteworthy amount of compute anymore.

[+] atoav|1 year ago|reply

Let's hope you never write articles about court cases then: https://www.heise.de/en/news/Copilot-turns-a-court-reporter-...

The alleged low error rate of 1% can ruin your day/life/company, if it hits the wrong person, regards the wrong problem, etc. And that risk is not adequately addressed by hand-waving and pointing people to low error rates. In fact, if anything such claims would make me less confident in your product.

1% error is still a lot if they are the wrong kind of error in the wrong kind of situation. Especially if in that 1% of cases the system is not just slightly wrong, but catastrophically mind-bogglingly wrong.

[+] zaik|1 year ago|reply

> Reddit takes hundreds of http requests, parses megabytes of text, image and JavaScript code [...] to show some links to articles

Yes, and I hate it. I closed Reddit many times because the wait time wasn't worth it.

[+] esjeon|1 year ago|reply

> I make a lot of tests on how well LLMs get categorization and data extraction right or wrong for my Product Chart (https://www.productchart.com) project.

In fact, what you're doing there is building a local semantic database by automatically mining metadata using LLM. The searching part is entirely based on the metadata you gathered, so the GP's point 1 is still perfectly valid.

> In the near future, nobody will see passing a text through an LLM as a noteworthy amount of compute anymore.

Even with all that technological power, LLMs won't replace most simple-searching-over-index, as they are bad at adapting to ever changing datasets. They only can make it easier.

[+] 8organicbits|1 year ago|reply

Oh nice, Product Chart looks like a great fit for what LLMs can actually do. I'm generally pretty skeptical about LLMs getting used, but looking at the smart phone tool: this is the sort of product search missing from online stores.

Critically, if the LLM gets something wrong, a user can notice and flag it, then someone can manually fix it. That's 100x less work than manually curating the product info (assuming 1% error rate).

[+] intended|1 year ago|reply

Only slightly tongue in cheek, but if your measure of success is Reddit, perhaps a better example may serve your argument?

[+] monero-xmr|1 year ago|reply

LLMs have no soul, so I like content and curation from real people

[+] rapsey|1 year ago|reply

GPU compute price is dropping fast and will continue to do so.

[+] menzoic|1 year ago|reply

How does Product Chart use LLMs?

[+] throwme_123|1 year ago|reply

For my part, I stopped reading at the free bashing of blockchain•.

Reminded me of the angst and negativity of these original "Web3" people, already bashing everything that was not in their mood back then.

• The crypto ecosystem is shady, I know, but the tech is great

[+] hoosieree|1 year ago|reply

> If Web 3.0 is already here, where is it, then? Mostly, it's hidden in the markup.

I feel like this is so obvious to point out that I must be missing something, but the whole article goes to heroic lengths to avoid... HTML. Is it because HTML is difficult and scary? Why invent a custom JSON format and a custom JSON-to-HTML compiler toolchain than just write HTML?

The semantics aren't hidden in the markup. The semantics are the markup.

[+] trainyperson|1 year ago|reply

Are there any tools that employ LLMs to fill out the Semantic Web data? I can see that being a high-impact use case: people don’t generally like manually filling out all the fields in a schema (it is indeed “a bother”), but an LLM could fill it out for you – and then you could tweak for correctness / editorializing. Voila, bother reduced!

This would also address the two reasons why the author thinks AI is not suited to this task:

1. human stays in the loop by (ideally) checking the JSON-LD before publishing; so fewer hallucination errors

2. LLM compute is limited to one time per published content and it’s done by the publisher. The bots can continue to be low-GPU crawlers just as they are now, since they can traverse the neat and tidy JSON-LD.

——————

The author makes a good case for The Semantic Web and I’ll be keeping it in mind for the next time I publish something, and in general this will add some nice color to how I think about the web.

[+] safety1st|1 year ago|reply

Bringing an LLM into the picture is just silly. There's zero need.

The author (and much of HN?) seems to be unaware that it's not just thousands of websites using JSON-LD, it's millions.

For example: install WordPress, install an SEO plugin like Yoast, and boom you're done. Basic JSON-LD will be generated expressing semantic information about all your blog posts, videos etc. It only takes a few lines of code to extend what shows up by default, and other CMSes support this took.

SEOs know all about this topic because Google looks for JSON-LD in your document and it makes a significant difference to how your site is presented in search results as well as all those other fancy UI modules that show up on Google.

Anyone who wants to understand how this is working massively, at scale, across millions of websites today, implemented consciously by thousands of businesses, should start here:

https://developers.google.com/search/docs/appearance/structu...

https://search.google.com/test/rich-results

Is this the "Semantic Web" that was dreamed of in yesteryear? Well it hasn't gone as far and as fast as the academics hoped, but does anything?

The rudimentary semantic expression is already out there on the Web, deployed at scale today. Someone creative with market pull could easily expand on this e.g. maybe someday a competitor to Google or another Big Tech expands the set of semantic information a bit if it's relevant to their business scenarios.

It's all happening, it's just happening in the way that commercial markets make things happen.

[+] conzept|1 year ago|reply

I think the future holds a synthesis of LLM functions with semantic entities and logic from knowledge graphs (this is called "neuro-symbolic AI"), so each topic/object can have a clear context, upon which you can start prompting the AI for the preferred action/intention.

Already implemented in part on my Conzept Encyclopedia project (using OpenAI): https://conze.pt/explore/%22Neuro-symbolic%20AI%22?l=en&ds=r...

Something like this is much easier done using the semantic web (3D interactive occurence map for an organism): https://conze.pt/explore/Trogon?l=en&ds=reference&t=link&bat...

On Conzept one or more bookmarks you create, can be used in various LLM functions. One of the next steps is to integrate a local WebGPU-based frontend LLM, and see what 'free' prompting can unlock.

JSON-LD is also created dynamically for each topic, based on Wikidata data, to set the page metadata.

[+] rchaud|1 year ago|reply

> Googlers, if you're reading this, JSON-LD could have the same level of public awareness as RSS if only you could release, and then shut down, some kind of app or service in this area. Please, for the good of the web: consider it.

Google has been pushing JSON-LD to webmasters for better SEO for at least 5 years, if not more: https://developers.google.com/search/docs/appearance/structu...

There really isn't a need to do it as most of the relevant page metadata is already captured as part of the Open Graph protocol[0] that Twitter and Facebook popularized 10+ years ago as webmasters were attempting to set up rich link previews for URLs posted to those networks. Markup like this:

is common on most sites now, so what benefit is there for doing additional work to generate JSON-LD with the same data?

[0]https://ogp.me/

[+] Devasta|1 year ago|reply

> Before JSON-LD there was a nest of other, more XMLy, standards emitted by the various web steering groups. These actually have very, very deep support in many places (for example in library and archival systems) but on the open web they are not a goer.

If archival systems and library's are using XML, wouldn't it be preferable to follow their lead and whatever standards they are using? Since they are the ones who are going to use this stuff most, most likely.

If nothing else, you can add a processing instruction to the document they use to convert it to HTML.

[+] anonymous344|1 year ago|reply

Well worth, for whom? as a blogger, these things are 99% for the companies making profit by scraping my content, maybe 1% of the users will need them. Or am I wrong?

[+] codelion|1 year ago|reply

I started this thread on the w3c list almost 20 years ago - https://lists.w3.org/Archives/Public/semantic-web/2005Dec/00...

Unfortunately, it is unlikely we will ever get something like a Semantic web. It seemed like a good idea in the beginning of 2000s but now there is honestly no need for it as it is quite cheap and easy to attach meaning to text due to the progress in LLMs and NLP.

[+] kkfx|1 year ago|reply

Ehm... The semantic web as an idea was/is a totally different thing: the idea is the old libraries of Babel/Bibliotheca Universalis by Conrad Gessner (~1545) [1] or the ability to "narrow"|"select"|"find" just "the small bit of information I want". Observing that a book it's excellent to develop and share a specific topic, it have some indexes to help directly find specific information but that's not enough, a library of books can't be traversed quick enough to find a very specific bit of information like when John Smith was born and where.

The semantic web original idea was the interconnection of every bit of information in a format a machine can travel for a human, so the human can find any specific bit ever written with little to no effort without having to humanly scan pages of moderately related stuff.

We never achieve such goal. Some have tried to be more on the machine side, like WikiData, some have pushed to the extreme the library science SGML idea of universal classification not ended to JSON but all are failures because they are not universal nor easy to "select and assemble specific bit of information" on human queries.

LLMs are a, failed, tentative of achieve such result from another way, their hallucinations and slow formation of a model prove their substantial failure, they SEEMS to succeed for a distracted eye perceiving just the wow effect, but they practically fails.

Aside the issue with ALL test done on the metadata side of the spectrum so far is simple: in theory we can all be good citizens and carefully label anything, even classify following Dublin Core at al any single page, in practice very few do so, all the rest do not care, or ignoring the classification at all or badly implemented it, and as a result is like an archive with some missing documents, you'll always have holes in information breaking the credibility/practical usefulness of the tool.

Essentially that's why we keep using search engines every day, with classic keyword based matches and some extras around. Words are the common denominator for textual information and the larger slice of our information is textual.

[1] https://en.wikipedia.org/wiki/Bibliotheca_universalis

[+] grumbel|1 year ago|reply

I don't see how one can have any hope in a Semantic Web ever succeeding when we haven't even managed to get HTML tags for extremely common Internet things: pricetags, comments, units, avatars, usernames, advertisement and so on. Even things like pagination are generally just a bunch of links, not any kind of semantic thing holding multiple documents together (<link rel> exists, but I haven't seen browsers doing anything with it). Take your average website and look at all the <div>s and <span>s and there is a whole lot more low hanging fruit one could turn semantic, but there seems little interest in even trying to.

[+] rakoo|1 year ago|reply

I don't think we necessarily need new tags: they narrow down the list of possible into an immutable set and require changing the structure of your already existing content. What exists instead are microformats (http://microformats.org/wiki/microformats2), a bunch of classes you sprinkle in your current HTML to "augment" it.

[+] Lutger|1 year ago|reply

Everyone is optimizing for their own local use-case. Even open-source. Standards get adopted sometimes, but only if they solve a specific problem.

There is an additional cost to making or using ontologies, making them available and publishing open data on the semantic web. The cost is quite high, the returns aren't immediate, obvious or guaranteed at all.

The vision of the semantic web is still valid. The incentives to get there are just not in place.

[+] druskacik|1 year ago|reply

There's a project [0] that parses Commoncrawl data for various schemas, it contains some interesting datasets.

[0] http://webdatacommons.org/

[+] est|1 year ago|reply

I've playing with RSS feeds recently, suddently it occured to me, XML can be transformed into anything with XSL, for static hosting personal blogs, I can save articles into the feeds directly, then serve frontend single-page application with some static XSLT+js. This is content-presentation separation at best.

Is JSON-LD just reinventation of this?

[+] ttepasse|1 year ago|reply

Back in the optimistic 2000s there was the brief idea of GRDDL – using XSLT stylesheets and XPath selectors for extracting stuff from HTML, e.g. microformats, HTML meta, FOAF, etc, and then transforming it into RDF or other things:

https://www.w3.org/TR/grddl/

[+] martin_a|1 year ago|reply

That is exactly the thought behind SGML/XML and its derivatives. XSL is kind of clumsy but very powerful and the most direct way to transform documents.

JSON-LD to me looks more like trying to glue different documents together, its not about the transformation itself.

[+] rakoo|1 year ago|reply

> This is content-presentation separation at best.

The idea is the best, but arguably the implementation is lacking.

> Is JSON-LD just reinventation of this?

Yup. It's "RDF/XML but we don't like XML"

259 comments