top | item 45099418

CauseNet: Towards a causality graph extracted from the web

231 points| geetee | 6 months ago |causenet.org | reply

115 comments

[+] rwmj|6 months ago|reply
Isn't this like Cyc? There have been a couple of interesting articles about that on HN:

https://news.ycombinator.com/item?id=43625474 "Obituary for Cyc"

https://news.ycombinator.com/item?id=40069298 "Cyc: History's Forgotten AI Project"

[+] HarHarVeryFunny|6 months ago|reply
Seems like a subset of CYC - attempting to gather causal data rather than declarative data in general.

It's a bit odd that their paper doesn't even mention CYC once.

[+] TomasBM|6 months ago|reply
Cyc is hardly [1] mentioned in modern work under the knowledge representation and reasoning umbrella, because most [2] of it was/is unavailable or unknown to most researchers. It's hard to build on something that's primarily marketing material.

[1] I could be wrong, but even those that mention Cyc use it only as a historical example of early work in KRR / symbolic AI. [2] OpenCyc being the small subset which is available, tho I haven't met anyone who worked with it.

[+] pavlov|6 months ago|reply
The sample set contains:

    {
        "causal_relation": {
            "cause": {
                "concept": "boom"
            },
            "effect": {
                "concept": "bust"
            }
        }
    }
It's practically a hedge-fund-in-a-box.
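For what it's worth, records in that shape fold into a plain adjacency map with a few lines of Python (the inline sample string stands in for one line of the real dataset):

```python
import json
from collections import defaultdict

# One record, matching the sample above (inlined here for illustration).
sample = '{"causal_relation": {"cause": {"concept": "boom"}, "effect": {"concept": "bust"}}}'

def add_relation(graph, record):
    """Insert one claimed cause -> effect edge into an adjacency map."""
    rel = record["causal_relation"]
    graph[rel["cause"]["concept"]].add(rel["effect"]["concept"])

graph = defaultdict(set)
add_relation(graph, json.loads(sample))
print(graph["boom"])  # {'bust'}
```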
[+] kolektiv|6 months ago|reply
Plus, regardless of what you might think of how valid that connection is, what they're actually collecting, absent any kind of mechanism, is a set of all apparent correlations...
[+] kruffalon|6 months ago|reply
I read it as "casual" rather than "causal" and got very disappointed while reading the article!

An inventory of casual knowledge would be really fun, although it's hard to think what it would consist of now that I think about it...

There is this concept of "hidden knowledge": all the things you know at work that no one really thinks of as knowledge, so it's hard to pass them on to newcomers.

But that does sound different from "casual knowledge", and so does "trivia".

Oh well!

[+] eloeffler|6 months ago|reply
I did, too! And it reminded me of a project idea I had a while ago:

A time traveller's wiki that collects casual knowledge for different times (and different places).

Such as: "Buying a train ticket in Paris in 1972".

But it was a shower thought and it's pretty hard to imagine how this knowledge should be collected and especially presented.

In a way, Wikipedia is already doing this by keeping records of articles as they change over the years :)

The article about train tickets wasn't so good as an example but "computer monitor" from 2004 is kind of fun to read :)

Unfortunately, "casual knowledge" is often omitted when writing informative articles. In this example, there is no mention that power buttons are often located somewhere on the back of the monitor, which was good to know in 2004. Also, some monitors draw power from the computer, so they won't power up before the computer does. And speaking of that: you may want to turn off your computer after shutdown!

Edit: This would probably be useful for novelists and filmmakers (in addition to the casual time traveller)

[+] treetalker|6 months ago|reply
Could you be referring to what is known as tacit knowledge?
[+] aleatorianator|6 months ago|reply
it's as simple as precisely describing "common sense"
[+] tgv|6 months ago|reply
This makes little sense to me. Ontologies and all that have been tried and have always been found to be too brittle. Take the examples from the front page (which I expect to be among the best in their set): human_activity => climate_change. Those are such broad concepts that it's practically useless. Or disease => death. There's no nuance at all. There isn't even a definition of what "disease" is, let alone a way to express that myxomatosis is lethal only for European rabbits, not for humans or goldfish.
[+] jiggawatts|6 months ago|reply
Even more importantly, it's not even a simple probability of death, or a fraction of a cause, or any simple one-dimensional aspect. Even if you can simplify things down to an "arrow", the label isn't a scalar number. At a bare minimum, it's a vector, just like embeddings in LLMs are!

Even more importantly, the endpoints of each such causative arrow are also complex, fuzzy things, and are best represented as vectors. I.e.: diseases aren't just simple labels like "Influenza". There's thousands of ever-changing variants of just the Flu out there!

A proper representation of a "disease" would be a vector also, which would likely have interesting correlations with the specific genome of the causative agent. [1]

Next thing is that you want to consider the "vector product" between the disease and the thing it infected to cater for susceptibility, previous immunity, etc...

A hop, skip, and a small step and you have... Transformers, as seen in large language models. This is why they work so well, because they encode the complex nuances of reality in a high-dimensional probabilistic causal framework that they can use to process information, answer questions, etc...

Trying to manually encode a modern LLM's embeddings and weights (about a terabyte!) is futile beyond belief. But that's what it would take to make a useful "classical logic" model that could have practical applications.

Notably, expert systems, which use this kind of approach, were worked on for decades and were almost total failures in the wider market because they were mostly useless.

[1] Not all diseases are caused by biological agents! That's a whole other rabbit hole to go down.

[+] __alexs|6 months ago|reply
Given that we've tried to develop such ontologies constantly for thousands of years now, what do you think the cause of such hopeless optimism might be? If only we had a database of causal relationships to consult...
[+] dr_dshiv|6 months ago|reply
Democritus (b 460BCE) said, “I would rather discover one cause than gain the kingdom of Persia,” which suggests that finding true causes is rather difficult.
[+] DrScientist|6 months ago|reply
I totally agree that, in the past, years of hammering out an ontology for a particular area just resulted in a common understanding between those who wrote the ontology and a large gulf between them and the people they want to use it (everyone else).

What's perhaps different is that the machine, via LLM's, can also have an 'opinion' on meaning or correctness.

Coming full circle, I wonder what would happen if you got LLMs to define the ontology....

[+] notrealyme123|6 months ago|reply
Koller and Friedman write in "Probabilistic Graphical Models" about the "clarity test": state variables should be unambiguous to an all-seeing observer.

States like "human_activity" are not objectively measurable.

Granted, PGMs and causal models are not the same, but this way of thinking about state variables is an incredibly good filter.

[+] tomaskafka|6 months ago|reply
But “disease => death” + AI => surely at least few billion in VC funding.
[+] koliber|6 months ago|reply
Exactly. In some cases disease causes death. In others it causes immunity which in turn causes “good health” and postpones death.
[+] tossandthrow|6 months ago|reply
Ontology, not ontologies, has been tried.

We have quite a good understanding that a system cannot be both sound and complete; regardless, people went straight in to make a single model of the world.

[+] nurettin|6 months ago|reply
I have some faith in this process. With enough facts, you get contradictions. Weighing contradicting vectors is a way of making decisions. So overall collecting a bunch of weakly connected facts might actually be useful. I'd like to see that in action.
[+] asplake|6 months ago|reply
Agreed. About the strongest we can hope for are causal mechanisms, and most of those will be at most hypotheses and/or partial explanations that only apply under certain conditions.

Honestly, I don't understand how these so-called ontologies have persisted. Who is investing in this space, and why?

[+] CuriouslyC|6 months ago|reply
It's pretty easy to outline a high-level ontology and let LLMs annotate/link it into something pretty useful; you can even have a benchmark suite that uses that ontology, with an LLM as judge, to progressively optimize it.
[+] SilverElfin|6 months ago|reply
What is an ontology exactly? I see Palantir talking about it all the time and it just sounds like vague marketing.
[+] vintermann|6 months ago|reply
As I understand it, this is a dataset of claimed causation. It should contain vaccines->autism, not because it's true, but because someone, in public, claimed that it was.

So, by design, it's pretty useless for finding new, true causes. But maybe it's useful for something else, such as teaching a model what a causal claim is in a deeper sense? Or mapping out causal claims which are related somehow? Or conflicting? Either way, it's about humans, not about ontological truth.
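Mapping out related or conflicting claims could start as simply as spotting concept pairs that are claimed to cause each other; a minimal Python sketch (the claims list is invented for illustration):

```python
def mutual_claims(edges):
    """Find concept pairs claimed to cause each other -- a crude
    proxy for conflicting claims or claimed feedback loops."""
    edge_set = set(edges)
    return sorted({tuple(sorted((a, b)))
                   for (a, b) in edge_set if (b, a) in edge_set})

# Invented claims, standing in for extracted cause -> effect pairs.
claims = [("stress", "insomnia"), ("insomnia", "stress"), ("smoking", "cancer")]
print(mutual_claims(claims))  # [('insomnia', 'stress')]
```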

[+] TomasBM|6 months ago|reply
I'm actively working with ontologies (disclaimer: as a researcher), and yours is the top comment, so I'll try to make some counterclaims here. No relation to this work tho.

> Ontologies and all that have been tried and have always been found to be too brittle.

I'd invite you to look at ontologies as nothing more than representations of things we know in some text-based format. If you've ever written an if statement, used OOP, trained a decision tree, or sketched an ER diagram, you've also represented known things in a particular text-based format.

We probably can agree that all these things are ubiquitous and provide value. It's just that those representations are not serialized as OWL/RDF, claim less about being accurate models of real-world things, and are often coupled with other things (i.e., functions).

This may seem reductionist in the sense of "we're all made of atoms", but I think it's important to understand why ontologies as a concept stick: they provide atomic components for expressing any knowledge in a dedicated place, and reasoning about it. Maybe the serializations, engines, results or creators suck, or maybe codebase + database is enough for most needs, but it's hard to not see the value of having some deterministic knowledge about a domain.

If you take _ontology_ to mean OWL/RDF, this paper wouldn't qualify, so I'm assuming you took the broader meaning (i.e., _semantic triples_).

> Take the examples from the front page (which I expect to be among the best in their set)

Most scientific work will be in-progress, not WordNet-level (which also needs a lot of funding to get there). You ideally want to show a very simple example, and then provide representative examples that signal the level of quality that other contributors/scientists can expect.

Here, they're explicit about creating triples of whatever causal statements they found on Wikipedia. I wouldn't expect it to be immediately useful to me, unless I dedicate time to prune and iron out things of interest.

> human_activity => climate_change. Those are such broad concepts that it's practically useless.

Disagree. If you had one metric that aggregated different measurements of climate change-inducing human activity, and one metric that did the same for climate change, you could create some predictions about N-order effects from climate change. Statistical analysis anyway requires you to make an assumption about the causal relationship behind what you're investigating.

So, if this is the level of detail you need, this helps you potentially find new hypotheses just based on Nth-order causal relations in Wikipedia text. It's also valuable to show where there is not enough detail.
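Finding Nth-order causal relations over such triples is a plain breadth-first traversal; a minimal Python sketch over a toy graph (the toy edges echo the front-page examples):

```python
from collections import deque

def nth_order_effects(graph, cause, max_depth):
    """BFS over claimed cause -> effect edges, collecting every effect
    reachable within max_depth hops of the starting cause."""
    seen, frontier, out = {cause}, deque([(cause, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for effect in graph.get(node, ()):
            if effect not in seen:
                seen.add(effect)
                out.add(effect)
                frontier.append((effect, depth + 1))
    return out

toy = {"human_activity": {"climate_change"},
       "climate_change": {"sea_level_rise"}}
print(nth_order_effects(toy, "human_activity", 2))
```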

> Or disease => death. There's no nuance at all.

Aside from my point above - haven't looked at the source data, but I doubt it stops at that level. But even if it does, it's 11 million things with provenance you can play with or add detail to.

Or you can also show that your method or choice of source data gets more conceptual/causal detail out of Wikipedia, or that their approach isn't replicable, or that they did a bad job, etc. These are all very useful contributions.

[+] mark_l_watson|6 months ago|reply
This might be of at least some value for augmenting LLM training? I spent a lot of time in the 1980s and early 1990s using symbolic AI techniques: conceptual dependency, NLP, expert systems, etc. While two large and well-funded expert system projects I worked on (paid for by DARPA and PacBell) worked well, symbolic AI was mostly brittle and required what seemed like an infinite amount of human labor.

LLMs are such a huge improvement that the only real practical use I see for projects like CauseNet, the defunct OpenCyc project, etc., might be as a little extra training data.

[+] TofuLover|6 months ago|reply
This reminds me of an article I read that was posted on HN only a few days ago: Uncertain<T>[1]. I think that a causality graph like this necessarily needs a concept of uncertainty to preserve nuance. I don't know whether this would be practical in terms of compute, but I'd think combining traditional NLP techniques with LLM analysis may make it so?

[1] https://github.com/mattt/Uncertain
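One naive way to carry uncertainty along a causal chain is to score each claimed edge with a confidence and multiply along a path. This assumes the edges are independent, which is rarely true, and the numbers here are invented:

```python
def path_confidence(edges, path):
    """Multiply per-edge confidences along a causal path.
    Naive: treats each edge's confidence as independent."""
    conf = 1.0
    for a, b in zip(path, path[1:]):
        conf *= edges[(a, b)]
    return conf

# Invented confidences on claimed cause -> effect edges.
edges = {("disease", "inflammation"): 0.9,
         ("inflammation", "pain"): 0.8}
print(round(path_confidence(edges, ["disease", "inflammation", "pain"]), 2))  # 0.72
```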

[+] bbor|6 months ago|reply
> CauseNet aims at creating a causal knowledge base that comprises all human causal knowledge and to separate it from mere causal beliefs

Pretty bold to use a picture of philosophers as your splash page and then make a casual claim like this. To say the least, this is an impossible task!

The tech looks cool and I'm excited to see how I might be able to work it into my stuff and/or contribute. But I'd encourage the authors to rein in the rhetoric...

[+] johnecheck|6 months ago|reply
Indeed. I can't take an epistemology project seriously if it has no humility.

Building a perfectly accurate model of the world isn't possible. We need to create tools that make it easier for regular people to build more accurate models, not delude ourselves with dreams of perfection.

[+] larodi|6 months ago|reply
Why not use PROLOG then? It is the essence of cause and effect in programming, and it can also expound syllogisms.
[+] orobus|6 months ago|reply
The conditional relation represented in prolog, and in any deductive system, is material implication (~PvQ), not causation. You can encode causal relationships with material implication but you’re still going to need to discover those causal relationships in the world somehow.
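The distinction is easy to see in code: material implication is just a truth function, true whenever the antecedent is false, with no notion of mechanism:

```python
def implies(p, q):
    """Material implication: ~P v Q."""
    return (not p) or q

# Vacuous truth: a false antecedent makes the conditional true,
# regardless of any causal connection between P and Q.
assert implies(False, False) and implies(False, True)
assert implies(True, True) and not implies(True, False)
print("material implication says nothing about mechanism")
```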
[+] rhizome|6 months ago|reply
"The map is not the territory" ensures that bias and mistakes are inextricable from the entire AI project. I don't want to get all Jaron Lanier about it, but they're fundamental terms in the vocabulary of simulated intelligence.
[+] circlemaker|6 months ago|reply
This made me think of a much more interesting project. A compendium of information automatically extracted from research articles.

Essentially one totalizing meta-analysis.

E.g., if it reads an article about the relationship between height and various life outcomes in Indonesian men, it would store the average height of Indonesian men, the relationship between that height and each life outcome, the type of relationship (e.g. Pearson's correlation), and the relationship values (r value). In short: the entity, the relationship, the relationship values, and the DOI of the source.

Something like a quantitative Wikipedia.
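Such a record might look like a dataclass along these lines (the field names and example values are purely illustrative, including the placeholder DOI):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedRelation:
    """One quantitative finding mined from a paper.
    Field names are illustrative, not from any real schema."""
    entity_a: str
    entity_b: str
    population: str
    relation_type: str  # e.g. "pearson_r"
    value: float
    doi: str            # provenance pointer back to the source article

rec = ExtractedRelation("height", "income", "Indonesian men",
                        "pearson_r", 0.21, "10.0000/example")
print(rec.relation_type, rec.value)
```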

[+] jack_riminton|6 months ago|reply
Reminds me of the early attempts at hand categorising knowledge for AI
[+] koliber|6 months ago|reply
I wonder how they will quantify causality. Sometimes a particular cause has different, and even opposite, effects.

Alcohol causes anxiety. At the same time it causes relaxation. These effects depend on time frame, and many individual circumstances.

This is a single example but the world is full of them. Codifying causality will involve a certain amount of bias and belief. That does not lead to a better world.

[+] fohara|6 months ago|reply
The associated paper references Judea Pearl's theories on causality, but curiously doesn't mention the DoWhy implementation [0], which seems to have some recognition in the causal inference space.

[0] https://github.com/py-why/dowhy

[+] maweki|6 months ago|reply
It's nice to see more semantic web experiments. I always wanted to do more reasoning with ontologies, etc., and it's such an amazing idea, to reference objects/persons/locations/concepts from the real world with uris and just add labeled arrows between them.

This is such a cool schemaless approach and has so much potential for open data linking, classical reasoning, LLM reasoning. But open data (together with RSS) has been dead for a while as all big companies have become just data hoarders. And frankly, while the concept and the possibilities are so cool, the graph databases are just not that fast and also not fun to program.

[+] thedudeabides5|6 months ago|reply
Semantic web/OWL was always way too heavy to imagine humans using; you could imagine AI doing the heavy lifting here, though.
[+] ivape|6 months ago|reply
I don’t know if it’s inadvertent, but it’s headed toward just becoming an engine for overfitted generalizations. Each causal pair will just emerge based on frequency, which will reinforce itself by preemptively and prematurely classifying all future information.

Unfortunately, frequency is the primary way AI works, but it will never be accurate for causality because causality always has the dynamic that things can happen just “because”. It’s hacked into LLMs via deliberate randomness in next-token prediction.

[+] growingkittens|6 months ago|reply
Organizing all knowledge requires a flexible system of organization (starting with how the categories are organized and accessed, not the data).

Random thoughts about organizing knowledge:

- Categories need fractal structures.

- Categories need to be available as subcategories to other categories as a pattern.

- Words need to be broken down into base concepts and used as patterns.

- Social information and context alter the meaning of words in many cases, so any semantic web without a control system has limited use as an organization tool.

[+] pfdietz|6 months ago|reply
This is difficult, but then I just had someone earnestly inform me that the covid virus doesn't cause covid, so I think there's a need here, if only to have an automated way of identifying idiots.