xi's comments

xi | 4 years ago | on: PRQL – A proposal for a better SQL

Maybe you'd like to check FunSQL.jl, my library for compositional construction of SQL queries. It also follows algebraic approach and covers many analytical features of SQL including aggregates/window functions, recursive queries and correlated subqueries/lateral joins. One thing where it differs from dlpyr and similar packages is how it separates aggregation from grouping (by modeling GROUP BY with a universal aggregate function).

xi | 4 years ago | on: SQL: The difference between WHERE and HAVING

This is pretty close to how my Julia library [0] for composable construction of SQL queries works:

    From(foo) |>
    Where(Get.value .< 10) |>
    Join(From(bar) |> As(:bar), on = Get.bar.id .== Get.bar_id) |>
    Where(Get.bar.other_value .> 20) |>
    Select(Get.group, Get.value) |>
    Group(Get.group) |>
    Where(Agg.sum(Get.value) .> 100) |>
    Order(Get.group)
There is no HAVING and the you can use any tabular operators in any order. Aggregates are also separated from grouping and can be used in any context after Group is applied.

[0] https://github.com/MechanicalRabbit/FunSQL.jl

xi | 11 years ago | on: War in Europe is not a hysterical idea

A Russian native speaker does not equal a Russian. About 2/3 of the population of Donetsk region identify themselves as ethnic Ukrainians. More than half of those name Russian as their native language.

xi | 11 years ago | on: Live map of Russia advancing into Ukraine

Crimea was occupied by Russian forces in February. Military hostilities in mainland Ukraine started in April when Sloviansk has been captured by Russian ex-FSB officer Igor Girkin and his gang of Russian ultranationalists.

xi | 11 years ago | on: Russia enacts 'draconian' law for bloggers and online media

Which means that there is an "excess of 1.5 millions of people" in Donetsk Region, that "[people of] Donetsk Region mustn't be undestood [by the people from the rest of Ukraine], and Donetsk Region [and it's people] must be used as a resource instead" and "I don't know how to solve that problem [to remove excessive civilians], but the main thing is that some people must be physically eliminated".

Wow, this is truly creative editing. None of your insertions are implied from the context and the quotes you picked up are several minutes apart. In particular, he talks about 1.5 millions of people lacking meaningful job prospects as one of the causes of the unrest (which is true). A few minutes later, when he talks about about killing people, nowhere he implies millions of civilians, in fact, it's obvious he means armed militants.

xi | 11 years ago | on: Russia enacts 'draconian' law for bloggers and online media

Maybe the TV host didn't challenge him because his comments are already so backward and idiotic as to not need further emphasis.

I'll give you a better explanation: the TV host didn't challenge him because it never happened. The "expert" (another journalist, in fact) never suggested to "physically eliminate about 1.5 million of civilians of Donetsk and Luhansk regions that are not able to fit in Ukrainian Nation" or anything close to it.

xi | 12 years ago | on: How ‘Bacon and Eggs’ Became the American Breakfast

Orange juice?

...The Florida industry’s aggressive marketing of oranges and orange juice is a key feature of orange juice history, as it slowly developed demand for the product. In 1907 oranges became the first perishable fruit “ever” to be advertised. As crops expanded quickly, marketing became crucial to avoid overproduction. The growth of farmer cooperatives came largely out of a need to market the products. The Florida Citrus Exchange was organized in 1910 to market fresh citrus and also to do research on processing citrus. It created advertising programs and “built national and international sales organizations.”

http://shkrobius.livejournal.com/312073.html

xi | 13 years ago | on: Things you didn't know about Python

Armin's implementation of `cached_property` is not entirely correct. Well, it works, but the branch where `value` is not `missing` is never executed: the object's `__dict__` takes precedence over the descriptor as long as the descriptor does not define `__set__` method.

Here is an implementation of `cached_property` I use:

    class cached_property(object):

        def __init__(self, fget):
            self.fget = fget
            self.__name__ = fget.__name__
            self.__module__ = fget.__module__
            self.__doc__ = fget.__doc__

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value = self.fget(obj)
            # For a non-data descriptor (`__set__` is not defined),
            # `__dict__` takes precedence.
            obj.__dict__[self.__name__] = value
            return value

xi | 13 years ago | on: Stanford Biologist and Computer Scientist Discover the 'Anternet'

You are right, now that I read the original article, it appears the particular ant genus they were studying does not use pheromone trails. Since this ant species gather seeds which are scattered by wind and could be brought back by a single ant, they don't need to build paths to the food source. Instead a returning ant interacts with other ants which are ready to leave and the rate at which ants leave the nest grows with the rate at which ants return back (so a leaving ant is a data packet, a returning ant is an ACK packet).

The paper: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fj...

xi | 13 years ago | on: Stanford Biologist and Computer Scientist Discover the 'Anternet'

AFAIK, it's completely decentralized. When a scout ant finds food, it returns back to the nest leaving a strong trail of pheromone. Ants tend to follow a pheromone trail and reinforce it on their return if they find food, so when a good source of food is found, a strong pathway is quickly established by a growing number of ants following the same route.

An ant may randomly leave a pheromone trail, which ensures that other, perhaps more attractive food sources are not missed. In general, the stronger the signal, the faster ants move with the less probability of diverging from the path.

This strategy ensures optimal area exploitation under varying conditions. For instance, if the food is concentrated in one or a few locations, you'll see a single column between the nest and the source, but when the food is scattered through the area, the ants would disperse too.

Source: an excellent book "Cells, Embryos and Evolution" where it was used as an example of how complex and seemingly organized and directed behavior could be achieved by a population of identical individuals acting under uniform rules.

xi | 13 years ago | on: OrmHate

I was about to prepare the query without ties and yes it is a lot harder than necessary and, more important IMO, much less readable/intuitive for people than have to maintain the code and that is exactly why different SQL dialects introduced ORDER BY/LIMIT/OFFSET/TOP/RANK etc.

Very good point.

In the book pointed out by Matt, Date explicitly states he's not saying ORDER BY is not useful, just that it doesn't return a relation and thus it's not included in the algebra.

My biggest gripe about ORDER BY, LIMIT and relational model is the fact that while Date and others made some attempts to express these operations in terms of relational algebra, they never (AFAIK) tried to do the opposite: alter the relational query model to naturally support them. It's not hard: just replace sets with sequences or arrays. It will gives you natural ORDER and SLICE operators as well as new aggregates FIRST, LAST, NTH. It solves duplicates without having to introduce bags, gives windowing functions for free and probably better represents how modern RDBMS interpret a query. Another hint why sequences may work better than sets is the fact that regular set operations such as INTERSECT and UNION (as opposed to UNION ALL, which becomes concatenation) are so rarely used in real-world queries.

I'm not even arguing that this is a good approach, but I think it deserves some discussion and it appears they never even thought of a possibility of changing the model treating it not as an instrument, but as a sacred scripture.

xi | 13 years ago | on: OrmHate

Thank you. I accept your answer with the note that you ignored my request to return only the latest post when there are more then one posts with the same number of comments, but it's not hard to adapt you query to satisfy this requirement.

However you can't do the same trick if I ask you to return the top 3 posts with the largest number of comments; or, to make the query more realistic, ask you to return the percentage of comments generated by the top 10% popular (by the number of comments) posts. Which is my point: pure relational algebra as advocated by Date et al in Tutorial D is less expressive than SQL, which probably explains the cold reception it got from the industry.

Edit: now that I think about it, you could do it without ORDER BY/LIMIT, but still it's harder than necessary.

xi | 13 years ago | on: OrmHate

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either).

I would love to see it. Yes, you can do it in SQL, but I'd say it's not easy at all without ORDER BY and LIMIT or windowing functions and I don't know if you can do it in Tutorial D. For the reference, #1 is:

Show the blog post with the largest number of comments. If more than one exist, pick the latest.

The schema is:

    post(id integer, created timestamp)
    comment(id integer, post_id integer)
See CJ Date's excellent discussion on ORDER(BY)

I read it and the book as well, but I wouldn't call it excellent. What I read there is a reluctant admission of failure to incorporate an important operation to his query model. I see no attempt to analyze why it doesn't work or adapt the query model to make ORDER a regular operation.

xi | 13 years ago | on: OrmHate

People mean different things when they say relational model, so to clarify, by relational model I mean a model in which data is represented as sets of N-tuples of fixed structure, and queries are constructed using set-based operations such as filtering and Cartesian product.

Also, when I say path-based access, I mean access that follows predefined links between entities (in SQL, provided by FOREIGN KEY constraints). Those are well supported by object model and ORMs, as opposed to arbitrary joins, which aren't.

You don't need a relational model to represent one-to-many relations, in fact, an object model such as provided by many ORMs could represent them perfectly. In your first example, `figure.vertices` could be a list of vertices associated with a figure, and `vertex.figure` is the figure which owns the vertex. Similarly mutual object or list references could represent other singular or plural relationships. Though I agree it requires referential loops and cannot be expressed well in a pure hierarchical model such as in many novel no-sql databases.

xi | 13 years ago | on: OrmHate

ORDER BY, LIMIT/OFFSET have to do with presentation of data. So, although it's highly desired that a language based on the relational model supports them, they have nothing to do with the model per se.

By fit poorly I mean that you cannot express most real-world business inquiries using pure relational primitives without ORDER BY or LIMIT/OFFSET and that's why I think relational algebra is not usable per se. SQL fixed this problem by adding many non-relational constructs, but but without any sense of consistency or direction.

I also strongly disagree that ORDER BY and LIMIT/OFFSET are presentational operations since I often use them not only for wrapping the outer SELECT, but also within correlated subqueries.

To show some proof, here are a few queries which are hard or impossible to express in relational algebra:

1. Show the blog post with the largest number of comments [^].

2. Show the tags associated with the blog post with the largest number of comments.

3. For each blog category, show the 3 top blog posts by the number of comments.

[^] If more than one exist, pick the latest.

NULL is a completely different beast and this is the only real thing one can consider problematic.

I think NULL is only hard because relational model is a wrong way to look at the data. If you see an entity attribute not as a column of a tuple, but as a function from an entity set to some value domain, the fact that the attribute is nullable just means that the function is not total. There is a well developed mathematical apparatus for partial functions, in which NULL becomes a bottom value injected to the value domain, and tri-valued logic is simply a monotonic extension of regular Boolean operators.

xi | 13 years ago | on: OrmHate

People conflate "relational" with "SQL", because of the historical accident that SQL is the most popular way to query relational data. Then when SQL isn't a good fit for their problem, they think relational is not a good fit for their problem, which is almost certainly not true.

For most practical purposes, SQL is the only way to query relational data. In the absence of alternatives, it's natural to conflate the notions of SQL databases and relational databases. I agree that SQL is a mess, but I don't think an approach based on pure relational primitives would make it better; in fact, I think SQL is a mess specifically because of lack of expressiveness in the pure relational primitives. NULL, ORDER BY, LIMIT/OFFSET, opaque keys, windowing functions, transitive closures, etc fit poorly into the relational model.

The original motivation for relational databases is to have path-independent access to data. This is a really powerful idea.

I agree with both assertions, but in many applications, path-based access makes the total majority of queries, and SQL or relational model provides little means to distinguish them from other equi-joins or arbitrary join conditions. In my opinion, it would be better to start with a navigational path-based database model, and extend it to allow constructing new paths dynamically.

page 1