The post mentions the idea that querying a database D can be understood algebraically as enumerating all morphisms Q -> D, where Q is the "classifying" database of the query, i.e. a minimal database instance that admits a single "generic" result of the query. You can use this to give a neat formulation of Datalog evaluation. A Datalog rule then corresponds a morphism P -> H, where P is the classifying database instance of the rule body and H is the classifying database instance for matches of both body and head. For example, for the the transitivity rule
edge(x, z) :- edge(x, y), edge(y, z).
you'd take for P the database instance containing two rows (a_1, a_2) and (a_2, a_3), and the database instance H contains additionally (a_1, a_3). Now saying that a Database D satisfies this rule means that every morphism P -> D (i.e., every match of the premise of the rule) can be completed to a commuting diagram
P --> D
| ^
| /
⌄ /
Q
where the additional map is the arrow Q -> D, which corresponds to a match of both body and head.
This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal reflection construction/the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is also well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.
Thank you @xlinux and @mbid. Very interesting and not something I knew much about before.
I had a look at eglog and egglog and if I'm understanding things correctly then one possible use case is type inference and optimization. I'm particular I looked at the example in [1].
I'm thinking that this could be useful in the PRQL [2] compiler, in particular for:
a) inference of type restrictions on input relations and resultant output relation types,
b) optimization of resultant SQL queries.
Would you be able to comment on whether that's correct?
Any links to related examples, papers, or work would be appreciated. Thanks!
I actually started working on Eqlog because I wanted to use it to implement a type checker. You might want to skim the posts in my series on implementing a Hindley-Milner type system using Eqlog, starting here [1]. The meat is in posts 3 - 5.
The type checker of Eqlog is mostly implement in Eqlog itself [2]. The general idea is that your parser populates a Database with syntax nodes, which are represented as `...Node` types in the Eqlog program at [2], and then you propagate type information with Datalog/Eqlog evaluation. Afterwards, you check whether the Database contains certain patterns that you want to rule out, e.g. a variable that doesn't have a type [3].
There are still some unsolved problems if you're interested in writing the whole type checker in Datalog. For example, variable lookup requires quadratic memory when implemented in Datalog. I mention this and a possible solution at [4]. However, Datalog as is can probably still be useful for some subtasks during type checking. For example, the Rust compiler uses Datalog in some parts of the type checker I believe. Reach out via e.g. github to mbid@ if you'd like to discuss in more detail.
Regarding optimization you probably want to talk with somebody working with egglog, they have a dedicated Zulip [5]. I'd imagine that for prql you want to encode the algebraic rules of pipeline transformations, e.g. associativity of filter over append. Given the query AST, eqlog or egglog would give you all equivalent ways to write the query according to your rules. You'd then select the representation you estimate to be the most performant based on a score you assign to (sub)expression.
Very interesting perspective I hadn't heard before on datalog, thanks. How far does it go - can you interpret extensions of datalog (say negation or constrained existentials) in a nice categorical way, for instance? I've given this very little thought but I imagine you'd have issues with uniqueness of these "minimal" database instances, and I'm not sure what that means for these lifting properties.
(if my question even makes sense, pardon the ignorance)
If you're interested in the details, you might want to have a look at papers [1] or [2].
You can add existentials in this framework, which basically means that the lifting problems mentioned above don't need to have unique solutions. But as you say, then the "minimal" databases aren't determined uniquely up to isomorphism anymore. So the result of Datalog evaluation now depends on the order in which you apply rules.
If I recall correctly, then [3] discusses a logic corresponding to accessible categories (Datalog + equality corresponds to locally presentable categories) which includes the the theory of fields. The theory of fields involves the negation 0 != 1, so perhaps that might give you a nicer way to incorporate negations without stratification.
For anyone curious: the performance difference between Clang and GCC on the example C solution for verbal arithmetic comes down to Clang's auto-vectorisation (deducing SIMD) whilst GCC here sticks with scalar, which is why the counter brings Clang closer in line to GCC (https://godbolt.org/z/xfdxGvMYP), and it's actually a pretty nice example of auto-vectorisation (and its limitations) in action, which is a fun tangent from this article (given its relevance to high-performance SMT/SAT solving for CSP)
When SQL can't internally optimise a query into a more efficient constraint problem, unrolling joins is the key. This once MSSQL hacker got to the point of optimising queries with large amounts of joins or CTEs to just populating a single table's columns with one query per one to a few columns at a time (two minute locking queries down to about two seconds.) After that, I started using SQL to generate SQL and run that for really curly requirements. That gives you the ability to write queries that can search for a particular value in any column in any table, or find changes in the past 5 minutes in any column in any table within a fairly quick timeframe. And that's great for debugging applications that interface with the database or identifying rogue table changes. Without needing a transaction log. Programmer's paradise :)
Someone had an issue because SQLite failed to optimize the following query
select * from t where x = 'x' or '' = 'x'
Someone said that SQLite could not optimize out the "or '' = 'x'" because it would be too expensive to compute. Which is obviously true only for huge queries on tiny datasets.
Well... there's your problem. SQLite is not a general-purpose RDBMS, it is marketed as a replacement for "fopen()", a purpose for which it excels.
A similar product is the Microsoft Jet database engine, used in products such as Microsoft Exchange and Active Directory. Queries have to be more-or-less manually optimised by the developer, but they run faster and more consistently than they would with a general-purpose query engine designed for ad-hoc queries.
Optimizing out static expressions can be done in linear time at best. So if the number of clauses in WHERE is huge and the size of the underlying table is tiny (such as in the examples shown in the article we are commenting on), it will be better not to run the optimization.
But of course, in normal life, outside of the world of people having fun with Homomorphisms, queries are much smaller than databases.
True, but that doesn't mean doing additional work during the parse is free. Optimizing out static expressions will take additional time, and in general that additional time will be linear in the query size.
My argument is that, on average, it will more than pay for itself.
The only losing case, if there are any measurable ones, is where you have long queries and short data. I'd call that a case of "doing it wrong". Wrong tool for the job.
My guess is that the expense can be tricky to calculate since the additional optimization prior to executing the query may take longer than if the query was just able to run (depending on the dataset, of course). I wonder if it's too expensive to calculate a heuristic as well, so it just allows it to execute.
The post mentions the idea that querying a database D can be understood algebraically as enumerating all morphisms Q -> D, where Q is the "classifying" database of the query, i.e. a minimal database instance that admits a single "generic" result of the query. You can use this to give a neat formulation of Datalog evaluation. A Datalog rule then corresponds a morphism P -> H, where P is the classifying database instance of the rule body and H is the classifying database instance for matches of both body and head. For example, for the the transitivity rule
you'd take for P the database instance containing two rows (a_1, a_2) and (a_2, a_3), and the database instance H contains additionally (a_1, a_3). Now saying that a Database D satisfies this rule means that every morphism P -> D (i.e., every match of the premise of the rule) can be completed to a commuting diagram where the additional map is the arrow Q -> D, which corresponds to a match of both body and head.This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal reflection construction/the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is also well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.
[1] https://github.com/eqlog/eqlog [2] https://github.com/egraphs-good/egglog
Thank you @xlinux and @mbid. Very interesting and not something I knew much about before.
I had a look at eglog and egglog and if I'm understanding things correctly then one possible use case is type inference and optimization. I'm particular I looked at the example in [1].
I'm thinking that this could be useful in the PRQL [2] compiler, in particular for: a) inference of type restrictions on input relations and resultant output relation types, b) optimization of resultant SQL queries.
Would you be able to comment on whether that's correct?
Any links to related examples, papers, or work would be appreciated. Thanks!
1: https://egglog-python.readthedocs.io/latest/tutorials/sklear...
2: https://www.prql-lang.org/
I actually started working on Eqlog because I wanted to use it to implement a type checker. You might want to skim the posts in my series on implementing a Hindley-Milner type system using Eqlog, starting here [1]. The meat is in posts 3 - 5.
The type checker of Eqlog is mostly implement in Eqlog itself [2]. The general idea is that your parser populates a Database with syntax nodes, which are represented as `...Node` types in the Eqlog program at [2], and then you propagate type information with Datalog/Eqlog evaluation. Afterwards, you check whether the Database contains certain patterns that you want to rule out, e.g. a variable that doesn't have a type [3].
There are still some unsolved problems if you're interested in writing the whole type checker in Datalog. For example, variable lookup requires quadratic memory when implemented in Datalog. I mention this and a possible solution at [4]. However, Datalog as is can probably still be useful for some subtasks during type checking. For example, the Rust compiler uses Datalog in some parts of the type checker I believe. Reach out via e.g. github to mbid@ if you'd like to discuss in more detail.
Regarding optimization you probably want to talk with somebody working with egglog, they have a dedicated Zulip [5]. I'd imagine that for prql you want to encode the algebraic rules of pipeline transformations, e.g. associativity of filter over append. Given the query AST, eqlog or egglog would give you all equivalent ways to write the query according to your rules. You'd then select the representation you estimate to be the most performant based on a score you assign to (sub)expression.
[1] https://www.mbid.me/posts/type-checking-with-eqlog-parsing/
[2] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...
[3] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...
[4] https://www.mbid.me/posts/dependent-types-for-datalog/#morph...
[5] https://egraphs.zulipchat.com
Thank you. Will try to get into this on the weekend. I'll reach out once I can ask a more informed question.
Very interesting perspective I hadn't heard before on datalog, thanks. How far does it go - can you interpret extensions of datalog (say negation or constrained existentials) in a nice categorical way, for instance? I've given this very little thought but I imagine you'd have issues with uniqueness of these "minimal" database instances, and I'm not sure what that means for these lifting properties.
(if my question even makes sense, pardon the ignorance)
If you're interested in the details, you might want to have a look at papers [1] or [2].
You can add existentials in this framework, which basically means that the lifting problems mentioned above don't need to have unique solutions. But as you say, then the "minimal" databases aren't determined uniquely up to isomorphism anymore. So the result of Datalog evaluation now depends on the order in which you apply rules.
If I recall correctly, then [3] discusses a logic corresponding to accessible categories (Datalog + equality corresponds to locally presentable categories) which includes the the theory of fields. The theory of fields involves the negation 0 != 1, so perhaps that might give you a nicer way to incorporate negations without stratification.
[1] https://www.mbid.me/eqlog-semantics/
[2] https://arxiv.org/abs/2205.02425
[3] Locally presentable and accessible categories, https://www.cambridge.org/core/books/locally-presentable-and...
Thanks for the references, those papers looks great! Will dig into them this evening =)
For anyone curious: the performance difference between Clang and GCC on the example C solution for verbal arithmetic comes down to Clang's auto-vectorisation (deducing SIMD) whilst GCC here sticks with scalar, which is why the counter brings Clang closer in line to GCC (https://godbolt.org/z/xfdxGvMYP), and it's actually a pretty nice example of auto-vectorisation (and its limitations) in action, which is a fun tangent from this article (given its relevance to high-performance SMT/SAT solving for CSP)
When SQL can't internally optimise a query into a more efficient constraint problem, unrolling joins is the key. This once MSSQL hacker got to the point of optimising queries with large amounts of joins or CTEs to just populating a single table's columns with one query per one to a few columns at a time (two minute locking queries down to about two seconds.) After that, I started using SQL to generate SQL and run that for really curly requirements. That gives you the ability to write queries that can search for a particular value in any column in any table, or find changes in the past 5 minutes in any column in any table within a fairly quick timeframe. And that's great for debugging applications that interface with the database or identifying rogue table changes. Without needing a transaction log. Programmer's paradise :)
The topic of huge queries on tiny databases makes me think of this recent discussion on the SQLite forum: https://sqlite.org/forum/forumpost/0d18320369
Someone had an issue because SQLite failed to optimize the following query
Someone said that SQLite could not optimize out the "or '' = 'x'" because it would be too expensive to compute. Which is obviously true only for huge queries on tiny datasets.> SQLite
Well... there's your problem. SQLite is not a general-purpose RDBMS, it is marketed as a replacement for "fopen()", a purpose for which it excels.
A similar product is the Microsoft Jet database engine, used in products such as Microsoft Exchange and Active Directory. Queries have to be more-or-less manually optimised by the developer, but they run faster and more consistently than they would with a general-purpose query engine designed for ad-hoc queries.
I hate Jet with a vengeance
It's not obviously true at all. Optimizing out `'' = 'x'` can be done for a fixed cost regardless of record count.
Optimizing out static expressions can be done in linear time at best. So if the number of clauses in WHERE is huge and the size of the underlying table is tiny (such as in the examples shown in the article we are commenting on), it will be better not to run the optimization.
But of course, in normal life, outside of the world of people having fun with Homomorphisms, queries are much smaller than databases.
Parsing the expression in the first place is already linear time.
True, but that doesn't mean doing additional work during the parse is free. Optimizing out static expressions will take additional time, and in general that additional time will be linear in the query size.
My argument is that, on average, it will more than pay for itself.
The only losing case, if there are any measurable ones, is where you have long queries and short data. I'd call that a case of "doing it wrong". Wrong tool for the job.
Why would it be too expensive to optimize out static subexpressions?
My guess is that the expense can be tricky to calculate since the additional optimization prior to executing the query may take longer than if the query was just able to run (depending on the dataset, of course). I wonder if it's too expensive to calculate a heuristic as well, so it just allows it to execute.
Just a guess.