We adopted ClickHouse ~4 years ago. We COULD have stayed on just Postgres. With a lot of bells, whistles, aggregation, denormalisation, aggressive retention limits and job queues etc. we could have gotten acceptable response times for our interactive dashboard.
But we chose ClickHouse and now we just pump in data with little to no optimization.
We migrated some analytics workloads from postgres to clickhouse last year, it's crazy how fast it is. It feels like alien technology from the future in comparison.
Are those like embedded analytics in the app, or internal BI-type workloads?
I imagine with Postgres there's also an option of using a plugin like Greenplum or something else, which may help to bridge the gap, but probably not to the level of ClickHouse.
There are foreign data wrappers for ClickHouse that still allow Postgres as the single point of consumption, with all the benefits of a ClickHouse deployment.
This is how we consume Langfuse traces!
God, ClickHouse is such great software. If only it were as ergonomic as DuckDB, and management weren't doing some questionable things (deleting references to competitors in GH issues, weird legal letters, etc.)
The CH contributors are really stellar, from multiple companies (Altinity, Tinybird, Cloudflare, ClickHouse)
Tbh, I'm not too worried about CH management. Let them mod the GH if they want to, it's their company.
Contribution is strong and varied enough that I think we're good for the long term.
They have an interesting version that's packaged a bit like DuckDB - you can even "pip install" it: https://github.com/chdb-io/chdb
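For anyone curious, a minimal sketch of what that pip-installed flavour looks like in use (this assumes chdb's chdb.query(sql, output_format) entry point; treat it as illustrative rather than authoritative):

    # In-process ClickHouse engine, DuckDB-style: no server to run.
    import chdb

    # Plain ClickHouse SQL; the second argument is any ClickHouse output format.
    print(chdb.query("SELECT version()", "CSV"))
    print(chdb.query("SELECT count(), sum(number) FROM numbers(1000000)", "Pretty"))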
What is the use case for an embedded terabyte-scale database, by the way?
They don't do static builds AFAICT, which would make it a real competitor to DuckDB.
chDB author here. You are right, we have not made a static libchDB. BTW, I guess you are a golang developer?
Correct! Would love to have the Go package come as a single dependency without having to distribute `.so` files. That's what's stopping me from using `chDB` now instead of DuckDB. Being able to use chDB in a static manner would also help deepen my usage of the Clickhouse server. Right now the Clickhouse side of my project is lagging behind the DuckDB one because of this.
That's great feedback, thank you! I just added your comment to the GH issue: https://github.com/chdb-io/chdb/issues/101#issuecomment-2824...
PS: I work for ClickHouse.
chdb and clickhouse-local are nearly as ergonomic as DuckDB, with all of the features of CH.
duckdb has unfortunately been leaning away from pure oss - the ui they released is entirely hosted on motherduck’s servers (which, while an awesome project, makes me feel like the project will be cannibalized by proprietary extensions).
ClickHouse too. Their sharedmergetree is not open source at all. It makes ClickHouse OSS obsolete design-wise. Shame.
Actually, this is not important, because they need to make a living, and for that they have to make some functionality of ClickHouse proprietary.
You can scale the pure open source project really far. And if you need more, you do have money to pay for it.
People need to stop thinking that open source means leeching for free. Open source is about sharing knowledge and building trust with each other. But we are still living in a business world where we are competing.
It is important to decouple compute and storage.
I think people should be aware that the core storage work is not being done in the open anymore. The practice of letting the open source core grow useless while selling proprietary add-ons is fine. I just have a problem with people calling this open source.
Some prefer using open source. Are they leeches?
Regarding scaling open source projects, did Linux scale?
> the ui they released is entirely hosted on motherduck’s servers
Who is "they" in this sentence? AFAIK the UI was released by MotherDuck, a private company.
Late Materialization, 19 years later.
https://dspace.mit.edu/bitstream/handle/1721.1/34929/MIT-CSA...
I really like Clickhouse. Discovered it recently, and man, it's such a breath of fresh air compared to suboptimal solutions I used for analytics. It's so fast and the CLI is also a joy to work with.
I always dismissed ClickHouse because it's all super low level. Building a reliable system out of it requires a lot of internal knowledge. This is the only DB I know of where you will have to deal with actual files on disk in case of problems.
However, I managed to look past that, and oh my god, it is so fast. It's like the tool is optimized for raw speed, and whatever you do with it is up to you.
Yeah, ClickHouse does feel like adult LEGO to me too: it lets you design your data structures and data storage layout, but doesn't force you to implement everything else. If you work at a large enough scale, that's usually exactly what you want from a system.
Same here. I come from a strong Postgres and Microsoft SQL Server background and I was able to get up to speed with it, ingesting real data from text files, in an afternoon. I was really impressed with the docs as well as the performance of the software.
Having a SQL-like syntax where everything feels like a normal DB helps a lot, I think. Of course, it works very differently behind the scenes, but not having to learn a bunch of new things just to use a new data model is a good approach.
I get why some create new dialects and languages, since that way there is less ambiguity and it's harder to use incorrectly, but I think ClickHouse made the right tradeoffs here.
I remember a few years ago when the view on ClickHouse was that it was somewhat "legacy", "bulky", and used by "the big guys", and there wasn't much discussion or opinion about it in spaces like this. Seems like it's come a long way.
Lots of Google Analytics competitors appeared between 2017 and 2023 for privacy reasons. A lot of them started with plain Postgres or MySQL and then switched to ClickHouse, or simply started with ClickHouse knowing they could scale far better.
At least in terms of capability and reputation, it was already well known by 2021, and certainly not legacy or bulky. On HN, ClickHouse is submitted very often and reaches the front page, whereas when I tried with MySQL multiple times, no one was interested.
Edit: On another note, Umami is finally supporting ClickHouse! [1] Not sure how they're implementing it, because it still requires Postgres, but it should hopefully be a lot more scalable.
[1] https://github.com/umami-software/umami/issues/3227
Perhaps not legacy/bulky, but maybe... enterprisey? I just remember having the same reaction to CH as to hearing Oracle.
Or maybe heavy-duty? Although I remember a lot of people were sceptical of CH simply because it came from Yandex, from Russia. And that was before the war.
Clickhouse earned that reputation. However, it was spun out of Yandex in 2021. That kickstarted a new wave of development and it’s gotten much better.
Ah that must explain a lot of it.
How does it compare to duckdb and/or polars?
This is very much an active space, so the half-life of in depth analyses is limited, but one of the best write ups from about 1.5 years ago is this one: https://bicortex.com/duckdb-vs-clickhouse-performance-compar...
In my understanding, DuckDB doesn't have its own optimised storage that can accept writes (in the sense that ClickHouse does, where its native storage format gives you the best performance), and instead relies on e.g. reading data from Parquet and other formats. That makes sense for an embedded analytics engine on top of existing files, but might be a problem if you wanted to use DuckDB for, e.g., real-time analytics where inserted data needs to be available for querying within a few seconds of being inserted. ClickHouse was designed for the latter use case, but at the cost of being a full-fledged standalone service by design. There are embedded versions of ClickHouse, but they are much bulkier and generally less ergonomic to use (although that's a personal preference).
What's the story with the Debian package for it? It was removed as unmaintained.
This optimization should provide dramatic speed-ups when taking random samples from massive data sets, especially when the wanted columns can contain large values. That's because the basic SQL recipe relies on a LIMIT clause to determine which rows are in the sample (see query below), and this new optimization promises to defer reading the big columns until the LIMIT clause has filtered the data set down to a tiny number of lucky rows.
    SELECT *
    FROM Population
    WHERE weight > 0
    ORDER BY -LN(1.0 - RANDOM()) / weight
    LIMIT 100 -- Sample size.
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)
I checked, and yes - it works: https://pastila.nl/?002a2e01/31807bae7e114ca343577d263be7845...
Thanks! That's a nice 5x improvement. Pretty good for a query that offers only modest opportunity, given that the few columns it asks for are fairly small (`title` being the largest, which isn't that large).
Verified:
    EXPLAIN plan actions = 1
    SELECT *
    FROM amazon.amazon_reviews
    WHERE helpful_votes > 0
    ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
    LIMIT 3
Lazily read columns: review_body, review_headline, verified_purchase, vine, total_votes, marketplace, star_rating, product_category, customer_id, product_title, product_id, product_parent, review_date, review_id
Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the max n for which lazy materialization kicks in for LIMIT n.
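If anyone wants to experiment with that cut-off, it can presumably be raised per query with a SETTINGS clause like any other query-level setting. A hypothetical sketch via chdb (the embedded build mentioned elsewhere in the thread); the table shape is made up, the snippet only illustrates the syntax, and whether lazy materialization actually fires depends on the table engine and the bundled ClickHouse version:

    import chdb

    # Hypothetical query: raise the LIMIT threshold under which lazy
    # materialization is considered (the default is 10, per the comment above).
    sql = """
    SELECT number AS id, repeat('x', 1000) AS payload
    FROM numbers(1000000)
    ORDER BY id DESC
    LIMIT 50
    SETTINGS query_plan_max_limit_for_lazy_materialization = 100
    """
    print(chdb.query(sql, "Pretty"))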
Sorry if this question exposes my naivety, why such a low default limit? What drawback does lazy materialization have that makes it good to have such a low limit?
Do you know any example query where lazy materialization is detrimental to performance?
My understanding is that with higher limit values you may end up doing lots of random I/O (for each granule, the order in which you read it would be much less predictable than when ClickHouse normally reads it sequentially), essentially one I/O operation per row in the LIMIT. So larger default values would only be beneficial in the pathological examples given in the article, but much less so in the "real world".
Awesome! Thanks for checking :-)
The optimization should work well for your sampling query since the ORDER BY and LIMIT operations would happen before materializing the large columns, but the randomization function might force early evaluation - worth benchmarking both approaches.
Unrelated to the new materialization option, this caught my eye:
"this query sorts all 150 million values in the helpful_votes column (which isn’t part of the table’s sort key) and returns the top 3, in just 70 milliseconds cold (with the OS filesystem cache cleared beforehand) and a processing throughput of 2.15 billion rows/s"
I clearly need to update my mental model of what might be a slow query against modern hardware and software. Looks like that's so fast because in a columnar database it only has to load that 150 million value column. I guess sorting 150 million integers in 70ms shouldn't be surprising.
(Also "Peak memory usage: 3.59 MiB" for that? Nice.)
This is a really great article - very clearly explained, good diagrams, I learned a bunch from it.
> I guess sorting 150 million integers in 70ms shouldn't be surprising.
I find sorting 150M integers at all to be surprising. The query asks for finding the top 3 elements and returning those elements, sorted. This can be done trivially by keeping the best three found so far and scanning the list. This should operate at nearly the speed of memory and use effectively zero additional storage. I don’t know whether Clickhouse does this optimization, but I didn’t see it mentioned.
Generically, one can find the kth best of n elements in time O(n):
https://en.m.wikipedia.org/wiki/Selection_algorithm
And one can scan again to find the top k, plus some extra if the kth best wasn’t unique, but that issue is manageable and, I think, adds at most a factor of 2 overhead if one is careful (collect up to k elements that compare equal to the kth best and collect up to k that are better than it). Total complexity is O(n) if you don’t need the result sorted or O(n + k log k) if you do.
If you’re not allowed to mutate the input (which probably applies to Clickhouse-style massive streaming reads), you can collect the top k in a separate data structure, and straightforward implementations are O(n log k). I wouldn’t be surprised if using a fancy heap or taking advantage of the data being integers with smallish numbers of bits does better, but I haven’t tried to find a solution or disprove the existence of one.
I am the author of the optimization of partial sorting and selection in ClickHouse. It uses the Floyd-Rivest algorithm, and we tried a lot of different things back at the time; read [1].
Overall, ClickHouse reads blocks of a fixed size (64k), finds the top elements in each, and then takes the top of the tops until it converges.
[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...
With non-mutable “streaming” input, there is an O(n) algorithm to obtain the unsorted top k with only O(k) extra memory.
The basic idea is to maintain a buffer of size 2k, run mutable unsorted top k on that, drop the smaller half (i.e. the lowest k elements), then stream in the next k elements from the main list. Each iteration takes O(k), but you’re processing k elements at a time, so overall runtime is O(n).
When you’re done, you can of course sort for an additional k*log(k) cost.
> This can be done trivially by keeping the best three found so far and scanning the list.
That doesn't seem to guarantee correctness. If you don't track all of the unique values, at least, you could be throwing away one of the most common values.
The wiki entry seems to be specifically about the smallest, rather than the largest, values.
The max-heap algorithm alluded to above is correct. You fill it with the first k values scanned, then peek at the max element for each subsequent value. If the current value is smaller than the max element, you evict the max element and insert the new element. This streaming top-k algorithm is ubiquitous in both leetcode interviews and applications. (The standard quickselect top-k algorithm is not useful in the streaming context because it requires random access and in-place mutation.)
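For concreteness, a minimal sketch of that streaming idea, mirrored for the top-k largest case the article's query needs (Python's heapq is a min-heap, so the heap holds the k largest values seen so far and its minimum is the eviction candidate):

    import heapq
    from typing import Iterable, List

    def top_k_largest(stream: Iterable[int], k: int) -> List[int]:
        """Single pass, O(n log k) time, O(k) extra space."""
        heap: List[int] = []               # min-heap of the k largest seen so far
        for x in stream:
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:              # beats the smallest of the current top k
                heapq.heapreplace(heap, x)
        return sorted(heap, reverse=True)  # extra O(k log k) only if sorted output is needed

    print(top_k_largest([5, 1, 9, 3, 9, 2, 8], 3))  # [9, 9, 8]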
To be fair to quickselect, I can imagine a lazy data processing framework having a concept of a lazily sorted data column where the actual data has been materialized but it’s not in sorted order yet. Then someone does “LIMIT k” to it, and the framework can go to town with quickselect.
As noted a couple times in this thread, there are all kinds of tradeoffs here, and I can’t imagine quickselect being even close to competitive for k that is small enough to fit in cache. Quickselect will, in general, scan a large input approximately twice. For k = 3, the answer fits in general-purpose registers or even in a single SIMD register, and a single scan with brute force accumulation of the answer will beat quickselect handily and will also beat any sort of log-time heap.
(In general, more advanced and asymptotically better algorithms often lose to simpler brute force algorithms when the parameters in question are smallish.)
Yeah, obviously I wouldn't bother with a heap for k=3. A heap has good compactness but poor locality, so I guess it wouldn't perform well out of (some level of) cache.
So quickselect needs multiple passes, and the heap needs O(n log k) time to find the top k elements of n elements total.
However, you can find the top k elements in O(n) time and O(k) space in a single pass.
One simple way: you keep a buffer of up to 2*k elements. You scan your stream of n items one by one. Whenever your buffer gets full, you pare it back down to k elements with your favourite selection algorithm (like quickselect).
As a minor optimisation, you only add items to your buffer if they improve on the worst element in your buffer (or when you haven't hit k elements in your buffer yet).
As an empirical question, you can also experiment with the size of the buffer. Theoretically any multiple of k will do (even 1.1*k or so), but in practice they give you different constant factors for space and time.
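A rough sketch of that buffered approach (also described a few comments up); heapq.nlargest stands in here for the selection step such as quickselect, so this version is O(n log k) as written but O(n) with a true linear-time selection:

    import heapq
    from typing import Iterable, List

    def top_k_buffered(stream: Iterable[int], k: int) -> List[int]:
        """Single pass, O(k) extra space; keep a buffer of up to 2k candidates."""
        buf: List[int] = []
        for x in stream:
            buf.append(x)
            if len(buf) >= 2 * k:
                buf = heapq.nlargest(k, buf)  # selection step: keep only the best k
        return sorted(heapq.nlargest(k, buf), reverse=True)

    print(top_k_buffered(range(1_000_000), 3))  # [999999, 999998, 999997]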
How do you efficiently track the "worst element" without something like a max-heap? But yeah, this is a fun algorithm. I think I've seen it before but can't place it, do you remember where you came across it?
My failure was misreading it as most common k rather than max k.
Most common k is super-interesting because it can't be solved in one pass in constant space!
https://en.wikipedia.org/wiki/Streaming_algorithm#Frequent_e...
What you are quoting solves a very different problem. It doesn't give you the most common k (in general).
Why is that interesting? Intuitively a worst-case could be a stream of n-1 unique elements out of n with the duplicate at the end, so there is no way around O(n) space. Any element could be the most common so you must keep them all.
The algorithms on the Wikipedia page quoted actually solve a different problem. And they can do that in constant space.
So if someone tells you that one item in the stream is repeated so often that it occurs at least p% of the time (say 10%), then these algorithms can find such an element. But if, e.g., there are multiple elements that occur more than p% of the time, they are not guaranteed to give you the one that occurs most often. Nor are they guaranteed to give you any meaningful output if the assumption is violated and no element occurs at least p% of the time.
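To make that concrete, here is a small sketch of the classic frequent-elements routine (Misra-Gries) in the spirit of what those pages describe: with k-1 counters it keeps every element occurring more than n/k times as a candidate, but the surviving counts are only lower-bound estimates and need a second pass over the data to confirm:

    def misra_gries(stream, k):
        """One pass, O(k) space: candidates for any element occurring > n/k times."""
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                # Decrement every counter; drop the ones that reach zero.
                counters = {e: c - 1 for e, c in counters.items() if c > 1}
        return counters

    print(misra_gries("aababcabcd", 3))  # {'a': 1}: 'a' (4 of 10 occurrences) survives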
Sure, a similar trivial argument applies to the linear-space lower bound for set membership. But these linear lower bounds motivate the search for approximate techniques with sublinear lower bounds (although bloom filters or fingerprint tables are not actually sublinear).
With an equality that returns true/false, this guarantees correctness. If there can be 3 best/biggest/smallest values, this technique works.
What? The algorithm is completely symmetrical with respect to smallest or largest, and fully correct and general. I don't understand the problem with unique values. Could you provide a minimal input demonstrating the issue?
I can't, because I completely misread the wiki article before commenting and have now read it more carefully and realized I was wrong. Specifically, I went in thinking about the top 3 most common values.
Maybe they do have that optimization and that explains the 3.59 MiB peak memory usage for ~600MB of integers.
Let's do a back-of-the-envelope calculation. 150M u32 integers are 600MB. A modern SSD can do 14,000MB/s sequential read [1]. So reading 600MB takes about 600MB / 14,000MB/s = 43ms.
Memory like DDR4 can do 25GB/s [2]. It can go over 600MB in 600MB / 25,000MB/s = 24ms.
L1/L2 can do 1TB/s [3]. There are 32 CPUs, so that's roughly 32TB/s of L1/L2 bandwidth. 600MB can be processed at 32TB/s in 0.018ms. With a 3ms budget, they can go over the 600MB of data 166 times.
Rank selection algorithms like QuickSelect and Floyd-Rivest have O(N) complexity. It's entirely possible to process 600MB in 70ms.
[1] https://www.tomshardware.com/features/ssd-benchmarks-hierarc...
[2] https://www.transcend-info.com/Support/FAQ-292
[3] https://www.intel.com/content/www/us/en/developer/articles/t...
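The same arithmetic, spelled out with the commenter's figures:

    # Back-of-the-envelope timings for scanning 150M u32 values (~600 MB).
    data_mb = 150e6 * 4 / 1e6              # 600 MB
    print(data_mb / 14_000 * 1e3)          # ~43 ms at 14,000 MB/s (SSD sequential read)
    print(data_mb / 25_000 * 1e3)          # ~24 ms at 25 GB/s (DDR4)
    print(data_mb / 32_000_000 * 1e3)      # ~0.019 ms at 32 TB/s (aggregate L1/L2)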
Strong and up to date intuition on "slow vs. fast" queries is an underrated software engineering skill. Reading blogs like this one is worth it just for that alone.
Slow VMs on overprovisioned cloud hosts which cost as much per month as a dedicated box per year have broken a generation of engineers.
You could host so much from your macbook. The average HN startup could be hosted on a $200 mini PC in a closet for the first couple of years, if not more - and I'm talking expensive here, with the extra RAM you want so you don't have to restart every hour when you have a memory leak.
I don't see how that's the root cause. ClickHouse and Snowflake run on your so-called slow VMs on overprovisioned cloud hosts, and they're efficient as hell. It's all about your optimizations.
The real problem is that most engineers don't understand the degree of overprovisioning they do for code that's simple yet doing stupid things, using an inefficient 4th-order language on top of 5 different useless (IMO) abstractions.
No but my developer velocity! We should sacrifice literally everything else in order to enable me!!!! Nothing else matters except for my ease of life!
/s
Not only that, you have a pile of layers that could be advantageous in some situations but are overkill in most.
I've seen Spark clusters being replaced by a single container using less than 1 CPU core and a few hundred MB of RAM.
> so much from your macbook
At least on cloud I can actually have hundreds of GiBs of RAM. If I want this on my Macbook it's even more expensive than my cloud bill.
Strangely, I've found the inverse to be true: many backend technologies are actually quite good with memory management and often require as little as a few GiB of RAM, or even less, to serve production traffic. Often a single IDE consumes more RAM than a production Go binary that serves thousands of requests per second, for example.
You can, but if you need it you’re not searching for a product market fit anymore.
Raw compute wise, you're almost right (almost because real cloud hosts aren't overprovisioned, you get the full CPU/memory/disk reserved for you).
But you actually need more than compute. You might need a database, cache, message broker, scheduler, to send emails, and a million other things you can always DIY with FOSS software, but take time. If you have more money than time, get off the shelf services that provide those with guarantees and maintenance; if not, the DIY route is also great for learning.
My point is all of this can be hosted on a single bare-metal box, and a small one at that! We used to do just that back in the mid-noughties, and computers have only gotten faster. Half of those cloud services are preconfigured FOSS derivatives behind the scenes anyway (probably…)
>Despite the airport drama, I’m still set on that beach holiday, and that means loading my eReader with only the best.
What a nice touch. Technical information and diagrams in this were top notch, but the fact there was also some kind of narrative threaded in really put it over the top for me.
IMHO, if ClickHouse had a native Windows release that does not need WSL or a Linux virtual machine, it would be more popular than DuckDB. I remember MySQL being way more popular than PostgreSQL for years, and one of the reasons was that MySQL had a Windows installer.
I was under the impression that servers and databases generally run on Linux, though.
Windows still runs on 71% of desktops and laptops [1]. In my experience, a good number of applications start life on simple desktops and then graduate to servers if they are successful. I work in the field of analytics. I have a locked-down Windows desktop, and I have been able to try out all the other databases such as MySQL, MariaDB, PostgreSQL and DuckDB because they have Windows installers or portable apps. I haven't been able to try out ClickHouse. This is my experience and YMMV.
[1] https://en.wikipedia.org/wiki/Usage_share_of_operating_syste...
surely you have Docker though?
Is Clickhouse not already more popular than DuckDB?
28k stars on GitHub for DuckDB vs 40k for ClickHouse - pretty close. But, anecdotally, here on HN DuckDB gets mentioned much more often
Has anyone compared ClickHouse and StarRocks[0]? Join performance seems a lot better on StarRocks a few months ago but I'm not sure if that still holds true.
[0] https://www.starrocks.io/
Clickhouse is a masterpiece of modern engineering with absolute attention to performance.
Thought this was Clickhole.com and was waiting for the payoff to the joke
That’s an awesome change. Will that also work for limit offset queries?
It's quite amazing how a db like this shows that all of those row-based dbs are doing something wrong; they can't even approach these speeds with B-tree index structures. I know they like transactions more than ClickHouse does, but it's just amazing to see how fast modern machines are: billions of rows per second.
I'm pretty sure they did not even bother to properly compress the dataset; with some tweaking, it could probably have been much smaller than 30GB. The speed shows that reading the data is slower than decompressing it.
Reminds me of that Cloudflare article where they had a similar idea about encryption being free (slower to read than to decrypt) and found a bug that, when fixed, materialized this behavior.
The compute engine (chdb) is a wonder to use.
> It's quite amazing how a db like this shows that all of those row-based dbs are doing something wrong
They're not "doing something wrong". They are designed differently for different target workloads.
Row-based -> OLTP -> "Fetch the entire records from order table where user_id = XYZ"
Column-based -> OLAP -> "Compute the total amount of orders from the order table grouped by month/year"
Filtering by user id would also be trivially fast.
It’s transactions mostly that make things slow. Like various isolation levels, failures if stale data was read in a transaction etc.
I understand the difference; it's just a shame there's nothing close in read or write rate, even on an index structure that has a copy of the columns.
I'm aware that similar partitioning is available and that it improves write and read rates, but not to these magnitudes.
Some of the “new SQL” hybrid (HTAP, hybrid transaction-analytical processing) databases might be of interest to you. TiDB is the main example off the top of my head.
look at who you’re arguing with ;)
Wonder how well this propagates down to subqueries/CTE's
Maybe I'm too inexperienced in this field but reading the mechanism I think this would be an obvious optimisation. Is it not?
But credit where it's due: obviously ClickHouse is an industry leader.
Obvious solutions are often hard to do right. I bet the code that was needed to pull this off is either very complex or took a long time to write (and test). Or both.
This is a well-known class of optimization and the literature term is “late materialization”. It is a large set of strategies including this one. Late materialization is about as old as column stores themselves.
Can we take the "packing your luggage" analogy (only pack the things we actually use on the trip) and apply that to ClickHouse?
Are you implying that ClickHouse is too large? You can build ClickHouse with most features disabled; it must be much smaller if you do that.
Reminder: ClickHouse can optionally be embedded; you don't need to reach for Duck just because of the hype (it's buggy as hell every time I've tried it).
https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...
Chdb is awesome but so is duckdb
Is Apache Druid still a player in this space? I never seem to hear about it anymore. Why would someone choose it over ClickHouse?
Or Apache Doris... I'm also curious.