Not a great article; I clicked expecting something super technical about SQLite internals and found a mix of RDBMS basics and some misconceptions. The limitations in the blog post aren't really specific to SQLite (for the most part); they're just how indexes (indices) and database engines work across the board. And some of the things phrased as "SQLite [can't] do this" are things that wouldn't make sense to do in the first place.
If you understand what (multi-column) indexes are at the lowest level (i.e. what data structure they represent, how they are used by the database, what the code reading them would look like) then all of this makes immediate and natural sense. Indexes aren't magic. They're just a shortcut to help the db engine find your data more effectively.
This doesn't require super complex understanding of data structures, btw. The same limitations and advice would apply if you were trying to, for example, search a stack of resumes sorted by one or more attributes. If you have them sorted by position, years of experience, and applicant's last name, you wouldn't be able to quickly grab all resumes for the HR Manager position that have a last name starting with J - after all, within each position they're sorted by years of experience before last name! It's physically not possible! You can only use the first part of that index, the position, to jump to HR Manager resumes; then you would need to manually go through them one-by-one to grab only the ones starting with J for each years-of-experience subgroup (if any).
some_old_guy 1 days ago [-]
The fact that it stops at the first range isn't intuitive to me at all, and I've been using sqlite for 20 years now.
Given a covering index on (thing_date, thing_color)
I would think a scan would not be needed for the query:
select thing_date, thing_color where thing_date < date and thing_color = 'blue'
I also can't think of a reason this is the case given the underlying data structures.
ComputerGuru 1 days ago [-]
It's a flattened tree, and you've used the index to reach a point where you have multiple child nodes that meet the precondition thing_date < 17. Some, but not all, of these have thing_color = blue, but the important part is that they're non-consecutive (the index can only give you one point-of-entry, so the data has to be consecutive or you need to do some amount of scan).
I asked an llm to give me an ascii representation so it'll be easier to see what I mean; consider the case where you want thing_date < 17 and thing_color = blue, the results that match are marked with an asterisk, but note that there's no way to get to them (ALL of them) directly:
Root (internal)
          +---------------------+
          |    keys:  14   17   |
          +------+-------+------+
                 |       |      |
                 v       v      v
+------------------+ +------------------+ +------------------+
| Leaf P0          | | Leaf P1          | | Leaf P2          |
| (dates 10-13)    | | (dates 14-16)    | | (dates 17-20)    |
|------------------| |------------------| |------------------|
| 10 | red    |101 | | 14 | blue   |141 |*| 17 | green  |171 |
| 11 | blue   |111 |*| 14 | red    |142 | | 18 | blue   |181 |
| 12 | green  |121 | | 15 | yellow |151 | | 19 | red    |191 |
| 13 | red    |131 | | 16 | blue   |161 |*| 20 | green  |201 |
+------------------+ +------------------+ +------------------+
Go back to my resumes example: you have the resumes sorted by application timestamp (chronologically) and position (alphabetically). You want to find all resumes received last Tuesday where position was "Senior Architect". You literally can't skip to the point where the next n consecutive results will be the results you seek, because you can't temporarily re-order them; within the subset of resumes where date > monday and date < wednesday, there may be multiple for "Senior Architect", but they didn't arrive one-after-the-other, so they're interspersed with results you don't want. The correct thing to do here would be (if this is always the shape of the query you expect to get) to sort the resumes by (position, timestamp) instead, at which point you absolutely CAN jump to the position that guarantees the next results are (all the results) of the query ("Senior Architect", last Tuesday).
One important thing to keep in mind is that a SCAN is not the end of the world. You know your data and you know when it's ok (you expect most of the results to match and a few to be filtered out, in which case you're really not saving much time anyway) and when it's not (the scan will cover a huge range that contains only a few, widely interspersed results and you have no way of terminating early).
EDIT: In response to your exact question however, note that with a covering index you might avoid hitting the backing table entirely even though the query can only seek on the (thing_date) prefix of the index, not the full (thing_date, thing_color) key! It can still return the results directly from the index - something it wouldn't have been able to do if the index were only on (thing_date).
lelandbatey 1 days ago [-]
The key thing is "an index is a flattened tree", and for all us devs who haven't thought about trees in many years or younger folks who might not yet know, that means it's conceptually a bunch of nested maps/objects/dictionaries where the keys at each "level" are the columns in that same "level" of the index. To use a little bit of python, here's a list of the raw maps in your ascii art DB:
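(A rough sketch of those nested maps in Python, built from the rows in the ASCII tree above; the variable name is made up, and a real index is a sorted tree rather than a hash map, but the shape is the point:)

```
# (date, color) -> rowid index modeled as nested dicts, one entry per row above
index_by_date_then_color = {
    10: {"red": 101},
    11: {"blue": 111},
    12: {"green": 121},
    13: {"red": 131},
    14: {"blue": 141, "red": 142},
    15: {"yellow": 151},
    16: {"blue": 161},
    17: {"green": 171},
    18: {"blue": 181},
    19: {"red": 191},
    20: {"green": 201},
}
```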
Notice that since this index is built from nested maps, if we want to use this index to find things, we HAVE to first do it by checking the keys in the "outermost" map, the 'date' column. It's a compound index but its order matters. This 'order matters' property is true in our map-based index and it's also true in our SQLite based index. It's also true that because this 'date < 17' is a range criteria, it has to check individual keys in the outermost map, which constitutes a SCAN of the index (smaller than a scan of the DB, but a scan nonetheless). To then find everything matching 'color = blue', it has to individually check all the color values in the result of the prior date SCAN in order to get down to only those with a 'color = blue'. As you can see, this index of date -> color isn't super helpful for the query 'WHERE color = blue AND date < 17' query. A query this index would be good for is a query like 'WHERE color = red AND date = 14'. That would not require any scans; if we were writing application code such a query with this index would be like calling `index[14][red]` which as we all know is super fast.
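(Continuing the sketch: the range query has to walk the outer keys, while the fully-pinned lookup goes straight to a rowid:)

```
# 'WHERE date < 17 AND color = blue': SCAN the outer keys, probe each inner map
blue_rowids = [colors["blue"]
               for date, colors in index_by_date_then_color.items()
               if date < 17 and "blue" in colors]
print(blue_rowids)  # [111, 141, 161]

# 'WHERE date = 14 AND color = red': no scan, just two direct lookups
print(index_by_date_then_color[14]["red"])  # 142
```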
A better index for this, which is a different index, comes from swapping the order of the columns when forming the index. Instead of (date, color), this better index would be (color, date). That index looks like this:
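(Sketched the same way, with the same rows:)

```
# (color, date) -> rowid: the same data, but the outer key is now the color
index_by_color_then_date = {
    "blue":   {11: 111, 14: 141, 16: 161, 18: 181},
    "green":  {12: 121, 17: 171, 20: 201},
    "red":    {10: 101, 13: 131, 14: 142, 19: 191},
    "yellow": {15: 151},
}
```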
Now with this new index to fulfill the exact same query of 'WHERE color = blue AND date < 17', we can do a `index[blue]` and then do a single SCAN of that intermediary to find only those `date < 17`. This still means our query does a SCAN, but it's scanning a smaller set of values (only 4 instead of at least 8 with the previous index) and we only do one scan instead of two scans.
Anyway, here's a gist of some Python if you want to play with this concept: https://gist.github.com/lelandbatey/d09557fed38c48a797bf1b15...
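(In the nested-map sketch from above, that query becomes:)

```
# 'WHERE color = blue AND date < 17': one direct lookup, then one small scan
blue_rowids = [rowid
               for date, rowid in index_by_color_then_date["blue"].items()
               if date < 17]
print(blue_rowids)  # [111, 141, 161]
```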
Indexes in general are not flattened trees; they are just trees. Using a Python map as a mental model is fraught with peril since those are implemented as hash tables, which don't have ordering (which most database indexes need to support). So it's the wrong model.
For multidimensional indexes, you don't need to do anything fancy about nesting; you just need to have keys that have more than one component. To a first order, string concatenation is a good enough model. So in your case, here's what the index looks like:
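(A sketch along those lines, using the same rows as the tree below, kept in sorted order by the concatenated key:)

```
index_entries = [
    ['10 red',   101],
    ['11 blue',  111],
    ['12 green', 121],
    ['13 red',   131],
    ['14 blue',  141],
    ['14 red',   142],
    # ... and so on for the remaining rows
]
```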
which is then organized in some sort of tree structure, like (the exact details don't matter):
['10 red', '12 green',   # min and max values for the tree below
 [
     ['10 red',   101],
     ['11 blue',  111],
     ['12 green', 121]
 ]],
['13 red', '14 red',     # similar
 [
     ['13 red',  131],
     ['14 blue', 141],
     ['14 red',  142]
 ]],
more nodes…
lelandbatey 1 days ago [-]
Yes, this is true. I chose a nested-maps example because while it is inaccurate for actual DB applications, it's very helpful for explaining the limits of index ordering and the limits that range queries have when interacting with indexes. It's helpful to use maps as a first explanation because they're one of the most-used data structures that nearly all languages have built in (yes, I know, C, I'm talking about the other 9 of the top 10 languages), and many workaday developers and dabblers will use nearly _only_ lists and maps (and objects/classes/structs) in their language of choice. I could've used Python's OrderedDict but I figured if I was going to stray away from "the most straightforward possible case using the data structures everyone is already using daily", I'd have been better off jumping straight to a custom tree, like you've done.
That's a great miniature example of such a tree!
derefr 1 days ago [-]
> You can only use the first part of that index, the position, to jump to HR Manager resumes; then you would need to manually go through them one-by-one to grab only the ones starting with J for each years-of-experience subgroup (if any).
You're phrasing that like this situation — relying on an index formed by (key 1 that you know, key 2 that you don't know, key 3 that you want to depend on) — necessarily implies a sequential walk of everything within the HR manager subgroup.
But it doesn't; within each key-1 partition of the index, you can binary-search to find the boundaries of the key-2 partitions; and within those, you can binary-search to find all (or, more commonly, the first) last_name[0] = "J" entry/entries. It's actually not overly slow (i.e. is often still a win over a naive index-partial seq scan) if the cardinality ratio of key-2 : key-3 isn't too extreme (i.e. if you're saving useful amounts of time by eliminating enough key-3 entries from consideration per key-2 partition.)
(We use this approach quite a bit at $WORK — MySQL and Postgres people like to call this a "loose index scan": https://wiki.postgresql.org/wiki/Loose_indexscan. [See the "Making use of a non-leading column of an index" section.])
ComputerGuru 6 hours ago [-]
Well, yes, I omitted anything that may or may not happen to explain the general principle. Cardinality-based optimizations can and do take place, but they depend on the db being aware of the shape of your data (you need to analyze the tables often) and depend on internal factors and heuristics that are subject to change between releases, can't be assumed across different databases, and may or may not actually speed up the query (pathological cases certainly exist, even in the real world).
Dylan16807 15 hours ago [-]
Looking at the documentation, MySQL definitely has this feature now.
I don't think it understands when to use it.
I have a table where the primary key (A,B) is two integers. The first one is a single byte and only has a dozen distinct values. Any query I do based on just B ends up doing a table scan and being extremely slow. But if I add "WHERE A IN (1,2,3,4,5,6,7,8,9,10,11) AND" suddenly it's super fast.
So I'm stuck with a redundant index on just B if I don't want to add that cruft all over.
Anyway, yes, optimization is often possible in this kind of situation but don't trust your database engine to figure it out.
simscitizen 1 days ago [-]
I agree; these rules aren't the right way to teach how to reason about this. All of the perf properties described should fall out of the understanding that both tables and indices in SQLite are B-trees. B-trees have the following properties:
- can look up a key or key prefix in O(log N) time ("seek a cursor" in DB parlance, or maybe "find/find prefix and return an iterator" for regular programmers)
- can iterate to next row in amortized O(1) time ("advance a cursor" in DB parlance, or maybe "advance an iterator" for regular programmers). Note that unordered data structures like hash maps don't have this property. So the mental model has to start with thinking that tables/indices are ordered data structures or you're already lost.
A table is a b+tree where the key is the rowid and the value is the row (well, except for WITHOUT ROWID tables).
An index is a b-tree where the key is the indexed column(s) and the value is a rowid.
And SQLite generally only does simple nested loop joins. No hash joins etc. Just the most obvious joining that you could do if you yourself wrote database-like logic using ordered data structures with the same perf properties e.g. std::map.
From this it ought to be pretty obvious why column order in an index matters, etc.
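(A toy Python sketch of that cursor model: a sorted list of tuple keys stands in for the b-tree, bisect is the "seek", and plain iteration is the "advance". This is only the access pattern, not how SQLite actually stores anything.)

```
import bisect

# index on (date, color): sorted tuple keys, each carrying a rowid
index = [((10, "red"), 101), ((11, "blue"), 111), ((12, "green"), 121),
         ((13, "red"), 131), ((14, "blue"), 141), ((14, "red"), 142),
         ((15, "yellow"), 151), ((16, "blue"), 161), ((17, "green"), 171)]
keys = [k for k, _ in index]

# seek the cursor to the first key with date >= 14 (O(log N))...
pos = bisect.bisect_left(keys, (14, ""))
# ...then advance it while the leading column still matches (O(1) per step)
rowids = []
while pos < len(index) and index[pos][0][0] == 14:
    rowids.append(index[pos][1])
    pos += 1
print(rowids)  # [141, 142]
```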
emmelaich 1 days ago [-]
The non-equivalence of .9 and 0.9 was certainly a surprise to me.
SQLite 24 hours ago [-]
Yeah, but who ever writes "x=0.9" as a constraint on a partial index? Really? Don't you know you aren't supposed to compare floating point quantities for equality?
If P is the expression on the partial index and Q is the WHERE clause of the query, then the partial index is only usable if Q implies P for all possible assignments of variables. A theorem prover is needed to establish this. Every RDBMS has one. The one inside SQLite is not terribly bright, true enough. It leans toward using little memory and few CPU cycles. It does not do a good job if P contains "x=0.9". On the other hand, SQLite's theorem prover is decent if P contains "x IS NOT NULL", because in actual practice, probably about 90% of partial index WHERE clauses are some variation on "x IS NOT NULL".
The partial index expression does not always have to be exactly the same as what is in the WHERE clause of the query. SQLite will always find the match if P is a subset of Q; if Q can be rewritten as "R AND P". But if P is "x IS NOT NULL" and Q does anything that restricts x from being NULL, for example if Q contains "x>0", then SQLite's theorem prover will find that match too, even if "IS NOT NULL" never appears in Q.
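(A quick way to see that particular case for yourself, sketched with Python's sqlite3 module; the expected plan in the comment assumes a reasonably recent SQLite:)

```
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(id INTEGER PRIMARY KEY, x REAL, y TEXT);
    CREATE INDEX t_x_notnull ON t(x) WHERE x IS NOT NULL;
""")
# Q contains "x > 0", which implies the index's P ("x IS NOT NULL"),
# so the partial index is usable even though Q never says IS NOT NULL.
for row in con.execute("EXPLAIN QUERY PLAN SELECT y FROM t WHERE x > 0"):
    print(row)  # expect: SEARCH t USING INDEX t_x_notnull (x>?)
```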
Will the theorem prover in SQLite get better someday? Perhaps. It has gotten better over the years. The question becomes, how much more code space and query-planning CPU cycles are you willing to spend to get a slightly better query planner? This trade-off is different for a client/server database engine. With SQLite being embedded, the trade-off tends to fall more on the side of "keep it simple". If you have followed SQLite over many years, you might have noticed it is shifting toward more complex decision making as memory becomes cheaper and CPUs get faster. It's a tricky balancing act to find the sweet spot.
emmelaich 23 hours ago [-]
> Yeah, but who ever writes "x=0.9" as a constraint on a partial index?
Not me! But you have me curious now; does sqlite do a text comparison for the constraint? Surely (maybe not) 0.9 == .9?
Can you do a constraint as (int)(100 * x) <= 90?
PS. Thanks for sqlite!
SQLite 22 hours ago [-]
The top-level routine is here: <https://sqlite.org/src/info/aae36a5fbd17?ln=6767-6818>. Small (32-bit) integer literals are compared numerically, here: <https://sqlite.org/src/info/aae36a5fbd17?ln=6526>. They don't have to exactly match. So if you say "x=0x123" in the WHERE clause of the partial index and "x=291" in the WHERE clause of the query, that will still work. However, 64-bit integer literals and floating-point literals are compared using strcmp(), here: <https://sqlite.org/src/info/aae36a5fbd17?ln=6570>, so they do need to match exactly, at least in the current implementation. Maybe that is something I should work on...
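(The same behavior observed from the outside, sketched with Python's sqlite3 module; the table and index names are made up, and the expected plans assume the current behavior described above:)

```
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(a INTEGER, b REAL, c TEXT);
    CREATE INDEX idx_a ON t(c) WHERE a = 0x123;
    CREATE INDEX idx_b ON t(c) WHERE b = 0.9;
""")
# small integer literals are compared numerically: 291 matches 0x123
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT c FROM t WHERE a = 291 AND c = 'x'").fetchall())
# expect: SEARCH t USING INDEX idx_a (c=?)

# float literals are compared with strcmp(): '.9' does not match '0.9'
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT c FROM t WHERE b = .9 AND c = 'x'").fetchall())
# expect: SCAN t (the partial index is not usable)
```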
nikeee 1 days ago [-]
I use the mental model of nested maps for "column order matters".
For example, an index "published, low_quality_probability, lang" is just a Map<date, Map<number, Map<lang, rowId>>> in my mental model.
These maps are ordered by the order the index possesses. That explains why column order matters and why one cannot skip columns and why it stops at range queries.
Just imagine getting a final rowId from these nested maps and you'll see why the index works for some queries and doesn't for others.
masklinn 1 days ago [-]
It’s way easier if you think of the indexes as tuple keys in a binary tree.
Because they’re tuple keys in a b-tree. That also explains how ranges work efficiently.
jasonwatkinspdx 1 days ago [-]
Yeah, if you're going to work with databases regularly I think it's worth learning how b-trees work. It'll make a lot of things much more intuitive.
If you wanna get a very complete grounding in how the big rdbmses work, Andy Pavlo's lectures and class notes are fantastic.
runlaszlorun 1 days ago [-]
I definitely second Andy Pavlo's lectures.
adzm 1 days ago [-]
I use this same approach to explain indexes to database novices; it usually helps a lot, especially with how the leaf nodes / included columns work as well (using that info instead of or in addition to rowid/primary key as the last value there)
emschwartz 1 days ago [-]
That is a helpful way of thinking about it. Thanks for sharing!
immibis 1 days ago [-]
It's actually a List<Tuple<date, number, rowid>> in sorted order, and queries are more akin to binary search (they are not actually binary but use a wider fanout depending on many factors)
nikeee 24 hours ago [-]
Yeah, it might be closer to what's actually happening, but it doesn't make it obvious why something doesn't work. The map model doesn't even cover non-unique indices. Or different types of indexes.
Sesse__ 1 days ago [-]
It's _not_ just a list in sorted order; if it were, you could not insert efficiently into it. But the tuple is the right mental model, indeed.
wodenokoto 1 days ago [-]
The article is a developer's journey into indexes, and not a bad journey or travelogue imho.
Sure, if you are a database expert, it might be disappointing but I enjoyed reading it.
emschwartz 1 days ago [-]
Thanks for saying that! That’s exactly how it was intended and I’m glad to hear you enjoyed it
andersmurphy 1 days ago [-]
Fun read.
I've been tripped up by the WHERE clause in partial indexes before. Same goes for expression indexes.
emschwartz 1 days ago [-]
Thank you for saying that too!
Hope this helped explain why, at least a little bit.
themafia 1 days ago [-]
You may not need a course. SQLite publishes a very useful guide on the query planner which goes over most of what you need to know with respect to indexes:
https://www.sqlite.org/optoverview.html
I agree with other commenters that there's nothing new or surprising here [1] if you actually understand how indexes work. Which, if you're working in SQL, you should.
But the main takeaway, I disagree with. The author explains:
> When I first set up Scour's database, I put a bunch of indexes on the items table without really thinking about whether they would help. For example, I had separate indexes on the published date, the language, and the quality rating. Useless. It's more important to have one or a small handful of good composite indexes on multiple columns than to have separate indexes on each column.
Yes, the biggest error is throwing indexes at a table without having the slightest idea if they're helpful. But a smaller number of composite indexes is not the answer either.
The answer is to go through every single query and make sure you have an index that matches the query, adding/changing indexes (including composite indexes) or rewriting queries as required.
Indexes don't exist in a vacuum. They are optimizations for specific queries. Proper database design requires knowing exactly what information you will need to look up using which parameters, and designing the tables and indexes to produce that performantly. When you need to look up data in new ways, you often need to add new indexes or rearchitect tables entirely.
[1] Except that the partial index match condition treats 0.9 and .9 as different. That is unexpected to me, but it is kind of in the docs at "The terms in W and X must match exactly": https://www.sqlite.org/partialindex.html
DaiPlusPlus 1 days ago [-]
> Yes, the biggest error is throwing indexes at a table without having the slightest idea if they're helpful
Unless you have a small number of static queries, this task isn’t really possible without an observability-solution (reporting the raw query text, the actual execution plan, and runtime and IO stats, et cetera) - otherwise it’s guesswork.
…and worse-still if your application has runtime-defined queries, such as having a custom filter builder in your UI. Actually I’ll admit I have no idea how platforms like Jira are able to do that with decent performance (can anyone let me know?)
I heap praise on MSSQL’s Query Store feature - which is still a relatively recent addition. I have no idea how anyone could manage query performance without needing 10x the time and effort needed to attach a profiler to a prod DB. …and if these are “edge” databases running on a 100k IoT devices then have fun, I guess.
crazygringo 1 days ago [-]
It's not guesswork at all. It basically just requires knowing what it is allowed to look up rows by. Execution plans generally aren't rocket science. And if someone messes up in writing their query, yes that should show up in query execution time stats. You don't need 10x anything to track average query times and spot outliers.
For runtime-defined queries, those are obviously not going to have global indexes if there are more than a couple possible fields. But as long as you're only iterating over, say, 10K rows that an index can determine belong to that customer, and this query is happening once per page rather than 200 times per page, that's fine. That's why you can search over 10K bug reports for a project without a problem, because they only belong to a specific project. You're not searching all 100M bug reports belonging to all clients.
immibis 1 days ago [-]
People who are deeper into databases differentiate between OLTP and OLAP workloads. OLTP - on-line transaction processing - mostly consists of a finite set of queries that each access a small amount of data, like you when you pay a bill at a bank. OLAP - on-line analytical processing - consists of mostly summaries of large amounts of data which can be ad-hoc, like the banker who wants to know the total transactions for the day. The two kinds of workloads are very different - so much so that some systems even periodically export the whole transaction database and re-import into a separate analytics DBMS designed for OLAP work.
Ameo 1 days ago [-]
The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited. It’s reliant on stuff like the order in which WHERE conditions are specified, isn’t able to use multiple indexes in queries in many cases, bails out to scans when a variety of different operations show up in queries, etc.
It might be the case that SQLite has a simpler or less sophisticated query planner than other databases like Postgres or MariaDB, but in my experience those DBs struggle a lot with good query planning as well. I’ve spent many hours in the past with issues like Postgres suddenly starting to ignore an index entirely because its computed table data distribution statistics got out of balance, or having to add manual annotations to MariaDB queries like STRAIGHT_JOIN in order to get a query to run faster.
I’m guessing that this is a really hard problem since it doesn’t seem to be really “solved” by any major DB vendor I’ve seen. A lot of modern DB engines like Clickhouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.
crazygringo 1 days ago [-]
> The main takeaway from this for me is that SQLite’s query planner seems to be pretty limited.
This doesn't appear to be true at all.
The order of WHERE conditions does not matter; the order of columns in an index does.
Everything you're describing is pretty much just how indexes fundamentally work in all databases. Which is why you're saying it hasn't been "solved" by anyone.
Indexes aren't magic -- if you understand how they work as a tree, it becomes very clear what can be optimized and what can't.
It is true that occasionally query planners get it wrong, but it's also often the case that your query was written in a non-obvious way that is equivalent in terms of its formal results, but is not idiomatic -- and making it more idiomatic means the query planner can more easily understand which indexes to use where.
Ameo 1 days ago [-]
(copying my reply from the other comment that said the same thing as you)
The order of conditions in a WHERE definitely does matter, especially in cases where the conditions are on non-indexed columns or there are CPU-intensive search operations like regex, string ops, etc.
I just ran this test locally with a table I created that has 50 million rows:
```
» time sqlite3 test.db "select count() from test WHERE a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f' AND f = 'g'"
sqlite3 test.db 5.50s user 0.72s system 99% cpu 6.225 total
» time sqlite3 test.db "select count() from test WHERE f = 'g' AND a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f'"
sqlite3 test.db 1.51s user 0.72s system 99% cpu 2.231 total
```
The only difference is swapping the `f = 'g'` condition from last to first. That condition never matches in this query, so it's able to fail fast and skip all of the work of checking the other conditions.
crazygringo 1 days ago [-]
Sorry, I should have clarified -- the order of WHERE conditions doesn't matter for whether an index is utilized. I thought that was the context of the original comment, but now I realize maybe it was unclear.
Yes, of course you can skip evaluating other conditions if an AND fails and that can affect speed. So that's the same as most programming languages.
SkiFire13 1 days ago [-]
> A lot of modern DB engines like Clickhouse tend to just work around this problem by being so fast at full table scans that they don’t even need any sophisticated indexing set up at all.
There's only so much you can do with this approach due to how the algorithmic complexity scales as more joins are added. At some point you'll need some additional data structures to speed things up, though they may not be indexes in name (e.g. materialized views)
setr 1 days ago [-]
Clickhouse isn’t fast at table scans; it’s just columnar. Indexes are basically a maintained transform from row storage to column storage; columnar databases are essentially already “indexed” by their nature (and they auto-apply some additional indexes on top, like zone maps). It’s only fast for table-scans in the sense that you probably aren’t doing a select * from table, so it’s only iterating over a few columns of data, whereas SQLite would end up iterating over literally everything — a table-scan doesn’t really mean the same thing between the two (a columnar database’s worst fear is selecting every column; a row-based database wants to avoid selecting every row)
Their problem is instead that getting back to a row, even within a table, is essentially a join. Which is why they fundamentally suck at point lookups, and they strongly favor analytic queries that largely work column-wise.
immibis 1 days ago [-]
Columnar databases are not "already "indexed"". Their advantage instead comes from their ability to only load the relevant parts of rows when doing scans.
jasonwatkinspdx 1 days ago [-]
"The indexes are the database" is a common perspective in column database implementations because it works quite well for ad hoc OLAP.
setr 23 hours ago [-]
They’re indexed in the sense that they’re already halfway to the structure of an index — which is why they’re happy to toss indexes on top arbitrarily, instead of demanding the user to manage a minimum subset.
SkiFire13 20 hours ago [-]
What does it even mean to be "halfway" to the structure of an index? Do they allow filtering a subset of rows with a complexity that's less than linear in the total number of rows or not?
emschwartz 1 days ago [-]
Just to clarify one thing: the order of WHERE conditions in a query does not matter. The order of columns in an index does.
Ameo 1 days ago [-]
It definitely does matter, especially in cases where the conditions are on non-indexed columns or there are CPU-intensive search operations like regex, string ops, etc.
I just ran this test locally with a table I created that has 50 million rows:
```
» time sqlite3 test.db "select count() from test WHERE a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f' AND f = 'g'"
sqlite3 test.db 5.50s user 0.72s system 99% cpu 6.225 total
» time sqlite3 test.db "select count() from test WHERE f = 'g' AND a != 'a' AND a != 'b' AND a != 'c' AND a != 'd' AND b != 'c' AND d != 'd' AND e != 'f'"
sqlite3 test.db 1.51s user 0.72s system 99% cpu 2.231 total
```
The only difference is swapping the `f = 'g'` condition from last to first. That condition never matches in this query, so it's able to fail fast and skip all of the work of checking the other conditions.
kccqzy 1 days ago [-]
> It's worth being careful to only add indexes that will be used by real queries.
This reminds me of a technique used by Google App Engine SDK a long time ago before it was called cloud. Basically in development mode, the SDK captures the kind of queries you make, and then automatically adds any index that would speed up those queries to a configuration file. You then later deploy with this configuration file, which tells the production data store to create such indexes. I thought this was a genius idea. Not sure if something similar exists for SQLite.
It's not quite the same as capturing all of the queries used in development (or production), but it seems somewhat useful.
I'll also note that I had an LLM generate quite a useful script to identify unused indexes (it scanned the code base for SQL queries, ran `EXPLAIN QUERY PLAN` on each one to identify which indexes were being used, and cross-referenced that against the indexes in the database to find unused ones). It would probably be possible to do something similar (but definitely imperfect) where you find all of the queries, get the query plans, and use an LLM to make suggestions about what indexes would speed up those queries.
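(Roughly what such a script boils down to, as a sketch: collecting the queries from the code base is application-specific, so here they're just passed in as a list, and the regex only handles straightforward plan lines:)

```
import re
import sqlite3

def unused_indexes(db_path, queries):
    """Return index names that none of the given queries' plans mention."""
    con = sqlite3.connect(db_path)
    all_indexes = {name for (name,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'")}
    used = set()
    for query in queries:
        # parameterized queries may need dummy bindings before this works
        for row in con.execute("EXPLAIN QUERY PLAN " + query):
            detail = row[-1]
            used.update(re.findall(r"USING (?:COVERING )?INDEX (\w+)", detail))
    return all_indexes - used
```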
zeroq 23 hours ago [-]
wow!
I used to bruteforce a bunch of indexes until EXPLAIN on queries gave satisfactory results!
I actually looked for a tool where I could provide a schema and all queries to get optimal indexes but never found one that actually worked.
crazygringo 1 days ago [-]
Every index you add uses a whole lot more disk space (a big deal) and slows down writes (usually not a big deal).
The idea of a tool automatically adding indexes scares me a little.
Sometimes you specifically want a query to take 0.003s instead of 0.001s because you don't want to use another 10GB of disk space.
Waterluvian 1 days ago [-]
This seems like the kind of tool that would be useful for all indexable databases.
I imagine you can’t definitively know and therefore just make this automagical. But I bet the result is a pretty solid shortlist for consideration.
rtpg 23 hours ago [-]
Postgres is pretty good about using index data from multiple single column indices... if you don't order results! Otherwise the planner will get in the way and do nonsense.
A bunch of people spend years studying computer science and trees etc, and then when we actually need to care about that stuff the databases absolutely do not want us to declare the plan. Very annoying.
Think about how people would be guided to do the right thing on indices if they had to say "go along this index to get my data" in a "planned mode" when dealing with bad performance? So many data layout and index issues would become immediately obvious. You wouldn't magically get as many benefits from DB updates, granted.
porridgeraisin 1 days ago [-]
> multiple indices
Yes, sqlite uses only one index per table in FROM clause... Except for when it can use the OR optimization [1]
> Left to right, no skipping, stops at the first range
I don't know if we need a soundbite for this, nor is it sqlite specific. This is a direct consequence of the fact that indexes are basically just sorted lists you can binary search over. They are sorted left to right (you have to choose some order anyhow) meaning all the other orderings cannot avail of the speedup, and thus sqlite chooses to not use it.....
Except if the left column of the index is measured to be low cardinality by ANALYZE [2]'s output in the sqlite_stat1 table, in which case it is ok to scan over it and then binary search the right column. More at the skip-scan optimization [3].
[1] https://www.sqlite.org/optoverview.html#evaluating_or_constr...
[2] https://www.sqlite.org/lang_analyze.html
[3] https://www.sqlite.org/optoverview.html#the_skip_scan_optimi...
One thing I'm missing in SQLite is multi-valued indexes. I would want to have an index on func_returning_list(some_column), and a query doing WHERE x IN func_returning_list(some_column) should use that index.