Search is, by it’s very nature, a complex problem. Just when you feel proud of yourself for getting everything pertinent stored in a DB, getting the right columns indexed, making find operations fast and efficient, someone’s going to ask ‘now I want to search the complete text of every post, and-
Okay, so that’s difficult, right?
And you’ll see some pretty kludgey solutions out there: using google indexing for your site, or just restricting searches to titles rather than full text. What are the key points in getting a really complete site search feature working?
1) Do we really need full-text search?
A lot of the wizardry we’re going to talk about feels like just that: wizardry. It’s easy to get caught up in algorithmic speed testing before thinking about what the business needs of this search really are.
If we’re letting a user search our e-commerce site, searching the text of reviews might be super unhelpful.
If we’re going to end up wanting to manipulate these results later, it would be a lot easier to just have ‘search’ go through manually selected tags than full text. E.g: when the user searches for ‘Ray Ban Sunglasses’ we want to offer him the cheap (and higher margin) substitutes. Rather than putting in some weird exception to our full-text search system, it would be a lot easier to just have search look at tags first, then we could tag whatever we really wanted without having to put our ‘thumb on the scale.’
2) do we really really need full text search?
trying to search, and even worse live-search, to work with a complex data set is a complex task, and we need to have user stories that justify the time and effort.
One question most ‘search’ user stories need to answer: would an extract or frequency table work equally well?
To be more specific, what about exporting a frequency table of words from the full text, removing the particles, and just suggesting the ‘matches’ from the top 10,000 words? What about an object containing frequently searched terms, with an array of post ID’s? All of these could be indexed ahead of time, with hourly re-indexing, for super fast performance, they all look a lot like full-text search, and they’re easier to manipulate.
You’re still here?
Okay, so let’s talk about how to make Postgres really sing through full-text search.
1) Get your indexes tuned
I came to programming from professional card playing, not a CS program, so I don’t always have the theoretical knowledge I’d like on every subject.
I do know this though: the speed with which you can search a set depends a lot on what you know about the set before hand
add index for any column you’ll repeatedly query against is a really smart first step, but you should go back and mark every index for columns that you know are unique.
CREATE UNIQUE INDEX name ON table (column [, …]);
The reason why this helps is pretty easy to imagine: even with an index that shows you where you can find records of type ‘x’ is great, but it saves a ton of time if you know this listing is the only listing of its type in the index.
2) make what structure you can
I call this the ‘title subtitle optimization:’ if you’re concerned about search performance later, as much as possible give your stored information structure. That means if each post has a title and a subtitle, store them in separate columns rather than one text blob with a few tags.
The advantage of this approach mainly comes up in page ranking: it’s kind of a gimme that every match to a page title should be ranked above pages with the search term deep in the description.
Also: the shorter the blobs you’re searching in Postgres the faster your results will be.
3) your data has structure whether you know it or not
tl:dr version: use JSONB to store arbitrary structured data
It’s not actually possible to index (or for that matter compress) fully random data but good news: you’re a thinking machine not a bag o’ dice, you really can’t produce random output even if you wanted to.
JSONB lets you store structured data that can be indexed without knowing anything about its structure. and vastly improves performance
4) GIN indexing is your friend
I’d go so far as to call it your best friend. While it can index arbitrary structured data as JSONB, it’ll also let you index arbitrary text
CREATE INDEX idxgin ON api USING GIN (jdoc);
gin indexing was significantly improved in 9.4
Remember that you’ll need to format the query to tsvector to get to actually use this index. e.g.
NEW.tsv = to_tsvector(array_to_string(array( select json_array_elements(NEW.items)->>’name’ ),’ ‘));
I could not have written this article, nor had my career in programming, without the writing of David Bryant Copeland on Postgres