Let's say you have 100000 documents in your index that match your query, but the user has access to only 10 of them:
A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them. Most of the time, you've now eliminated all of your search results.
Your search must be access aware to do a reasonable job of pre-filtering the content to documents the user has access to, at which point you then can apply post-filtering with the "100% sure" access check.
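A minimal sketch of that two-stage approach, with toy data and a hypothetical `acl` field and `is_authorized` check standing in for the real permission system:

```python
# Toy corpus: each document carries an ACL list (hypothetical schema).
DOCS = [
    {"id": 1, "text": "q3 budget", "acl": ["alice", "bob"]},
    {"id": 2, "text": "q3 roadmap", "acl": ["bob"]},
    {"id": 3, "text": "q3 budget notes", "acl": ["alice"]},
]

def is_authorized(user, doc):
    # Stand-in for the expensive, authoritative permission check.
    return user in doc["acl"]

def search(query, user, limit=1000):
    # Pre-filter: only score documents the index claims the user can read.
    candidates = [d for d in DOCS if user in d["acl"]]
    hits = [d for d in candidates if query in d["text"]][:limit]
    # Post-filter: the "100% sure" access check on the survivors.
    return [d for d in hits if is_authorized(user, d)]

print([d["id"] for d in search("budget", "alice")])  # -> [1, 3]
```

The point is that the expensive check in the last step now runs over a handful of pre-filtered candidates, not over 1000 mostly-inaccessible hits.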
Yes.
But this is still an incredibly well known and solved problem.
As an example - google's internal structured search engines did this decades ago at scale.
Just use a database that supports both filtering and vector search, such as postgres with pgvector (or any other, I think all are adding vector search nowadays).
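With pgvector, for example, the ACL filter and the nearest-neighbour ranking can go into a single query. Sketch only, assuming a hypothetical `docs` table with an `acl` array column and an `embedding` vector column:

```python
# Hypothetical pgvector query; you'd execute it via psycopg or similar.
# "<=>" is pgvector's cosine-distance operator.
QUERY = """
SELECT id, body
FROM docs
WHERE %(user)s = ANY (acl)          -- permission filter first
ORDER BY embedding <=> %(query_vec)s  -- then rank by cosine distance
LIMIT 10;
"""
```

The planner filters on the ACL before (or interleaved with) the vector scan, so you never rank documents the user can't see.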
The thing about a user needing access to only 10 documents is that creating a new index from scratch on those ten documents takes basically zero time.
Vector databases intended for this purpose filter this way by default for exactly this reason. It doesn't matter how many documents are in the master index, it could be 100000 or 100000000. Once you filter down to the 10 that your user is allowed to see, it takes the same tenth of a second or whatever to whip up a new bespoke index just for them for this query.
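A sketch of that "bespoke index" idea: once the ACL filter leaves a handful of vectors, an exhaustive scan over just those is effectively free. This pure-Python brute force stands in for whatever ANN structure the database would build:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_subset(query_vec, allowed_docs, k=5):
    # The "index build" for 10 docs is just materialising this list; with
    # so few vectors, exact brute-force search replaces any ANN structure.
    scored = [(cosine(query_vec, d["vec"]), d["id"]) for d in allowed_docs]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

allowed = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.0, 1.0]},
    {"id": "c", "vec": [0.7, 0.7]},
]
print(search_subset([1.0, 0.1], allowed, k=2))  # -> ['a', 'c']
```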
Pre-search filtering is only a problem when your filter captures a large portion of the original corpus, which is rare. How often are you querying "all documents that Joe Schmoe isn't allowed to view"?
"Fun" Fact: ServiceNow simply passes this problem on to its users.
I've seen a list that was supposed to show 20 items; it showed only 2, plus a comment: "18 results were omitted due to insufficient permissions".
(ServiceNow has at least three different ways to do permissions; I don't know if this applies to all of them.)
> You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.
Right, but compare this to the original proposal:
> A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them
Using an index is much better than that.
And it should be possible to update the index without a substantial cost, since most of the 100000 documents likely aren't changing their role access very often. You only have to reindex a document's metadata when that changes.
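A sketch of that incremental update, assuming (hypothetically) that the ACL metadata lives in its own map keyed by document id, separate from the embeddings:

```python
# Two separate stores (assumed layout): embeddings change only when content
# changes; the ACL map changes only when permissions change.
embeddings = {"doc1": [0.1, 0.9], "doc2": [0.8, 0.2]}
acl_index = {"doc1": {"alice"}, "doc2": {"alice", "bob"}}

def update_acl(doc_id, new_roles):
    # Reindex only this document's metadata; the embedding is untouched,
    # so no re-embedding cost is paid for a permissions change.
    acl_index[doc_id] = set(new_roles)

update_acl("doc1", ["alice", "bob"])
print(acl_index["doc1"])   # now includes bob
print(embeddings["doc1"])  # unchanged: [0.1, 0.9]
```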
This is also far less costly than updating the actual content index (the vector embeddings) when the document content changes, which you have to do regardless of your permissions model.
I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index.
If you use your index to get search results, then you will have a mix of roles that you then have to filter.
If you want to filter first, then you need to make a whole new search index from scratch with the documents that came out of the filter.
You can't use the same indexing information from the full corpus to search a subset, your classical search will have undefined IDF terms and your vector search will find empty clusters.
If you want quality search results and a filter, you have to commit to reindexing your data live at query time after the filter step and before the search step.
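A toy version of that filter-then-reindex pipeline: IDF is recomputed over only the filtered subset, so no term is scored against documents the user can't see. Simplified TF-IDF with made-up data:

```python
import math
from collections import Counter

def tfidf_search(query_terms, docs):
    # IDF computed over *these* docs only, i.e. the post-filter subset,
    # so every term's document frequency is well defined.
    n = len(docs)
    df = Counter(t for d in docs for t in set(d["terms"]))
    def score(d):
        tf = Counter(d["terms"])
        return sum(tf[t] * math.log((n + 1) / (df[t] + 1))
                   for t in query_terms if t in df)
    return sorted(docs, key=score, reverse=True)

def filtered_search(query_terms, corpus, user):
    allowed = [d for d in corpus if user in d["acl"]]  # 1. filter
    return tfidf_search(query_terms, allowed)          # 2. reindex + search

corpus = [
    {"id": 1, "terms": ["tax", "report"], "acl": {"alice"}},
    {"id": 2, "terms": ["tax", "memo"], "acl": {"bob"}},
    {"id": 3, "terms": ["report", "draft"], "acl": {"alice"}},
]
results = filtered_search(["tax", "report"], corpus, "alice")
print([d["id"] for d in results])  # -> [1, 3]; doc 2 never enters scoring
```

Note that doc 2's terms never influence the document frequencies, which is exactly the reindex-after-filter behaviour described above.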
I don't think Elastic supports this (last time I used it it was being managed in a bizarre way, so I may be wrong). Azure AI Search does this by default. I don't know about others.