
Let's say you have 100000 documents in your index that match your query, but the user has access to only 10 of them:

A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them. Most of the time, you've now eliminated all of your search results.

Your search must be access aware to do a reasonable job of pre-filtering the content to documents the user has access to, at which point you then can apply post-filtering with the "100% sure" access check.
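A toy sketch of that two-stage shape (the `acl` map and `authoritative_check` callback are made-up names for illustration):

```python
def search(query_matches, user, acl, authoritative_check, k=10):
    # Pre-filter: a cheap, possibly-stale ACL lookup narrows the candidate set
    # before any expensive work happens.
    candidates = [(doc, score) for doc, score in query_matches
                  if user in acl.get(doc, set())]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    # Post-filter: the "100% sure" access check runs only on the few survivors.
    return [(doc, score) for doc, score in candidates[:k]
            if authoritative_check(user, doc)]
```

The point is that the expensive check now runs on at most k pre-filtered candidates instead of the whole result page.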



Yes. But this is still an incredibly well-known and solved problem. As an example, Google's internal structured search engines did this decades ago at scale.


Which solutions are you referring to? With access that is highly diverse and changing, this is still an unsolved problem to my knowledge.


Probably Google Zanzibar (and the various non-Google systems that were created as a result of the paper describing Zanzibar).


Just use a database that supports both filtering and vector search, such as postgres with pgvector (or any other, I think all are adding vector search nowadays).
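A minimal sketch of that approach, assuming a `documents` table with an `owner_id` column and a pgvector `embedding` column (the SQL is just held in a Python string here; the schema and parameter names are assumptions):

```python
# `<=>` is pgvector's cosine-distance operator; the access filter and the
# vector ranking happen in a single query, so the database can combine them.
QUERY = """
SELECT id, content
FROM documents
WHERE owner_id = %(user_id)s            -- access filter in the WHERE clause
ORDER BY embedding <=> %(query_vec)s    -- pgvector cosine-distance operator
LIMIT 20;
"""
```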


Agree...as simple as:

  @pxt.query
  def search_documents(query_text: str, user_id: str):
      sim = chunks.text.similarity(query_text)
      return (
          chunks.where(
              (chunks.user_id == user_id)          # Metadata filtering
              & (sim > 0.5)                        # Filter by similarity threshold
              & (pxt_str.len(chunks.text) > 30)    # Additional filter/transformation
          )
          .order_by(sim, asc=False)
          .select(
              chunks.text,
              source_doc=chunks.document,          # Ref to the original document
              sim=sim,
              title=chunks.title,
              heading=chunks.heading,
              page_number=chunks.page,
          )
          .limit(20)
      )

For instance in https://github.com/pixeltable/pixeltable


The thing about a user needing access to only 10 documents is that creating a new index from scratch on those ten documents takes basically zero time.

Vector databases intended for this purpose filter this way by default for exactly this reason. It doesn't matter how many documents are in the master index; it could be 100000 or 100000000. Once you filter down to the 10 that your user is allowed to see, it takes the same tenth of a second or whatever to whip up a new bespoke index just for them for this query.

Pre-search filtering is only a problem when your filter captures a large portion of the original corpus, which is rare. How often are you querying "all documents that Joe Schmoe isn't allowed to view"?
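The "bespoke index" can be as trivial as brute-force scoring the filtered handful of documents; with ~10 survivors there's nothing to gain from an ANN structure. A toy sketch (all names and the in-memory corpus shape are assumptions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_allowed(corpus, allowed_ids, query_vec, k=10):
    # Filter first: the size of the master index is irrelevant from here on.
    subset = [(doc_id, vec) for doc_id, vec in corpus.items()
              if doc_id in allowed_ids]
    # Exact brute-force scoring over the tiny allowed subset.
    scored = sorted(subset, key=lambda p: cosine(p[1], query_vec), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```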


If you can move your access check to the DB layer, you skip a lot of this trouble.

Index your ACLs, index your users, index your docs. Your database can handle it.


Apache Accumulo solved access-aware querying a while ago.


"Fun" Fact: ServiceNow simply passes this problem on to its users.

I've seen a list of what was supposed to be 20 items of something; it only showed 2, plus a comment "18 results were omitted due to insufficient permissions".

(ServiceNow has at least three different ways to do permissions; I don't know if this applies to all of them.)


I'm not sure if enumerating the hidden results is a great idea :0


At the very least, it's a terrible user experience to have to click the "more" button several times to see the number of items you actually wanted to see.

But yes, one could probably also construct a series of queries that reveal properties of hidden objects.


> Let's say you have 100000 documents in your index that match your query

If the docs were indexed by groups/roles and you had some form of RBAC then this wouldn't happen.


If you take this approach, you have to reindex when groups/roles change, which is not always a feasible choice.


You only have to update the metadata, not do a full reindex.


You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.


> You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.

Right, but compare this to the original proposal:

> A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them

Using an index is much better than that.

And it should be possible to update the index without a substantial cost, since most of the 100000 documents likely aren't changing their role access very often. You only have to reindex a document's metadata when that changes.

This is also far less costly than updating the actual content index (the vector embeddings) when the document content changes, which you have to do regardless of your permissions model.
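A toy illustration of why the metadata-only update is cheap: a permission change touches only the roles field, while the (expensive to recompute) embedding stays as-is. The structure here is made up for illustration:

```python
index = {
    "doc-42": {
        "embedding": [0.12, -0.53, 0.88],   # recomputed only when content changes
        "roles": {"eng", "support"},        # cheap to update in place
    },
}

def grant_role(index, doc_id, role):
    # No re-embedding needed: only the access metadata changes.
    index[doc_id]["roles"].add(role)

grant_role(index, "doc-42", "sales")
```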


I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index.

If you use your index to get search results, then you will have a mix of roles that you then have to filter.

If you want to filter first, then you need to make a whole new search index from scratch with the documents that came out of the filter.

You can't use the same indexing information from the full corpus to search a subset, your classical search will have undefined IDF terms and your vector search will find empty clusters.

If you want quality search results and a filter, you have to commit to reindexing your data live at query time after the filter step and before the search step.

I don't think Elastic supports this (last time I used it it was being managed in a bizarre way, so I may be wrong). Azure AI Search does this by default. I don't know about others.


> I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index

It's a separate index.

You store document access rules in the metadata. These metadata fields can be indexed and then used as a pre-filter before the vector search.

> I don't think Elastic supports this

https://www.elastic.co/docs/solutions/search/vector/knn#knn-...
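For reference, a filtered kNN request body looks roughly like this (field names and values are assumptions; in Elasticsearch the `filter` inside the `knn` clause is applied during the vector search, not as a post-filter on the results):

```python
# Request body for POST /my-index/_search, shown as a Python dict.
body = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.1, 0.2, 0.3],
        "k": 20,
        "num_candidates": 200,
        # Applied while searching, so the k results all pass the filter.
        "filter": {"term": {"allowed_roles": "eng"}},
    }
}
```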



