SharePoint is the grounded RAG layer you already own
The SharePoint tenant your company already pays for is a permission-trimmed, already-indexed grounding layer for AI agents. A bespoke vector pipeline is usually the wrong default for internal work.
Key takeaways
- The tenant semantic index is auto-generated, can't be disabled, and requires no admin involvement. You are paying for a grounding layer whether or not your agent uses it.
- Permission trimming is inherited from the source ACLs, not rebuilt. Grounding only reads content the current user is already authorized to see.
- The counterargument holds for the 20%: the live connector path skips vector embeddings, and Microsoft's own comparison says the synced path retrieves more accurately.
- Grounding does not create leaks. It reads your existing permission mistakes back to every employee at machine speed, which is a governance problem you owned before the agent.
- Without an in-tenant Copilot license, generative answers cap at 7 MB per SharePoint file and 2,048 list rows. Governance-included is cheap, not free of asterisks.
In this article
- Why this argument matters now
- Governance is the feature, not the file store
- The black box you cannot tune
- A worked example: Copilot Studio over a SharePoint list
- Open questions
Every internal-agent project I see starts the same way. Someone scopes a vector database, picks an embedding model, argues about chunk size, and budgets a quarter to wire up document sync and security trimming. Then they discover that the company's actual knowledge already lives in SharePoint, behind permissions IT spent a decade tuning, and that Microsoft has been quietly indexing all of it the whole time. The vector pipeline is real engineering. It is also, for most internal agents, a rebuild of infrastructure the tenant already ships. The question worth asking before you provision anything: what does the box already do for free, and where does free stop being good enough? This is not a Microsoft sales pitch but a default-architecture argument: the default for an internal agent should be the grounding layer you already own, and the burden of proof should sit on the team that wants to build a new one.
Here is the position. The SharePoint tenant an enterprise already owns is the cheapest, governance-included grounded-RAG knowledge layer for AI agents, and building a bespoke vector pipeline is usually the wrong default for internal agents. Not because the DIY pipeline is bad engineering. Because it re-pays for three things the tenant already gives you: a maintained index, inherited permission trimming, and a billing line that reads zero incremental dollars. For roughly four of every five internal agents, governance-included beats tunable-but-ungoverned. The other fifth is real, and I will name exactly where it lives.
Why this argument matters now
The economics changed when the index stopped being optional. Microsoft now creates a semantic index for every subscription at the tenant and user level. It is described as "an organization-wide index generated from text-based SharePoint Online files," and the documentation is blunt about two properties that matter here: "The indexing process requires no administrative involvement," and "Semantic indexing is an improvement to Microsoft 365 Search and can't be disabled" (Microsoft Learn, semantic indexing). Read that twice. The index exists before your agent does, you cannot turn it off, and you are already paying for it through the subscription. Provisioning a second index to do the same job is the part that needs justification.
The second shift is Microsoft framing DIY indexing as overhead in its own words. The Copilot Retrieval API is pitched as giving you RAG "without the need to replicate, index, chunk, and secure your data in a separate index" (Retrieval API overview). When the vendor whose platform you are on tells you that replicating and chunking your own data is avoidable work, that is worth a pause. The replicate-chunk-secure loop is exactly the project plan a vector-pipeline team writes on day one.
Third, the agent surface caught up. SharePoint is now a first-class, catalog-level grounding target inside agent builders, not a thing you bolt on with custom code. In a Microsoft Power Platform community session, a presenter wired a Copilot Studio agent to a SharePoint list through an out-of-the-box Model Context Protocol connector and had it write extracted data into the list with no integration code at all. The session demonstrates the write direction; the read direction, the grounding side this post argues about, is the same surface running the other way. The point is not the demo. The point is that Microsoft is making SharePoint a bidirectional, no-code agent surface, which is the precondition for treating it as your default knowledge layer rather than a file dump.
If you have read my earlier piece on structured retrieval beating vector RAG for enterprise agents, this is the same instinct applied one layer up. There the argument was about typed rows versus cosine scores. Here it is about owned infrastructure versus rebuilt infrastructure.
Governance is the feature, not the file store
The strongest reason to ground on SharePoint is the permission layer you do not have to write. In a DIY vector store, security trimming is your code. Every query has to re-check the asking user against an access control list you replicated out of the source system, kept in sync, and now own forever. On the tenant index, that work is inherited. When content is indexed, Microsoft says it continues "to honor the user identity-based access boundary so that the grounding process only accesses content that the current user is authorized to access" (semantic indexing). The boundary is not something you build. It is something you fail to break.
The mechanism is documented down to the data structure. Each indexed item from a connector "includes content, metadata (like title and URL), and an access control list (ACL) that enforces permissions," and "Search and Copilot only show items to users who have access in the source system" (connectors overview). Synced connectors "respect source permissions"; federated ones are "secure by design" against the source's OAuth. The semantic index itself "respects all organizational boundaries within your tenant," gated by role-based access control. And critically, indexing does not widen exposure: "indexing data doesn't change access permissions to content."
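For contrast, here is roughly what "security trimming is your code" means on the DIY side. This is a minimal sketch, not Microsoft's implementation and not any vector store's API: the `IndexedItem` shape, the `trim_for_user` helper, and the principal names are all hypothetical, chosen only to mirror the content-metadata-ACL structure quoted above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexedItem:
    """Hypothetical shape of one indexed item: content, metadata, and an ACL."""
    title: str
    url: str
    content: str
    allowed_principals: frozenset[str] = field(default_factory=frozenset)


def trim_for_user(hits: list[IndexedItem], user_principals: set[str]) -> list[IndexedItem]:
    """Keep only hits whose ACL shares at least one principal with the user.

    In a DIY store this check runs in your code on every query, and the ACL
    copy it reads is only as fresh as your last sync from the source system.
    """
    return [item for item in hits if item.allowed_principals & user_principals]


# Two retrieved hits with different ACLs (illustrative data only).
hits = [
    IndexedItem("Travel policy", "https://contoso.example/policies/travel",
                "Economy class under six hours.", frozenset({"all-employees"})),
    IndexedItem("Payroll export", "https://contoso.example/hr/payroll",
                "Salary bands by grade.", frozenset({"hr-payroll"})),
]

visible = trim_for_user(hits, {"alice@contoso.example", "all-employees"})
print([item.title for item in visible])  # -> ['Travel policy']
```

The filter itself is one line. The cost is everything around it: replicating the ACLs, keeping them in sync, and owning the incident when the copy drifts from the source.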
Compare the two cost structures honestly.
| Property | SharePoint-grounded | DIY vector RAG |
|---|---|---|
| Index provisioning | Auto-generated, can't be disabled, no admin work (source) | New Azure AI Search or Qdrant/Pinecone tier to stand up |
| Incremental cost | Included in the subscription | New recurring line item, billed per search unit |
| Permission trimming | Inherited from source ACLs | Your code, your sync, your liability |
| Data freshness | Updated docs "immediately indexed"; new docs daily (source) | Whatever your sync job achieves |
| Chunker / embeddings / ranker | Microsoft's choice, opaque | Yours to tune |
| Storage location | Isolated in-region tenant container | Wherever you put it, secured by you |
Look at the chunker row, because that is where the trade lives. You give up the chunker, the embedding model, and the ranker. You get back a maintained index, free trimming, and freshness measured in "immediately" for updates. For an HR policy bot, a procurement assistant, an IT help agent, or any agent answering from documents your colleagues can already open, that trade is lopsided in favor of the box. The governance you would spend a quarter rebuilding ships turned on.
The neighboring argument is observability, which I covered in "Microsoft 365 ships agent inventory not observability." Governance-included grounding and governance-included inventory are the same bet: the platform does the boring, liability-heavy plumbing better than you will, and your job is to use it correctly rather than reinvent it.
There is a scarier objection, and it is not technical: grounding amplifies the permission sprawl you already have. Microsoft is direct about the root cause: "Most internal oversharing stems from configuration issues rather than malicious user intent" (mitigate oversharing). The agent does not create the leak. It reads your existing misconfigurations back to every employee at machine speed. A document shared too broadly five years ago was a latent risk; behind a grounded agent it becomes a query away.
This is real, and it is the failure mode you must plan for. But notice what it actually argues. It does not argue for a DIY vector store, because a DIY store inherits the exact same source permissions if you trim correctly, and inherits worse ones if you trim sloppily. The oversharing problem is upstream of the retrieval architecture. It is a property of your ACLs, not your index. So it is not a reason to build your own RAG. It is a reason to fix permissions before you scale any agent, on any architecture.
Microsoft positions specific tooling for exactly this, framed as a prerequisite rather than a nice-to-have: SharePoint Advanced Management for permission-state reports and site access reviews, plus Restricted Content Discovery and Purview DSPM for AI for exposure monitoring. SharePoint Advanced Management is included with the Microsoft 365 Copilot license, which means the remediation toolkit ships with the same license that unlocks the better grounding. Admins can also exclude a site outright by setting "Allow this site to appear in search results" to No, though Microsoft warns this "should only be considered for sensitive data, such as payroll, HR, or financial information," and notes there is "no option to exclude results from Microsoft Search only or semantic indexing only."
The honest summary: the trimming is real and free, but it only reflects the ACLs you maintain. Bad permissions in, fast leaks out. The governance-included claim is true at the retrieval layer and conditional at the configuration layer. Anyone selling you SharePoint grounding without saying that is selling you a future incident.
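To make "bad permissions in, fast leaks out" concrete, here is a minimal sketch of the kind of pre-flight scan I mean. It assumes you have already exported a permissions report into rows; the column names (`site`, `item`, `principal`) and the sample data are hypothetical and do not reflect the format of any SharePoint Advanced Management report. "Everyone" and "Everyone except external users" are the built-in SharePoint principals that most often signal tenant-wide access.

```python
from collections import Counter

# Principals that usually mean "more or less the whole company".
# Extend this set with the broad groups that exist in your own tenant.
BROAD_PRINCIPALS = {"Everyone", "Everyone except external users"}


def overshared(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return the report rows that grant access to a broad principal."""
    return [row for row in rows if row["principal"] in BROAD_PRINCIPALS]


def sites_by_exposure(rows: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Rank sites by how many broadly shared items they contain."""
    return Counter(row["site"] for row in overshared(rows)).most_common()


# Illustrative rows only; a real export will have different columns.
report = [
    {"site": "/sites/hr", "item": "salary-bands.xlsx", "principal": "Everyone except external users"},
    {"site": "/sites/hr", "item": "org-chart.pptx", "principal": "Everyone"},
    {"site": "/sites/policies", "item": "travel.docx", "principal": "Everyone except external users"},
    {"site": "/sites/finance", "item": "forecast.xlsx", "principal": "finance-team"},
]

print(sites_by_exposure(report))  # -> [('/sites/hr', 2), ('/sites/policies', 1)]
```

Not every hit is a mistake: a travel policy shared with the whole company is probably intended. The ranking is a review queue, not a verdict.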
The black box you cannot tune
Here is where the thesis loses, and it loses for real reasons. If you ground on the live SharePoint connector, you are inside a retrieval system you cannot adjust. The practitioner Harry Traynor put the mechanism plainly: the connector path "doesn't store files in Dataverse, doesn't build vector embeddings," and "the retrieval quality is different, because the content isn't semantically indexed" (why your agent's knowledge breaks on import). You do not pick the chunker. You do not pick the embedding model. You do not pick the ranker. When recall is wrong, your only lever is the source documents, not the retrieval stack.
It gets sharper. The quality is not even uniform inside Microsoft's own stack. Traynor relays that "Microsoft's own comparison shows the ingested/synced files produce more accurate, contextually grounded answers" than the live connector path. So there are two SharePoint paths, and conflating them is the most common analysis error I see. Path (a) is the connector-based knowledge source: searches content live, builds no embeddings, friendly to application lifecycle management. Path (b) is the unstructured-data upload path: it copies files into Dataverse and runs a managed chunk-and-embed pipeline, which is the higher-quality option. Pick the wrong one and you silently degrade retrieval while believing you chose "SharePoint."
So how do I keep the default position? By scoping it. The thesis is "usually" and "for internal agents," not "always." The DIY pipeline wins, clearly, when you need high recall on a large, noisy corpus where one missed chunk is a real failure; when domain-specific chunking matters, like splitting on legal clause boundaries or code blocks; when you need a tuned reranker; or when your corpus is not even in SharePoint. That is the 20%. For a customer-facing legal-research agent over ten thousand contracts, build the vector pipeline and tune it. For the HR bot, do not.
💡 The choice is not SharePoint versus a vector DB. It is connector-live versus Dataverse-synced versus DIY, and most teams degrade their own retrieval by picking the wrong one of the three without knowing there were three.
The honest framing is that you are trading tunability for governance, and the only sin is not knowing which one you bought. If you have decided you genuinely need the tunable path, the Dataverse-synced option inside Microsoft's own stack buys most of the quality back before you ever leave for a third-party vector store, and it keeps the inherited trimming. The all-the-way-out DIY build, fed by a SharePoint indexer or a Graph export into Azure AI Search, is documented as a real pattern (Brian Love's Azure AI RAG over SharePoint), and it reintroduces every cost the tenant index was absorbing for you.
A worked example: Copilot Studio over a SharePoint list
The concrete setup is smaller than the architecture diagrams suggest. Take the agent from that Microsoft Power Platform community session and run it as a grounding case. The presenter dropped a Word document into a Copilot Studio agent and the agent refused it. On GPT-4.1 you have to enable the Code Interpreter toggle in the agent's settings before it will read Word files; on a newer model like GPT-5 the agent reads the document directly with no toggle, and the result, in the presenter's words, "looks better." That single detail, GPT-4.1 needs Code Interpreter while GPT-5 reads Word natively, is the kind of version-specific gotcha that decides whether a pilot works on the first try.
Then the agent wrote the extracted fields into a SharePoint list through the Work IQ SharePoint MCP connector, added from the Tools catalog by filtering on "model context protocol," with nothing supplied but the list URL. The connector auto-discovered the site and list schema. That is the write direction. Flip it and the same surface is your read-side grounding layer: a Copilot Studio agent grounded on that same list answers questions against live rows.
For the read path you would configure a knowledge source, and the auth is the whole gate. The manual configuration requires only two scopes:
    knowledge_source:
      type: sharepoint
      site_url: "https://contoso.sharepoint.com/sites/policies"
      scopes:
        - Sites.Read.All   # read site content for grounding
        - Files.Read.All   # read documents for generative answers
      # list grounding is a real-time connection to the source;
      # users authenticate with their own SharePoint credentials
Those scopes (knowledge-add-sharepoint) are read-only, and list grounding opens "a real-time connection to the source, so the most current data is used for queries and reasoning," with users "authenticated using their SharePoint credentials." No replicated copy, no stale snapshot, no separate credential system. The tenant's existing access is the gate.
Now the asterisks, with numbers attached. Without an in-tenant Copilot license, "generative answers can only use SharePoint files that are under 7 MB"; with a Copilot license plus Work IQ that ceiling rises to "files up to 200 MB" (quotas and limits). List queries "only return data from the first 2,048 rows of data," you "can select up to 15 lists at a time," and unstructured sync runs on a "four to six hours" cadence. The general upload cap is "512 MB" per file, per that same quotas page. The 2,048-row ceiling alone disqualifies a list-grounded agent over any large dataset, which is precisely a case where the structured or DIY path earns its cost. The numbers tell you the boundary of the free lunch.
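Those caps are easy to turn into a pre-flight check. The sketch below is one way to do it, under stated assumptions: the constants are the figures quoted above and will drift as the quotas page changes, the `GroundingPlan` type and `cap_violations` helper are hypothetical names of mine, and whether each boundary is inclusive or exclusive is a guess the quotes do not settle.

```python
from dataclasses import dataclass, field

# Limits as quoted above from the quotas and limits page. They are expected
# to move, so treat these constants as a snapshot, not a contract.
FILE_CAP_MB_NO_LICENSE = 7
FILE_CAP_MB_LICENSED = 200  # in-tenant Copilot license plus Work IQ
LIST_ROW_CAP = 2_048
MAX_LISTS_PER_SELECTION = 15


@dataclass
class GroundingPlan:
    """Hypothetical description of what an agent would ground on."""
    largest_file_mb: float
    list_rows: dict[str, int] = field(default_factory=dict)  # list name -> row count
    licensed: bool = False  # Copilot license plus Work IQ


def cap_violations(plan: GroundingPlan) -> list[str]:
    """Return human-readable reasons the plan falls outside the quoted caps."""
    problems = []
    file_cap = FILE_CAP_MB_LICENSED if plan.licensed else FILE_CAP_MB_NO_LICENSE
    # Boundary handling (inclusive vs. exclusive) is an assumption of this sketch.
    if plan.largest_file_mb > file_cap:
        problems.append(f"largest file is {plan.largest_file_mb} MB, cap is {file_cap} MB")
    if len(plan.list_rows) > MAX_LISTS_PER_SELECTION:
        problems.append(f"{len(plan.list_rows)} lists selected, cap is {MAX_LISTS_PER_SELECTION}")
    for name, rows in plan.list_rows.items():
        if rows > LIST_ROW_CAP:
            problems.append(f"list '{name}' has {rows} rows, only the first {LIST_ROW_CAP} are queried")
    return problems


plan = GroundingPlan(largest_file_mb=42.0, list_rows={"Vendors": 310, "Invoices": 12_000})
for problem in cap_violations(plan):
    print(problem)
```

An empty result does not prove the tenant path is good enough. A non-empty one tells you, before any build work, that you are outside what the unlicensed path will answer from.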
So how do you apply this? Default to the box, then prove you need to leave it. Concretely:
- Make SharePoint the default, demand justification to deviate. For any agent answering from documents your colleagues can already open, ground on the tenant index first. Put the burden of proof on the team that wants a separate vector store, not on the team using what ships.
- Pick the path on purpose. Decide explicitly between connector-live (no embeddings, lifecycle-friendly), Dataverse-synced (managed chunk-and-embed, better quality), and full DIY (total control, full cost). The most expensive mistake is picking one by accident. The connector-versus-Dataverse quality gap is real and silent.
- Audit permissions before you scale, not after. Run SharePoint Advanced Management reports and a Purview DSPM for AI review before the agent goes wide. Grounding will surface every sharing mistake, so treat permission hygiene as a release blocker.
- Check your numbers against the caps first. If your corpus has files over 7 MB and you have no in-tenant Copilot license, or any list past 2,048 rows, you are already outside the free lunch. Confirm the license and the limits before you scope the build.
- Reserve the vector pipeline for the 20%. High-recall over large noisy corpora, clause-level or code-level chunking, a tuned reranker, or a corpus that is not in SharePoint at all. If none of those describe your agent, you are gold-plating.
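The checklist above can be written down as a decision function, which is a useful way to force the "pick the path on purpose" conversation. This is a sketch of my own heuristic, not a Microsoft decision tree: the `AgentNeeds` flags and the `choose_path` helper are hypothetical, and the ordering (DIY triggers first, then Dataverse-synced, then connector-live) is my reading of the argument in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    CONNECTOR_LIVE = "connector-live"      # no embeddings, lifecycle-friendly
    DATAVERSE_SYNCED = "dataverse-synced"  # managed chunk-and-embed
    DIY_VECTOR = "diy-vector"              # total control, full cost


@dataclass
class AgentNeeds:
    """Hypothetical flags; answer them explicitly before provisioning anything."""
    corpus_in_sharepoint: bool = True
    high_recall_on_noisy_corpus: bool = False  # one missed chunk is a real failure
    custom_chunking: bool = False              # e.g. legal clauses, code blocks
    tuned_reranker: bool = False
    live_quality_insufficient: bool = False    # connector-live answers were not good enough


def choose_path(needs: AgentNeeds) -> tuple[Path, list[str]]:
    """Return the chosen path plus the reasons, so the choice is documented."""
    diy_reasons = []
    if not needs.corpus_in_sharepoint:
        diy_reasons.append("corpus is not in SharePoint")
    if needs.high_recall_on_noisy_corpus:
        diy_reasons.append("a missed chunk is a real failure")
    if needs.custom_chunking:
        diy_reasons.append("domain-specific chunking")
    if needs.tuned_reranker:
        diy_reasons.append("tuned reranker")
    if diy_reasons:
        return Path.DIY_VECTOR, diy_reasons
    if needs.live_quality_insufficient:
        return Path.DATAVERSE_SYNCED, ["managed chunk-and-embed for better quality"]
    return Path.CONNECTOR_LIVE, ["default: ground on what the tenant already ships"]


hr_bot = AgentNeeds()
contracts_agent = AgentNeeds(high_recall_on_noisy_corpus=True, custom_chunking=True)
print(choose_path(hr_bot)[0].value)           # -> connector-live
print(choose_path(contracts_agent)[0].value)  # -> diy-vector
```

Returning the reasons alongside the path is deliberate: the burden of proof for leaving the default should be something you can paste into the design doc.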
This is the grounding-layer companion to my piece on why most Copilot Studio agent problems are integration problems. That post argued the surface fails before the model does. This one argues the grounding layer is usually a choice you have already paid for, and the failure is rebuilding it by reflex.
Open questions
A few things I am still watching, and expect to update as the platform moves.
The connector-versus-Dataverse quality gap is currently a qualitative claim relayed from Microsoft's own comparison. I have not found an independent, vendor-neutral benchmark, one that reports precision and recall, of SharePoint grounding against a tuned Qdrant or Pinecone pipeline, and I want one. Until that exists, "the synced path retrieves better" is a directional claim, not a measured one.
The cost side is also softer than I would like. Microsoft's Azure AI Search tier page gives only a "hypothetical billing rate of $100 per month" example, and the firmer dollar figures circulating, Basic near $75 and Standard S1 around $245 per search unit per month, come from third-party aggregators, not Microsoft. The directional argument holds: a DIY tier is a new recurring line and the tenant index is not. The exact delta deserves a real measurement.
I also expect the caps to move. The 7 MB and 2,048-row limits read like memory-budget artifacts, not permanent design choices, and Work IQ already lifts the file ceiling to 200 MB with a license. If those numbers loosen, the 20% case for going DIY shrinks further. Worth re-checking the quotas page each quarter.
If you are running internal agents and your path choice or your numbers look different from mine, I want the correction. I expect parts of this to age, particularly the caps and the cost figures. The fastest way to change my mind is a concrete counterexample where the DIY pipeline beat the tenant index on a workload that looked like the 80%. Reach me at marcus-duwe.de/contact. If your problem is the 20% and you are scoping the tunable path, the rest of the archive is free, and I would still like to compare notes.