Ok, about those support documents, AKA "Lore"
These are documentation, coding standards, style guides, framework and/or project build guidelines, and other representations of personal, customary, or organizational standards.
You have likely already assembled a significant collection of these, in the form of Cursor and GitHub Copilot instruction files and the like.
Such house lore is represented via RDF (a small example follows the list below).
WTF, RDF?
- The Resource Description Framework yields “data to be processed” rather than “text to be interpreted”
- These RDF files contain ontologies for a document or codebase
- Built by gen and/or by hand, then run through audit iterations
- Can greatly reduce ambiguity, though at the cost of some extra syntax, verbosity, and token usage
- Curated & maintained in a Knowledge Graph, by hand or by gen
Domain / Project Lore
Source code examples, being implementations, give you a good “how”, with elaboration dependent on documentation discipline. Additional information about the code’s structure, reasoning, and rationale can greatly enhance prompt efficacy.
This can be rendered via both source code and domain-knowledge summary documents, represented in RDF to convey disambiguated specification and guidance (see the sketch after the list below).
- Not necessarily, or even usually, just a single file
- All examples here use RDF in Turtle format; JSON-LD is another option
- Whimsically, these are molds used to help cast a particular information-processing shape
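As a hypothetical illustration of that structure-and-rationale layer (made-up module and predicate names, Turtle carried in a Python string), a domain-lore fragment might capture not just what a module is, but why it is shaped that way:

```python
from rdflib import Graph

# Hypothetical domain-lore fragment: structure, reasoning, and rationale
# for one module, expressed as Turtle.
DOMAIN_LORE = """
@prefix proj: <http://example.org/project#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

proj:InvoiceBatcher a proj:Module ;
    rdfs:label "InvoiceBatcher" ;
    proj:implements proj:MonthlyBillingCycle ;
    proj:rationale "Runs as a nightly batch because the upstream ERP rate-limits API calls" ;
    proj:exampleSource "src/billing/invoice_batcher.py" .
"""

g = Graph()
g.parse(data=DOMAIN_LORE, format="turtle")
print(len(g), "triples of project lore loaded")
```

A prompt that carries a fragment like this alongside the module’s source gives the model the “why” as well as the “how”.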
LLMs consume an Internet-full of information and then guesstimate semantics. These are often very GOOD estimates! But intent can be missed. RDF-enhanced specification & example files can improve one-shot codegen efficacy by clarifying a very particular context: YOUR house practices, YOUR domain base, etc.
So, RAG?
Well, yeah. “Regular” RAG would embed the lore into a vector store and pull chunks back in by similarity. The example process used here is more akin to a rudimentary graphRAG, with the appropriate support materials selected and placed into the prompt context by hand. MCP + agent + a full graph database can be used as well, of course. The goal was to illustrate how structured, specific & relevant information can augment inference efficacy beyond the LLM’s massive but more generalized latent space.
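A rudimentary version of that manual selection step might look like the sketch below — same hypothetical `house:` vocabulary as above; the file name and task text are placeholders. It queries the lore graph for the standards relevant to the area being worked on, then splices the matches into the prompt context:

```python
from rdflib import Graph

g = Graph()
g.parse("house_lore.ttl", format="turtle")  # placeholder: the curated lore file(s)

# Pull only the standards that apply to the area being worked on.
query = """
PREFIX house: <http://example.org/house#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?rule WHERE {
    ?std a house:CodingStandard ;
         rdfs:label ?label ;
         house:rule ?rule ;
         house:appliesTo house:ServiceLayer .
}
"""

lore_snippets = [f"{row.label}: {row.rule}" for row in g.query(query)]

prompt = (
    "Follow these house standards:\n"
    + "\n".join(lore_snippets)
    + "\n\nTask: implement the new payment service endpoint."
)
print(prompt)
```

Swapping the hard-coded `house:ServiceLayer` filter for task-driven selection — or for an MCP-connected graph database — is where this starts to shade into a fuller graphRAG setup.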