While building several micro-agents in Coda, my clients and I have had ample opportunity to experiment with different document types and formats across a few LLMs (Gemini, GPT, Perplexity).
We have found that the document format can indeed have a significant impact on the model’s ability to ‘understand’ the information and provide an accurate response.
We are injecting pages of approximately 200 kB into our prompts as a kind of ‘poor man’s’ RAG solution. (This way, we can use native Coda pages for our knowledge base instead of an external vector DB.)
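A minimal sketch of what this ‘poor man’s’ RAG looks like in practice: instead of retrieving chunks from a vector database, the whole knowledge-base page (capped at roughly 200 kB) is pasted into the prompt ahead of the question. All names here are illustrative, not part of any Coda or LLM API.

```python
MAX_PAGE_BYTES = 200_000  # rough budget for the injected page

def build_prompt(question: str, page_text: str) -> str:
    """Prepend a knowledge-base page to the user's question (illustrative sketch)."""
    # Truncate on the byte budget, ignoring any partial character at the cut.
    page = page_text.encode("utf-8")[:MAX_PAGE_BYTES].decode("utf-8", errors="ignore")
    return (
        "Use ONLY the reference material below to answer.\n\n"
        "--- REFERENCE ---\n"
        f"{page}\n"
        "--- END REFERENCE ---\n\n"
        f"Question: {question}\n"
    )

prompt = build_prompt(
    "What is our refund policy?",
    "## Refund Policy\n30 days, no questions asked.",
)
```

The trade-off versus a vector DB is obvious: no retrieval step to tune, but every query pays the token cost of the full page.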
Top of the list for best results is YAML. It is best for highly structured information where we would normally use JSON to enforce a strict structure. But JSON consumes a LOT more tokens, and we get far more ‘lost the plot’ errors - Gemini and GPT seem to follow complex YAML better.
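To make the token difference concrete, here is the same record serialized both ways. The record is a made-up example, and character count is only a crude proxy for token count, but the direction holds: YAML drops the braces, quotes, and commas that JSON spends tokens on.

```python
import json

# An arbitrary illustrative record.
record = {
    "customer": {"name": "Acme Corp", "tier": "enterprise"},
    "open_tickets": [
        {"id": 101, "priority": "high"},
        {"id": 102, "priority": "low"},
    ],
}

as_json = json.dumps(record, indent=2)

# Equivalent YAML, written by hand (PyYAML's yaml.dump would produce much the same).
as_yaml = """\
customer:
  name: Acme Corp
  tier: enterprise
open_tickets:
  - id: 101
    priority: high
  - id: 102
    priority: low
"""

# YAML comes out noticeably shorter than the pretty-printed JSON.
print(len(as_json), len(as_yaml))
```

The structural nesting survives in both, which is why we still get JSON-like strictness from YAML at a lower token cost.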
Next is Markdown-formatted, human-oriented text. The emphasis given by headings, the structure given by indented lists (numbered or not), and the convenience and clarity provided by tables all contribute to the model’s ability to ‘follow the plot’ and understand the content.
BUT - the style and language used in the documents has a big impact (see below).
Richly formatted pages imported from MS Word, Google Docs, RTF tools etc. (and richly formatted native Coda pages) come next - but in practice, these are probably converted internally to Markdown before being injected into the context window.
Even when using Vector Databases, we found PDFs to be terrible. The PDF data structure encodes positioning and font-setting for the text, but loses the human-readable structure, such as headings, indentations, bullet lists, table layouts, etc. We find we need to track down the original docs and use them instead. (Andrew Ng has set up a company to address this with PDFs).
We found that PowerPoint and other presentation formats are best imported into Coda as rich text first, so they are presented to the LLM as a set of sections within a page (we do not yet have a way to understand the pictures, but we expect that to be fixed soon).
The other dimension we have found matters greatly is the style and language used in these documents.
Top of the list for clarity are technical specifications and ‘how-to’ documents: procedure definitions, checklists, training documents, corporate wikis, etc. These use clear, precise, uncluttered language, focused on delivering good understanding with minimal ‘fluff’.
Next on the list are well-crafted narrative docs like articles, journalism, management reports, official memos, etc. Professional, impersonal language with a clear purpose.
Not so useful are transcripts of meetings and message threads from Slack, WhatsApp, Discord, or email. These contain a lot of ‘chatter’ and emotive language that detracts from the clarity of purpose. They can be used. They work. Just not as effectively.
And worst of all (in our experience) are marketing and sales documents, which seem to sugar-coat everything with salesmanship and overly flowery descriptions. The LLM is naive and gullible and swallows this stuff whole with great credulity. It works if you want the agent to persist in the propaganda, but it’s not great for executing strictly business operations.
Of course, all this is changing rapidly. So by the time you read this, it is probably out of date. And, as always, your mileage may vary.
But I wanted to share our hard-won experience, because the LLM Labs seem not to want us to know the weaknesses or limitations of their technology.
Respect,
➤𝖒𝖆𝖝