The Problem
Technical teams at a renewable energy and hydrogen operator spent a disproportionate share of their time searching for information before making engineering and regulatory decisions. Documentation was fragmented across multiple repositories, file formats, and systems. Answering a critical question (about a specific process parameter, a regulatory threshold, or a historical equipment failure) meant manually searching dozens of documents, often for several hours.
The volume and technical density of hydrogen and renewable energy documentation made the problem structural, not just inconvenient. Regulations, engineering specs, maintenance records, and vendor documentation exist as PDFs, scanned documents, and embedded tables — formats that standard search tools cannot interrogate intelligently.
The Constraints
Domain accuracy was non-negotiable. In hydrogen and renewable energy operations, incorrect information has physical consequences. An AI system providing a wrong regulatory threshold or a misquoted safety parameter could drive bad engineering decisions. Hallucination was not an acceptable failure mode — the system had to know what it did not know.
Heterogeneous document formats. The knowledge base included scanned PDFs, image-embedded tables, handwritten field notes digitized to image files, and structured SQL databases with operational history. A retrieval system that could only handle clean text would miss a substantial portion of the available knowledge.
Conversation continuity. Engineers ask follow-up questions. A system that answered each question in isolation, without retaining context from prior exchanges, would force users to re-establish context constantly, defeating the purpose of an intelligent assistant.
Our Approach
We built a multi-agent Retrieval-Augmented Generation (RAG) system using LangChain and OpenAI GPT-4, with specialized agents handling distinct query types in a coordinated pipeline.
The knowledge base construction phase used OCR (pytesseract) to extract text from image-embedded documents and scanned PDFs, making previously inaccessible content retrievable. All documents were chunked, embedded, and stored in ChromaDB for semantic similarity search at query time.
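The extract-then-chunk step can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the character-window chunking strategy, and the chunk_size and overlap defaults are all assumptions. The OCR helper uses the real pytesseract call image_to_string and needs Pillow plus a local Tesseract install, so it is imported lazily.

```python
from typing import List


def ocr_image(path: str) -> str:
    """Extract text from one scanned page (hypothetical helper).

    Requires Pillow, pytesseract, and the Tesseract binary to be installed.
    """
    # Imported lazily so chunk_text below works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows ready for embedding.

    chunk_size and overlap are illustrative defaults, not values from the project.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    # Collapse the irregular whitespace that OCR output typically contains.
    text = " ".join(text.split())

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval when a threshold and its unit land on either side of a split. Each chunk would then be embedded and written to the vector store.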
Four specialized agents handle the retrieval and reasoning pipeline:
- Router agent classifies each query and dispatches it to the appropriate specialist
- Document search agent performs semantic retrieval from the ChromaDB vector store
- SQL agent queries structured operational databases and generates the corresponding visualizations
- Synthesis agent integrates outputs across sources into a coherent, cited response
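As a rough sketch of how the router's dispatch could look, the example below classifies a query and hands it to one of two specialists. Everything here is hypothetical: the keyword-based classify function stands in for what would more plausibly be an LLM classification call, and the agent functions are placeholders that only echo the query.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AgentResult:
    agent: str    # which specialist produced the result
    content: str  # the specialist's output, to be passed to synthesis


def document_search_agent(query: str) -> AgentResult:
    # Placeholder: a real agent would run semantic retrieval on the vector store.
    return AgentResult("document_search", f"[retrieved passages for: {query}]")


def sql_agent(query: str) -> AgentResult:
    # Placeholder: a real agent would generate and execute a SQL query.
    return AgentResult("sql", f"[query results for: {query}]")


# Hypothetical hints suggesting a question about structured operational data.
SQL_HINTS = ("how many", "average", "trend", "per month", "total")


def classify(query: str) -> str:
    """Return the name of the specialist that should handle the query."""
    q = query.lower()
    return "sql" if any(hint in q for hint in SQL_HINTS) else "document_search"


AGENTS: Dict[str, Callable[[str], AgentResult]] = {
    "document_search": document_search_agent,
    "sql": sql_agent,
}


def route(query: str) -> AgentResult:
    """Classify the query, then dispatch it to the matching specialist."""
    return AGENTS[classify(query)](query)
```

Keeping the classifier separate from the dispatch table means the classification method can change (keywords, a small model, an LLM prompt) without touching the specialists.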
The system maintains conversation history within each session, allowing engineers to ask follow-up questions without re-stating context. A relevance filter prevents the system from attempting to answer questions outside its knowledge domain — it returns “I don’t have sufficient information on this” rather than hallucinating an answer.
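One way to picture the session memory and the relevance filter together is the sketch below. It is an assumption-laden illustration: the Session class, the min_score threshold of 0.75, and the placeholder reply are invented for the example, and it assumes retrieval scores where higher means more similar (a store that returns distances would need the comparison inverted).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

REFUSAL = "I don't have sufficient information on this"


@dataclass
class Session:
    """Per-session conversation memory (hypothetical structure)."""
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)
    max_turns: int = 5  # illustrative cap on how many past turns feed the prompt

    def context(self) -> str:
        """Format the most recent turns for inclusion in the next prompt."""
        recent = self.history[-self.max_turns:]
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in recent)

    def record(self, question: str, answer: str) -> None:
        self.history.append((question, answer))


def filter_relevant(hits: List[Tuple[str, float]], min_score: float = 0.75) -> List[str]:
    """Keep only passages whose similarity score meets the threshold."""
    return [text for text, score in hits if score >= min_score]


def answer(session: Session, question: str, hits: List[Tuple[str, float]],
           min_score: float = 0.75) -> str:
    """Refuse when nothing relevant was retrieved; otherwise answer from passages."""
    passages = filter_relevant(hits, min_score)
    if not passages:
        reply = REFUSAL
    else:
        # Placeholder for the LLM call: a real prompt would combine
        # session.context(), the passages, and the question.
        reply = f"Based on {len(passages)} retrieved passage(s): {passages[0]}"
    session.record(question, reply)
    return reply
```

The refusal is decided before any generation step, so an out-of-domain question never reaches the model with an empty or irrelevant context.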
FastAPI serves the conversational interface; the entire system is containerized in Docker and deployed on Azure.
The Outcome
- Information retrieval time dropped from hours to seconds for standard technical queries
- Analysis speed improved by 85% through automated SQL query generation and visualization
- Domain accuracy held at 100% on evaluated queries: grounding every response in retrieved documents, together with the relevance filter, kept hallucinated answers out of the evaluated set
- Knowledge access expanded to include previously inaccessible document formats, increasing the effective size of the queryable knowledge base significantly
Team
Engagement: 3 months, 3 engineers (1 AI/ML, 1 backend, 1 data engineering).
Stack: Python, LangChain, OpenAI GPT-4, FastAPI, ChromaDB, pytesseract, Azure, Docker