AI Chatbot Development: Challenges and Learnings

2025.11.17.

As a company developing IT solutions, it is important for us to keep up with the latest technologies. That is why we set ourselves the goal of creating an AI chatbot that helps users by building on large language models (LLMs).

When it came to the chatbot’s focus, we chose the GDPR and the EU AI Act, based on our previous experience. The goal was to create a system that could interpret and answer questions related to data protection legislation, based solely on the knowledge we provided.

AI-chatbot-fejlesztés, AI chatbot development

We launched the project within an R&D framework, specifically for experimental purposes — the emphasis was not on the finished product, but on the journey: learning, gaining experience, and exploring the limits of technology.

Initial Objective and Technology Choices

The first objective was simple: “Answer the question on the topic using the knowledge materials provided.”

Building on the experience of our previous projects, the technology stack was quickly developed:

LLM: OpenAI models – initially chatgpt-4o-mini, later chatgpt-4.1-mini,
Vector-based search: ChromaDB with file system-level storage,
Embedding: text-embedding-3-large,
Framework: LangChain and LangGraph,
Backend: FastAPI (Python),
Frontend: React + NextJS.

From the outset, the goal was not to build from scratch, but to work by building on existing components, through experimentation and iteration.

The First Version – Testing from a New Perspective

The first version of the chatbot forwarded the results obtained from RAG-based searches to the LLM along with the user’s question, adding a short instruction on what kind of answer we expected. During the RAG query, we used the entire user message, which later turned out to be a bad solution. We built a static verification engine to check the messages, which relied on a predefined keyword dictionary.

During the first round of tester feedback, it became clear that the initial solution was not adequate. The testers made legitimate demands of the chatbot that we had not considered at the outset.

Testing an LLM-based chatbot was also a new challenge, as it was completely unfamiliar territory for the testers. This meant that the developers also faced challenges that required a new approach to solve.

Taking the testers’ feedback into account, new requirements came to the fore: The chatbot should

remember previous questions and answers,
remain exclusively within the specified topic,
not execute commands,
not be forced to disclose technical information,
always communicate in a polite and professional tone,
be able to handle everyday conversations (e.g., greetings, follow-up questions, references to earlier questions),
be able to respond in multiple languages,
provide relevant, contextually appropriate responses,
have an appropriate response time (maximum 15-20 seconds),
warn the user not to provide personal information if necessary.

Testers attempted to explore the limits of the chatbot with creative wording that could “trick” the chatbot, taking into account the entire context of the conversation. This type of testing differed significantly from traditional software testing methodologies.

Challenges for Developers

When fulfilling the new requirements presented above, it quickly became clear that developers needed to take a completely new approach. Classic methods, such as static verification, did not work reliably, so the team had to use solutions that were truly suited to the operation of an LLM-based chatbot.

Off-topic filtering – introduction of system prompts

The chatbot had to be controlled so that it only responded within the specified topic.

The development team solved this by defining a system prompt that predetermined what the model could talk about, how it could respond to different situations, and what it should reject.

This solution proved to be much more reliable than the previously used keyword checking.

Memory management – maintaining context

The first version did not remember previous conversations, so there was a lack of continuity.

The development team solved the problem by using LangChain memory components (RunnableWithMessageHistory, InMemoryChatMessageHistory), which are capable of preserving the history of the dialogue, thus making communication more natural.

The result was noticeable: the chatbot was able to use the conversation history when responding, which improved user experience.

Conversational skills – agent-based approach

After introducing memory, it soon became clear that this alone was not enough. The chatbot was unable to handle everyday conversational situations such as asking back questions or referring to previous messages.

To solve this problem, our developers used an agent-based approach instead of the previous manually compiled LLM calls, using the LangChain tool_calling_agent component.

This solution enabled the chatbot to better understand the context, recognize the logical connections in the conversation, and respond more naturally. This made the dialogues more realistic and continuous, but also introduced new security risks: In certain situations, the chatbot disclosed information that it should not have.

The team managed to address this issue by finetuning the system prompt, but even this did not cover all cases.

Model change – reprompting

As the project progressed, the knowledge base managed by the chatbot and the size of the system prompt also grew. The chatgpt-4o-mini model was no longer able to serve our needs at the required speed and quality, so an LLM change was necessary.

Our team initially chose the chatgpt-5-mini model, but communication via the OpenAI API proved to be too slow, with response times often exceeding 30 seconds.

Ultimately, the chatgpt-4.1-mini model proved to be suitable for the task we wanted to use it for.

However, the switch required a major overhaul. The prompts used up to that point had to be completely rewritten, as different models responded differently to the same instruction. Rewriting the prompts entailed further testing and prompt refinement iterations.

Managing command execution

As chatbot communication became more natural, users were able to persuade it to execute commands using various phrases, such as writing recipes or generating prompts.

Our developers solved this by introducing a verification tool that recognizes messages referring to command execution. By refining the system prompt, the chatbot now filters these out before the RAG query and refuses execution if necessary.

This solution has greatly increased the security of the system and prevented the chatbot from being used to perform unwanted actions.

Warning about personal data

We wanted the chatbot to recognize when a user was providing personal data and, in such cases, warn them to avoid providing it.

We achieved the best results by inserting a separate processing step in front of the agent, which automatically removed personal data from the user’s message using a separate LLM call and a dedicated prompt.

As a result, the agent worked with two pieces of information: it knew whether the original message contained personal data, and it received the cleaned message, so it could not return the personal data contained in the original message in its response.

AI Chatbot Development: Mission Accomplished

By the end of the project, a stable, secure, and user-friendly chatbot was created, which uses an LLM- and RAG-based approach.

Several optimization steps were taken during development. In terms of memory management, the development team replaced the RunnableWithMessageHistory solution with the built-in LangChain AgentExecutor function, supplemented with our own session-based memory management. After fine-tuning the RAG parameters and testing several embedding models, we were able to improve both response time and search accuracy.

The completed chatbot is capable of context-dependent and multilingual responses, formulating security alerts, and conducting natural, human-like conversations.

What We Learned

The development process was not only a technological learning experience, but also a learning experience in terms of mindset. The team learned how LLM-based systems work in practice and what unique challenges their development entails. During testing, it became clear that evaluating such systems differs from traditional software testing. The behavior of a chatbot is context-dependent, so testing must also be dynamic, creative, and adaptive.

Creative testing played a key role and attempts to deliberately “trick” the model helped reveal the system’s weaknesses.

From a developer’s perspective, the most important lesson was that the success of an LLM-based system is determined by its architecture. A thorough understanding and conscious combination of components such as prompt handling, RAG, agent mechanisms, and memory is essential for creating a stable and expandable system.

Overall, the project proved that the real challenge of artificial intelligence and large language models lies not in the model itself, but in how we build the system around it.

Do you think your company could use an AI chatbot? Why not discuss this over a cup of coffee?