How To Query Your Database With AI: A Comprehensive Guide


Introduction

Imagine your business teams having instant access to the data they need without having to reach out to your engineers or data scientists for simple data pulls. Whether it's customer success managers needing quick insights or marketing teams tracking campaign performance, AI-powered database querying can make your entire organization more self-sufficient and agile.

With AI, natural language requests can be transformed into accurate SQL queries, providing immediate answers and freeing up your data team for more critical tasks. This guide explores the complexities of building such a system and what you need to consider when creating a tool that your business teams can use to access your database.

Here's what we'll cover:

Understanding the basics of database querying with AI
The complexities of building an AI-powered query system
- Securely accessing the database
- Selecting relevant tables and columns
- Training the system
- Ensuring safe and effective AI-generated queries
- Analyzing results
The value of using an off-the-shelf solution like Locusive
Conclusions and why you might want to use Locusive

Let’s dive into how AI can revolutionize your database querying process and empower your teams.

What is AI-Powered Database Querying?

AI-powered database querying allows users to interact with their databases using natural language. Instead of writing complex SQL queries, users can simply ask questions in plain English, and AI translates these requests into precise SQL commands. This capability is incredibly valuable for non-technical users, as it eliminates the need for specialized knowledge of database languages.

At its core, AI-powered database querying is all about making data accessible to everyone in your organization, regardless of their technical expertise. It's like having a personal data assistant that understands your questions and knows exactly where to find the answers. And if you want to use ChatGPT, or any other LLM, across all your internal data, that system needs to include your internal database so the AI can find the right answers in the right place among the various data systems you use.

Ease of Use for Non-Technical Users

One of the main advantages of AI-powered querying is its accessibility. Non-technical users, such as business analysts, marketing teams, or customer success managers, can retrieve complex data without needing to write SQL. This democratizes data access, allowing teams to make data-driven decisions quickly and independently.

AI-powered database query tools should be designed to be intuitive and user-friendly, enabling anyone to ask questions and get answers without having to learn a new language or rely on the data team for assistance.

Here are a few examples of how non-technical users can benefit from AI-powered database querying:

A sales representative asking, "Which customers had the highest order value last quarter?"
An HR manager requesting, "What was the average time to fill open positions this year?"
A customer success manager inquiring, "How many support tickets were resolved within 24 hours last week?"

With AI-powered database querying, these questions can be answered quickly, empowering teams to make data-driven decisions and take action promptly. This eliminates the need to wait for reports or rely on intuition, providing clear and accurate insights at users' fingertips. In addition, integrating your database querying system into a simple interface, like a chatbot that your company already uses, makes it seamless for your employees to get the data they need quickly and easily.

However, the process of how AI knows which data to access and how to generate the correct queries is complex. In the next section, we will explore the intricacies of building an AI-powered query system, including the challenges of creating a tool that can securely and effectively query a database using AI.

The Complexities of Building an AI-Powered Query System

Building an AI-powered database querying system is a complex undertaking that requires careful consideration of various factors, such as security, data relevance, error correction, and AI training. In this section, we’ll dive deeper into each of these challenges and the best practices associated with creating a robust and efficient AI-powered query system.

Securely accessing the database

One of the most critical aspects of building an AI-powered database querying system is ensuring the security of database credentials. These credentials, which typically include usernames, passwords, and database connection strings, must be protected from unauthorized access to maintain the integrity and confidentiality of the data.

Encryption plays a vital role in securing database credentials. By encrypting the credentials before storing them, you can prevent potential attackers from accessing sensitive information even if they manage to breach your system. Store the encrypted credentials and the keys used to encrypt them in separate systems, so that a breach of one does not expose both.

In addition, the database user that the system connects as should have read-only access, and it’s not a bad idea to also create a read-only replica of your main database so that the system doesn’t run queries against your live production database.
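As a minimal sketch of the read-only idea, the example below uses Python's built-in sqlite3 module and SQLite's `mode=ro` URI flag. The `orders` table and the `open_read_only` helper are illustrative assumptions; on a server database such as PostgreSQL or MySQL you would more likely enforce this with a dedicated user that is only granted SELECT.

```python
import os
import sqlite3
import tempfile


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open a SQLite database so that any write attempt raises an error."""
    # mode=ro is SQLite's URI flag for a read-only connection.
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def demo() -> tuple[int, bool]:
    """Create a throwaway database, then show that reads work and writes fail."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE orders (id INTEGER, total REAL)")  # hypothetical table
    setup.execute("INSERT INTO orders VALUES (1, 19.99)")
    setup.commit()
    setup.close()

    conn = open_read_only(path)
    row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    try:
        conn.execute("DELETE FROM orders")
        write_blocked = False
    except sqlite3.OperationalError:
        # SQLite rejects the write because the connection is read-only.
        write_blocked = True
    finally:
        conn.close()
    return row_count, write_blocked
```

Enforcing the restriction at the connection or permission level means a bad query fails at the database itself, regardless of what the LLM generates.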

Selecting Relevant Tables and Columns

Another crucial aspect of building an AI-powered database querying system is determining which tables and columns the AI should have access to. The selection of relevant data sources directly impacts the accuracy and performance of the generated queries, as well as the overall efficiency of the system.

Selecting too many tables and columns can lead to increased token counts, higher costs, and less accurate results (due to the LLM getting confused or forgetting key details), but not providing the right tables and columns to the system will prevent it from knowing how to properly query your data.

When choosing tables and columns for your AI to access, consider the following criteria:

Relevance to user queries: The selected tables and columns should contain the data that is most likely to be requested by users. This requires a thorough understanding of your organization's data needs and the types of questions that users are likely to ask. By focusing on the most relevant data sources, you can ensure that the AI has the necessary information to generate accurate and meaningful queries.
Data quality and consistency: The AI-powered query system relies on the quality and consistency of the data it accesses. Inconsistent or incomplete data can lead to inaccurate query results and diminish user trust in the system. Therefore, it is essential to select tables and columns that are well-maintained, regularly updated, and adhere to data quality standards.
Performance considerations: The size and complexity of the selected tables and columns can significantly impact the performance of the AI-powered query system. Large tables with numerous columns can slow down query generation and execution, leading to longer wait times for users. To ensure optimal performance, it is recommended to select tables and columns that are appropriately indexed and optimized for querying.
Security and privacy: When selecting data sources for the AI to access, it is crucial to consider the security and privacy implications. Sensitive or confidential information should be carefully evaluated and, if necessary, excluded from the AI's access. This may involve implementing additional security measures, such as data masking or tokenization, to protect sensitive data while still allowing the AI to generate relevant queries.

The impact of selecting relevant tables and columns on query accuracy and performance cannot be overstated. By providing the AI with access to the most pertinent data sources, you can improve the precision of the generated queries and reduce the likelihood of irrelevant or incorrect results. This, in turn, enhances the user experience and increases the adoption of the AI-powered query system within your organization.
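One way to apply these criteria in code is to filter the schema before it ever reaches the LLM. The sketch below assumes the schema is already available as a plain dictionary of table names to column names. The table names, the `filter_schema` and `render_schema` helpers, and the `table.column` blocklist format are all hypothetical.

```python
Schema = dict[str, list[str]]


def filter_schema(schema: Schema, allowed_tables: set[str], blocked_columns: set[str]) -> Schema:
    """Keep only allowlisted tables and drop columns named as 'table.column'."""
    return {
        table: [col for col in columns if f"{table}.{col}" not in blocked_columns]
        for table, columns in schema.items()
        if table in allowed_tables
    }


def render_schema(schema: Schema) -> str:
    """Render the filtered schema as compact text for inclusion in a prompt."""
    return "\n".join(
        f"{table}({', '.join(columns)})" for table, columns in sorted(schema.items())
    )


# Hypothetical schema: the audit log and the SSN column never reach the LLM.
FULL_SCHEMA: Schema = {
    "customers": ["id", "name", "ssn"],
    "orders": ["id", "customer_id", "total"],
    "audit_log": ["id", "payload"],
}
VISIBLE = filter_schema(FULL_SCHEMA, {"customers", "orders"}, {"customers.ssn"})
```

An allowlist for tables is the more conservative choice here: a newly added table stays hidden until someone decides to expose it.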

An example of how Locusive allows you to include or exclude tables from your database

By carefully selecting the relevant tables and columns for your AI-powered query system, you can strike a balance between accuracy, performance, and security. This lays the foundation for a reliable and efficient system that meets the needs of your organization and empowers users to access the data they require quickly.

In the next section, we will explore the process of training the AI system to understand and generate accurate queries based on the selected data sources.

Training the system

Despite how powerful LLMs have become, they won’t know the ins and outs of your system without a lot of context. Everyone’s database is different, and every database has its own nuances and complications when it comes to pulling the right data. You need to provide an LLM with enough context to create the right query for any given situation.

We have three main suggestions for doing so:

Create a reference guide that provides a high-level overview or manual for the LLM
Use selected example queries that help the LLM understand how to pull data
Inject the schema on every request

Note that in the remainder of this guide, we’ll assume that you haven’t fine-tuned your own LLM on your database. If you have, many of the guidelines below may not be as important, though you may still want to consider implementing them based on how nuanced your data is or how well your fine-tuned model is performing.

Reference guides

A reference guide is essentially a manual that describes your database structure, tables, columns, and their relationships. It can also provide other data or context that could be necessary for an LLM to understand how to access your data. The guide serves as a foundation for the AI to understand the organization and purpose of your data, enabling it to generate more accurate and contextually relevant queries.

When creating a reference guide, consider including the following elements:

A high-level overview of the tables and columns and the structure of your database
Explicit instructions about when to join, and when not to join, certain tables
Examples or distributions of the data stored in certain columns, particularly if those values never change or the columns are used for important queries (e.g., if your database has a list of timezones associated with locations for major employees, you could provide that list of timezones in the reference guide)

Ultimately, a reference guide grounds the LLM with the right instructions for your database, and it should be provided on every request. But a reference guide can only take you so far. LLMs have been shown to respond extremely well to examples, and if you provide example queries in addition to your reference guide, you’ll likely see higher accuracy with your queries.

Example queries

Example queries serve as a learning mechanism for the AI, allowing it to understand how to construct effective queries based on user input.

When crafting example queries, consider the following best practices:

Cover a wide range of use cases: Provide example queries that cover a diverse range of use cases and query patterns. This helps the AI learn how to handle different types of user requests and generate appropriate queries accordingly.
Include query descriptions or a description of the question that each query is answering: Alongside each example query, provide a clear description of what the query does, the expected results, and any relevant context. These descriptions help the AI understand the intent behind each query and how to map user requests to similar query patterns.
Only include the most relevant queries: You may end up having hundreds, or potentially thousands, of example queries to choose from, but including all those examples is a big no-no. Not only will they likely not fit into the context window of the LLM, they’ll likely also confuse the LLM significantly. Instead, what you should do is use semantic search to identify the most relevant queries for the user’s question and inject those into the prompt. Note that we’ve seen better performance when the most relevant queries are placed towards the end of a very long prompt (or the beginning of a very short prompt).
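The selection step in the last point can be sketched as follows. To stay self-contained, this example uses a simple word-overlap cosine similarity as a stand-in for real semantic search, where a production system would compare embedding vectors from an embedding model instead. The `select_examples` helper and the sample queries are hypothetical.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Bag-of-words vector; a placeholder for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(question: str, examples: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return the k most similar (description, sql) pairs, most relevant last."""
    q = _vectorize(question)
    ranked = sorted(examples, key=lambda ex: _cosine(q, _vectorize(ex[0])), reverse=True)
    # Reverse so the best match sits at the end of a long prompt.
    return list(reversed(ranked[:k]))


# Hypothetical example library: (description, SQL) pairs.
EXAMPLES = [
    ("Total revenue by product last quarter", "SELECT product, SUM(revenue) FROM order_items GROUP BY product"),
    ("Average time to fill open positions", "SELECT AVG(days_open) FROM positions"),
    ("Support tickets resolved within 24 hours", "SELECT COUNT(*) FROM tickets WHERE hours_to_resolve <= 24"),
]
```

Calling `select_examples("How many support tickets were resolved within 24 hours last week?", EXAMPLES)` places the support-ticket example last, closest to the user's question in the final prompt.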

An example of how Locusive's system allows you to provide example queries to the LLM

By providing a diverse set of example queries and leveraging embedding search, you can significantly improve the AI's ability to understand user requests and generate accurate, relevant queries.

Injecting the schema on every request

LLMs are like web servers: they forget about the last request as soon as they’ve served it. Contrary to popular belief, LLMs won’t learn about you or your company or your system unless they are explicitly trained with this data. Most folks think that as soon as you send an LLM some information, it memorizes it and learns it forever.

But that’s not true. Every request you make to an LLM that hasn’t been fine-tuned draws only on what the model learned during its most recent training run, plus whatever you include in the prompt. Providing your database schema on every request gives the LLM the final piece of the puzzle it needs to create an accurate query on your data.

By combining knowledge of how your schema is structured (excluding any tables or columns that you’ve removed according to the step above) along with example queries and a general reference guide, a smart LLM will usually be able to generate an accurate query to help your users answer their questions.
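To make the combination concrete, here is a minimal sketch of assembling one prompt per request from the three ingredients. The section labels, the instruction wording, and the `build_prompt` helper are assumptions for illustration. The actual call to an LLM is deliberately left out, since it depends on your provider.

```python
def build_prompt(
    reference_guide: str,
    schema_text: str,
    examples: list[tuple[str, str]],
    question: str,
) -> str:
    """Assemble the full context for a single, stateless LLM request."""
    example_text = "\n\n".join(f"-- {description}\n{sql}" for description, sql in examples)
    sections = [
        "Translate the question into a single read-only SQL query.",
        "REFERENCE GUIDE\n" + reference_guide,
        "SCHEMA\n" + schema_text,
        "EXAMPLE QUERIES\n" + example_text,
        # The question goes last, right after the most relevant examples.
        "QUESTION\n" + question,
    ]
    return "\n\n".join(sections)


# Hypothetical inputs, rebuilt and sent in full on every request.
PROMPT = build_prompt(
    reference_guide="Join orders to customers on orders.customer_id = customers.id.",
    schema_text="customers(id, name)\norders(id, customer_id, total)",
    examples=[("Total order value per customer", "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id")],
    question="Which customers had the highest order value last quarter?",
)
```

Because nothing is remembered between requests, the whole prompt is rebuilt each time, which also means schema changes take effect immediately.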

Training your AI-powered query system is an iterative process that requires continuous refinement and optimization. By creating a comprehensive reference guide, providing diverse example queries, and managing context windows effectively, you can build a robust and accurate system that empowers your users to access the data they need quickly and easily.

In the next section, we will discuss the importance of ensuring safe and effective AI-generated queries, and explore techniques to prevent common pitfalls and errors.

Ensuring Safe and Effective AI-Generated Queries

While AI-powered database querying offers numerous benefits, it's crucial to implement safeguards and best practices to ensure the generated queries are both safe and effective. This section will explore techniques for preventing unintended database modifications, avoiding common query pitfalls, and integrating AI-generated queries with data analysis.

Implementing Hard Checks

One of the primary concerns when using AI-powered database querying is the potential for unintended database modifications. To mitigate this risk, it's essential to implement hard checks that prevent the AI from altering the underlying database.

Read-only access: As mentioned above, you should ensure that the AI system only has read-only access to the database. This prevents any accidental or malicious attempts to modify, delete, or insert data. By restricting the AI's permissions, you can maintain the integrity of your database while still allowing the AI to generate queries for data retrieval.
Query validation: Implement a query validation layer that inspects the AI-generated queries before executing them. This layer should check for any statements that attempt to modify the database, such as INSERT, UPDATE, DELETE, DROP, etc., and block their execution. By validating queries before they reach the database, you can catch and prevent any unintended modifications.
You could also use an LLM to inspect the SQL produced by a previous LLM call and return a warning or a danger score indicating whether the query would have negative side effects. Doing something like this is relatively easy when building an autonomous agent.
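A validation layer like the one described above could start as simply as the sketch below. It is a coarse, keyword-based check written for illustration. The `is_read_only` helper and its keyword list are assumptions, and a production system would more likely rely on a real SQL parser in addition to read-only database permissions.

```python
import re

# Statement keywords that modify data or structure (illustrative, not exhaustive).
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE", "CREATE", "GRANT", "REVOKE", "MERGE"}


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no write keywords."""
    # Strip line and block comments so they cannot hide or fake keywords.
    no_comments = re.sub(r"--[^\n]*|/\*.*?\*/", " ", sql, flags=re.S)
    # Blank out string literals so words inside quotes are ignored.
    no_strings = re.sub(r"'(?:[^']|'')*'", "''", no_comments)
    body = no_strings.strip().rstrip(";").strip()
    if not body or ";" in body:
        return False  # empty input or more than one statement
    tokens = re.findall(r"[A-Za-z_]+", body.upper())
    if not tokens or tokens[0] not in {"SELECT", "WITH"}:
        return False
    return FORBIDDEN.isdisjoint(tokens)
```

The check errs on the side of rejecting: a query using a column that is literally named `update` would be blocked, which is usually an acceptable trade-off for a safety backstop.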

In addition to preventing database modifications, it's important to safeguard against common query pitfalls that can lead to incorrect results or poor performance. Some common pitfalls include:

OR filters without parentheses: When using OR conditions in a WHERE clause, failing to enclose them in parentheses can lead to unexpected results. Ensure that the AI-generated queries properly use parentheses to maintain the intended logic.
Misuse of LIKE, ANY, or LIMIT filters: Overusing or improperly applying LIKE, ANY, or LIMIT filters can significantly impact query performance and result in incomplete or incorrect data retrieval. Implement checks to detect the misuse of these filters and provide guidance to the AI system on their proper usage.

By implementing these hard checks and safeguards, you can ensure that the AI-generated queries are safe and effective, minimizing the risk of unintended database modifications and common query pitfalls.
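The OR pitfall is easy to reproduce. In SQL, AND binds more tightly than OR, so a filter without parentheses can silently match extra rows. The sketch below demonstrates this with an in-memory SQLite database; the `orders` table and its values are made up for the example.

```python
import sqlite3

ROWS = [("EU", "paid"), ("EU", "refunded"), ("US", "paid"), ("US", "refunded")]

# Parsed as: region = 'EU' OR (region = 'US' AND status = 'paid')
WITHOUT_PARENS = (
    "SELECT COUNT(*) FROM orders "
    "WHERE region = 'EU' OR region = 'US' AND status = 'paid'"
)
# Intended logic: paid orders from either region.
WITH_PARENS = (
    "SELECT COUNT(*) FROM orders "
    "WHERE (region = 'EU' OR region = 'US') AND status = 'paid'"
)


def count_matches() -> tuple[int, int]:
    """Return row counts for the unparenthesized and parenthesized filters."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", ROWS)
    without = conn.execute(WITHOUT_PARENS).fetchone()[0]
    with_parens = conn.execute(WITH_PARENS).fetchone()[0]
    conn.close()
    return without, with_parens
```

Here the first query counts 3 rows because it also includes the refunded EU order, while the parenthesized version counts the 2 paid orders that were actually intended.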

Combining SQL Queries with Data Analysis

While AI-generated SQL queries are powerful for retrieving data, the real value lies in combining these queries with further data analysis. By integrating AI-generated queries with analytical tools and workflows, you can unlock deeper insights and make data-driven decisions.

Here's an example workflow that demonstrates the integration of AI-generated queries with data analysis:

User input: A sales manager asks, "What are the top 5 products by revenue in the last quarter?"
AI-generated query: The AI system generates a SQL query based on the user's input.

Query execution: The AI-generated query is executed against the database, retrieving the top 5 products by revenue for the specified quarter.
Report generation or Q&A: The retrieved data is then passed to a reporting tool or rendered as a visualization, such as a bar chart or pie chart, to provide a clear and visually appealing representation of the top products. Alternatively, an AI system can further analyze the results and provide insights or recommendations. For example, it may suggest focusing marketing efforts on the top-performing products or investigating the reasons behind the lower revenue of certain products.
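The workflow above can be sketched end to end with an in-memory SQLite database. The query shown is only an illustration of what an LLM might produce for this question. The `order_items` table, its columns, and the quarter boundaries are hypothetical, and a real query depends entirely on your schema.

```python
import sqlite3

# Hypothetical query an LLM might generate for the sales manager's question.
TOP_PRODUCTS_SQL = """
SELECT product, SUM(revenue) AS total_revenue
FROM order_items
WHERE order_date >= ? AND order_date < ?
GROUP BY product
ORDER BY total_revenue DESC
LIMIT 5
"""


def top_products(conn: sqlite3.Connection, start: str, end: str) -> list[tuple[str, float]]:
    """Run the query for the half-open date range [start, end)."""
    return conn.execute(TOP_PRODUCTS_SQL, (start, end)).fetchall()


def demo() -> list[tuple[str, float]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE order_items (product TEXT, revenue REAL, order_date TEXT)")
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [
            ("Widget", 120.0, "2024-01-15"),
            ("Widget", 80.0, "2024-02-03"),
            ("Gadget", 150.0, "2024-03-20"),
            ("Gizmo", 60.0, "2024-02-11"),
            ("Gadget", 500.0, "2024-04-02"),  # falls outside the quarter
        ],
    )
    rows = top_products(conn, "2024-01-01", "2024-04-01")
    conn.close()
    return rows
```

The rows returned by `demo()` are what you would hand to a charting library, or pass back to the LLM along with the original question for a written summary.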

This type of analysis is easy to do with Locusive’s AI-powered assistant.

By combining AI-generated queries with data analysis and visualization, you can transform raw data into actionable insights. This integration empowers users to make informed decisions and uncover valuable patterns or trends that may not be immediately apparent from the raw query results.

Retrying Queries with Parsing Errors or No Results

Despite best efforts in training and optimizing an AI-powered query system, there may be instances where the generated queries contain parsing errors or return no results. In such cases, it's important to have a mechanism in place to retry the query generation process and provide meaningful feedback to the user.

Here are some steps to handle parsing errors or empty result sets:

Error detection: Implement error handling logic that captures any parsing errors or empty result sets during query execution. This can be done by analyzing the database response or error messages.
Retry mechanism: If a parsing error or empty result set is detected, trigger a retry mechanism that sends the user's input back to the AI system for another attempt at generating a valid query. This retry process can be repeated a fixed number of times or until a successful query is generated.
Query refinement: During the retry process, the AI system can analyze the previous query attempts and make adjustments or refinements to improve the chances of generating a valid and effective query. This may involve modifying the query structure, adding or removing filters, or adjusting the table and column selections.
User feedback: If the retry mechanism fails to generate a valid query after multiple attempts, provide meaningful feedback to the user. This feedback should explain the issue encountered and suggest possible ways to rephrase or clarify their input. By engaging the user in the query refinement process, you can improve the overall success rate of AI-generated queries.
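The steps above can be sketched as a small retry loop. In this example, `generate_sql` is a hypothetical stand-in for your LLM call: it receives the user's question plus feedback about the previous failed attempt. The function name, its signature, and the wording of the feedback are assumptions.

```python
import sqlite3
from typing import Callable, Optional


def run_with_retries(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> list[tuple]:
    """Generate and run a query, retrying on errors or empty results."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Error detection: pass the database error back for refinement.
            feedback = f"The query {sql!r} failed with: {exc}"
            continue
        if rows:
            return rows
        feedback = f"The query {sql!r} returned no rows."
    # User feedback: give up after a fixed number of attempts.
    raise LookupError(
        f"No answer after {max_attempts} attempts. Try rephrasing the question. "
        f"Last issue: {feedback}"
    )
```

An empty result is not always an error, since some questions legitimately have no matching rows, so you may want to surface the final empty result to the user rather than treat it as a failure in every case.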

By implementing a retry mechanism and providing user feedback, you can enhance the resilience and effectiveness of your AI-powered query system. This approach helps to mitigate the impact of parsing errors or empty result sets, ensuring that users can still obtain the desired information even if the initial query generation attempt fails.

In the next section, we will explore the value of using an off-the-shelf solution like Locusive for AI-powered database querying, and how it can simplify the implementation process and provide additional benefits.

The Value of Using an Off-the-Shelf Solution like Locusive

Building an AI-powered database querying system from scratch is no small feat. It requires significant time, effort, and financial investment to develop and maintain a robust and reliable system that can handle the complexities of natural language processing, query generation, and data analysis.

Consider the various components and challenges involved in building such a system:

Developing and training AI models for natural language understanding and query generation
Implementing secure database connectivity and credential management
Designing and maintaining a user-friendly interface for non-technical users
Ensuring the system's scalability and performance as data volumes and user requests grow
Continuously monitoring and updating the system to address any vulnerabilities or errors

These tasks require a dedicated team of AI experts, software engineers, and database administrators, which can be costly and time-consuming for many organizations. Moreover, the ongoing maintenance and improvement of the system can strain internal resources and divert focus from core business objectives.

This is where an off-the-shelf solution like Locusive comes in. Locusive is a comprehensive, ready-to-use AI-powered database querying system that eliminates the need for extensive in-house development and maintenance. By leveraging Locusive, organizations can quickly and easily implement a powerful AI-driven querying solution without the associated complexities and costs.

Key Features of Locusive

Locusive offers a range of advanced features and capabilities that set it apart from traditional querying methods and in-house developed systems:

Autonomous agent capabilities: Locusive employs sophisticated AI algorithms that can understand and process natural language queries, generate optimized SQL queries, and retrieve relevant data from your database. This autonomous agent approach allows users to interact with the system using plain English, making it accessible to non-technical users across your organization.
Integration with internal knowledge bases and databases: Locusive seamlessly integrates with your existing knowledge bases and databases, allowing it to access and analyze data from multiple sources. This integration enables Locusive to provide comprehensive and accurate answers to user queries, drawing insights from the full breadth of your organization's data assets.
Real-time data insights and decision-making support: Locusive goes beyond simple query execution by offering real-time data insights and decision-making support. The system can analyze query results, identify trends and patterns, and provide actionable recommendations based on the data. This enables users to make informed decisions quickly and confidently, without the need for manual data analysis or interpretation.

A preview of the various integrations that Locusive offers, including a database integration

In addition to these key features, Locusive also provides:

A user-friendly interface that allows non-technical users to easily input queries and receive results
Secure authentication and access control mechanisms to protect sensitive data
Scalability to handle large volumes of data and concurrent user requests
Continuous updates and improvements to ensure optimal performance and security

By choosing Locusive as your AI-powered database querying solution, you can:

Save time and resources on development and maintenance
Empower non-technical users to access and analyze data independently
Improve decision-making speed and accuracy with real-time insights
Enhance data security and compliance with built-in safeguards and access controls
Scale your querying capabilities as your data and user needs grow

Locusive offers a cost-effective and efficient way to harness the power of AI for database querying, allowing you to focus on your core business objectives while still reaping the benefits of advanced data analysis and insights.

Conclusion and Call to Action

Throughout this comprehensive guide, we’ve explored the benefits of AI-powered database querying and the challenges that come with building such a system from scratch. There’s no question that getting your employees the data they need will make them more efficient and productive, helping you grow your business and service your clients more effectively. But building a system like this can take a lot of time and money.

By using Locusive, your organization can quickly and easily unlock the full potential of your data, enabling non-technical users to access and analyze information independently. This not only saves time and resources but also fosters a data-driven culture where decisions are based on accurate, up-to-date insights.

Locusive's AI-powered database querying capabilities offer numerous benefits, including:

Increased efficiency and productivity, as users can quickly obtain the information they need without relying on technical experts
Improved decision-making speed and accuracy, thanks to real-time data insights and actionable recommendations
Enhanced data security and compliance, with built-in safeguards and access controls
Scalability to accommodate growing data volumes and user demands, ensuring that your querying capabilities keep pace with your organization's needs

If you’re interested in using Locusive with your own database, feel free to request a free trial.

Our team will get you onboarded, and can even help with creating reference guides and example queries. We work consultatively and our customers’ success is important to us, which is why we tend to spend a lot of time helping customers create the right training instructions, queries, and reference guides for their data.
