%% > Published `=dateformat(this.created, "MMM dd, yyyy")` %%
> Published Jan 16, 2024

# Chat With your SQL Database Using Langchain

![[datachat.webp | 300]]

___

## Background: The Challenge of Complex Data Interactions

In the digital age, the ability to swiftly access and analyze data is crucial. I found myself in need of a solution that could navigate intricate Entity Relationship Diagrams, manage extensive data joins, and incorporate external data sources. The task was daunting: build a chatbot that could handle such complex data interactions without the user having to dive into the depths of SQL queries.

## Tools: Embracing Langchain for Its Versatility

The choice of Langchain for this project was a no-brainer. Its rich ecosystem, complete with out-of-the-box Large Language Model (LLM) and database connectors, made it an ideal fit. While I opted for the ChatGPT model, Langchain's flexibility meant I wasn't confined to just that. This adaptability was key in accommodating different data analysis needs.

## Setting Up the Environment

To provide an example, I set this up from scratch so anyone can follow along:

### Step 1: Establishing a Local SQL Server

First, we need a dummy database. Assuming you have Docker installed, we can spin up a local SQL Server instance with this command:

```sh
docker run -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=password@1234" \
 -p 1433:1433 --name sql --hostname sql \
 -d \
 mcr.microsoft.com/mssql/server:2022-latest
```

Then connect to it with any DB management tool you have. The server is `localhost`, the username is `sa`, and the password is whatever you chose above, which in our case is `password@1234`. Make sure SSL and certificate trust are selected.

> I personally recommend [Azure Data Studio](https://azure.microsoft.com/en-us/products/data-studio/), which is completely free and cross platform with lots of community DB extensions, and it has an out-of-the-box connector for SQL Server.

### Step 2: Preparing the Database

Once the SQL Server is in place, the next step is to preload it with a sample database. For this tutorial, I used the Northwind Traders database, a popular example database that mimics a real-world e-commerce data store.

First, create the database:

```sql
CREATE DATABASE "Northwind"
```

Then create the tables and preload data into them using [this SQL query](https://github.com/microsoft/sql-server-samples/blob/master/samples/databases/northwind-pubs/instnwnd.sql).

### Step 3: Choosing the Right LLM

For this project, I chose an OpenAI model since they are the most popular. More specifically, I used GPT-3.5 Turbo: while it's much cheaper and less capable than the newer GPT-4 model, it is still more reliable than most other models and packs enough power for our purposes. The good thing is that with Langchain you can upgrade to any newer GPT model, or even switch to other popular models such as Anthropic's or open-source ones.

In my case I used Azure OpenAI `gpt-35-turbo` because I have free monthly Azure credits through MSDN. You can sign up for a free account at [OpenAI](https://platform.openai.com) or register for Azure OpenAI through the [Azure OpenAI portal](https://oai.azure.com).

## The Python Program: A Gateway to Simplified Data Analysis

I've prepared a complete Python notebook, available for download [here](https://starspak.blob.core.windows.net/blog-media/202401162145200680langchain-sql-analytics.ipynb), which serves as a practical guide. This program illustrates how to integrate Langchain with our SQL setup, enabling seamless data analytics chat.
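Before going through the notebook, it can be worth confirming that the database from Steps 1 and 2 is actually reachable from Python. The short sanity check below is not part of the notebook; it is a minimal sketch that assumes `sqlalchemy` and `pymssql` are installed and that the standard Northwind table names (`Customers`, `Orders`, `Products`) were loaded by the script above.

```python
# Minimal sanity check (not part of the notebook): confirm the local SQL Server
# from Step 1 is reachable and the Northwind tables from Step 2 are populated.
# Assumes `pip install sqlalchemy pymssql` and SQLAlchemy 2.x (the same API the notebook uses).
import sqlalchemy as sa

url = sa.URL.create(
    "mssql+pymssql",
    username="sa",
    password="password@1234",  # the MSSQL_SA_PASSWORD from the docker command above
    host="localhost",
    database="Northwind",
)
engine = sa.create_engine(url)

with engine.connect() as conn:
    for table in ("Customers", "Orders", "Products"):
        count = conn.execute(sa.text(f"SELECT COUNT(*) FROM [{table}]")).scalar()
        print(f"{table}: {count} rows")
```

If the row counts come back non-zero, the Langchain agent in the next section will have data to work with.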
### Going through the notebook

We need to install langchain and langchain-openai, as well as sqlalchemy for talking to the SQL Server database:

```python
!pip -q install langchain
!pip -q install langchain_openai
!pip -q install sqlalchemy
```

We will import modules as follows. Make sure you import the right chat model according to the [langchain documentation](https://python.langchain.com/docs/integrations/chat/). Since I used Azure OpenAI, I followed the [AzureChatOpenAI](https://python.langchain.com/docs/integrations/chat/azure_chat_openai) instructions, but if you use OpenAI directly you would use [ChatOpenAI](https://python.langchain.com/docs/integrations/chat/openai) or any other model listed there:

```python
import os

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents.agent_types import AgentType
from langchain_openai import AzureChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain.prompts.chat import ChatPromptTemplate
import sqlalchemy as sa
```

To set up the environment variables we will use the following block. For OpenAI and Azure OpenAI the names of these environment variables matter, so make sure to follow the documentation:

```python
os.environ["AZURE_OPENAI_ENDPOINT"]="your endpoint" # Your Azure OpenAI resource endpoint
os.environ["AZURE_OPENAI_API_KEY"]="your key" # Your Azure OpenAI resource key
os.environ["SQL_SERVER"]="localhost" # Your SQL server name
os.environ["SQL_DB"]="northwind"
os.environ["SQL_USERNAME"]="sa" # SQL server username
os.environ["SQL_PWD"]="password@1234" # SQL server password
```

Then this block connects to the SQL database with `sqlalchemy` and Langchain's `SQLDatabase` module, using the `pymssql` driver:

```python
connection_url = sa.URL.create(
    "mssql+pymssql",
    username=os.getenv('SQL_USERNAME'),
    password=os.getenv('SQL_PWD'),
    host=os.getenv('SQL_SERVER'),
    database=os.getenv('SQL_DB'))

db_engine = sa.create_engine(connection_url)
db = SQLDatabase(db_engine)
```

I ran into some issues with pymssql on Apple Silicon (an M3 machine), and running this script fixed my notebook error (assuming you have Homebrew and Python 3.11 like I do):

```sh
brew install FreeTDS
export CFLAGS="-I$(brew --prefix openssl)/include"
export LDFLAGS="-L$(brew --prefix openssl)/lib -L/usr/local/opt/openssl/lib"
export CPPFLAGS="-I$(brew --prefix openssl)/include"
pip install --pre --no-binary :all: pymssql --no-cache
```

And last but not least, the final block puts everything together:

```python
# Create the chat model, the SQL toolkit, and the prompt, then wire them into an agent.
llm = AzureChatOpenAI(openai_api_version="2023-05-15", azure_deployment="gpt-35-turbo")

sql_toolkit = SQLDatabaseToolkit(db=db, llm=llm)
sql_toolkit.get_tools()

prompt = ChatPromptTemplate.from_messages(
    [
        ("system",
        """
        You are a very intelligent AI assistant who is expert in identifying relevant questions from the user and converting them into sql queries to generate the correct answer.
        Please use the below context to write the microsoft sql queries, don't use mysql queries.
        context:
        you must query against the connected database, you can use the database name as northwind.
        As an expert you must use joins whenever required, make sure to look into the order details table for further order questions.
        """
        ),
        ("user", "{question}\
ai: ")
    ]
)

agent = create_sql_agent(
    llm=llm,
    toolkit=sql_toolkit,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_execution_time=100,
    max_iterations=1000)

# Keep asking questions until the user submits an empty prompt.
while True:
    question = input("Enter db query (or Enter to quit): ")
    if not question:
        break
    agent.run(prompt.format_prompt(question=question))
```

A few key points on this code block:

1. The system prompt (or meta prompt) is crucial for setting the context. If your table names are not straightforward, clarify them and provide as much context as possible to get better results. In this prompt we only have one input parameter, `"question"`, but as you can imagine you can have multiple parameters, including some that are set dynamically at runtime (in a user interface, for instance).
2. It is common for a data analytics job to query multiple times and perform double and triple checks to make sure the correct dataset is returned before LangChain responds. `create_sql_agent` is flexible, so make sure to play with its parameters to get the most desirable outcome.
3. In the last part of the code block we have a while loop that keeps asking questions about the database, so you can "chat with your database"; if you leave the prompt blank, it exits the loop.

## Conclusion: A Glimpse into the Future of Data Analysis

This project is just the beginning of a larger trend. With plug-and-play tools like Langchain becoming increasingly accessible, we're on the brink of a revolution in the data analytics industry. The potential for these technologies to streamline data analysis tasks is immense, and it may very well redefine the role of data analysts in the future.