MRLBA: Natural Language Database Queries

Text-to-SQL · RAG · LLM · SQLite · Llama 3.1 · Python

Making Data Queryable Without SQL

This project demonstrates how to bridge the gap between raw data and insight. Instead of requiring SQL knowledge, users ask questions in plain English and get instant answers—no database expertise needed.

The Challenge

Large CSV datasets (1.7GB+) are hard to query without technical skills: SQL itself is a barrier, and most existing solutions lean on cloud services or expensive APIs. I built a system that keeps everything local and makes data exploration trivial.

The Solution: Text-to-SQL RAG

The system converts natural language questions into SQL queries, executes them against a local database, and synthesizes the results back into readable answers. All processing happens on your machine; no data ever leaves it.

```mermaid
graph TB
    A[Large CSV Dataset] -->|Ingestion| B[SQLite Database]
    D[User Question] --> C[Intent Classification]
    C -->|Database Question| E[Generate SQL<br/>via LLM]
    C -->|Irrelevant| F[Reject]
    E --> G[Execute Query<br/>on SQLite]
    G --> H[Query Results]
    H --> I[Synthesize Answer<br/>via LLM]
    I --> J[Natural Language<br/>Response]
    style A fill:#e1f5ff
    style B fill:#c8e6c9
    style J fill:#c8e6c9
```

Architecture

Data Layer: SQLite database created once from CSV. Pandas handles chunked ingestion (100K rows at a time) to manage memory efficiently.
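
A rough sketch of that ingestion step; the file names, table name, and connection handling here are illustrative assumptions rather than the project's exact code:

```python
import sqlite3

import pandas as pd

# Illustrative names; the real project may use different paths and tables.
CSV_PATH = "data.csv"
DB_PATH = "data.db"
TABLE = "records"
CHUNK_SIZE = 100_000  # rows per chunk, matching the description above

conn = sqlite3.connect(DB_PATH)
# Stream the CSV in fixed-size chunks so a 1.7GB+ file never has to
# fit in memory all at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE):
    chunk.to_sql(TABLE, conn, if_exists="append", index=False)
conn.close()
```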

Query Engine: Two-stage LLM pipeline. First stage classifies intent (is this a database question?). Second stage generates SQL and synthesizes results.
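
In outline, that two-stage pipeline might look like the sketch below. It takes a generic llm(prompt) -> str callable (a concrete local version follows under Local LLM); the prompts and the schema string are illustrative assumptions:

```python
import sqlite3
from typing import Callable

# Hypothetical schema text shown to the LLM; the real prompt would carry
# the actual table definition.
SCHEMA = "CREATE TABLE records (...)"

def classify_intent(llm: Callable[[str], str], question: str) -> bool:
    """Stage 1: is this question answerable from the database at all?"""
    verdict = llm(
        f"Schema:\n{SCHEMA}\n\nQuestion: {question}\n"
        "Reply YES if this can be answered from the database, otherwise NO."
    )
    return verdict.strip().upper().startswith("YES")

def answer(llm: Callable[[str], str], conn: sqlite3.Connection, question: str) -> str:
    """Stage 2: generate SQL, run it, and synthesize a readable answer."""
    if not classify_intent(llm, question):
        return "That question doesn't appear to be about this dataset."
    sql = llm(
        f"Schema:\n{SCHEMA}\n\nWrite one SQLite query that answers: {question}\n"
        "Return only the SQL."
    )
    rows = conn.execute(sql).fetchall()
    # Ground the final answer in the rows actually returned by the query.
    return llm(f"Question: {question}\nQuery results: {rows}\nAnswer concisely.")
```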

Local LLM: Ollama + Llama 3.1 runs entirely locally. No cloud API calls, no rate limits, no costs per query.
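
A minimal llm callable for the pipeline sketch above, assuming the ollama Python client is installed and the llama3.1 model has already been pulled:

```python
import ollama

def llm(prompt: str) -> str:
    """Send one prompt to the local Llama 3.1 model via Ollama."""
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]
```

Because the model is served locally, the same callable works offline, with no rate limits and no per-query cost.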

Example Queries

Users can ask questions like:

  • “How many records match these criteria?”
  • “What’s the average value across the dataset?”
  • “Show me entries with X greater than Y”
  • “Which items are in location Z?”

Why This Matters

No SQL needed - Anyone can query the database in natural language.

Private and fast - All processing local, no data sent to APIs, instant results.

Grounded answers - The LLM synthesizes responses based on actual data, not hallucinations.

Simple codebase - Two Python scripts handle everything: ingestion and querying.

Technical Depth

The system handles SQL generation errors gracefully, validates questions before execution, and manages large result sets efficiently. It demonstrates understanding of LLM limitations—classifying intent before generation prevents wasted queries and hallucinated SQL.
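
One hedged way to implement that guarding; the read-only check, row cap, and error message format here are illustrative choices rather than the project's exact ones:

```python
import sqlite3

MAX_ROWS = 200  # cap result sets so huge outputs don't flood the synthesis prompt

def run_query_safely(conn: sqlite3.Connection, sql: str) -> list | str:
    """Execute generated SQL defensively and report failures as text."""
    # Generated SQL should never mutate data; allow read-only statements only.
    if not sql.lstrip().upper().startswith("SELECT"):
        return "Refusing to run non-SELECT SQL."
    try:
        return conn.execute(sql).fetchmany(MAX_ROWS)
    except sqlite3.Error as exc:
        # Surface the error so the caller can ask the LLM to repair the query.
        return f"SQL error: {exc}"
```

Capping rows also keeps the synthesis prompt small, which matters when a generated query accidentally matches millions of rows.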

What This Shows

This project proves I can build practical RAG systems that ground LLMs in real data. It’s not just prompt engineering—it’s architectural thinking about how to make AI systems reliable and useful at scale.
