Natural Language to Database Query System

Category:

Natural Language Processing

Skills:

SQL, NLTK

Problem Context

The goal of this project was to design a system that allows users to query a database using natural language instead of SQL. The challenge was to bridge the gap between human-friendly input and machine-executable queries, demonstrating both NLP parsing skills and database knowledge.

Collection

I created a structured database and defined realistic query tasks.

  • Database contained tables for students, courses, and enrollment (~50k records)

  • Queries included joins, aggregations, and filtering conditions
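As a rough illustration of this setup, the sketch below builds a three-table schema in an in-memory SQLite database and runs one query task that combines a join, an aggregation, and a filter. The column names, the sample rows, and the choice of SQLite are all assumptions made for the example; the write-up only names the three tables.

```python
import sqlite3

# Hypothetical schema: column names are assumptions, only the table names come from the text.
SCHEMA = """
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    major      TEXT
);
CREATE TABLE courses (
    course_id  INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    instructor TEXT
);
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES students(student_id),
    course_id  INTEGER REFERENCES courses(course_id),
    grade      REAL,
    PRIMARY KEY (student_id, course_id)
);
"""

def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
        (1, "Ada", "Computer Science"),
        (2, "Ben", "Mathematics"),
        (3, "Cy", "Computer Science"),
    ])
    conn.executemany("INSERT INTO courses VALUES (?, ?, ?)", [
        (10, "Databases", "Dr. Lee"),
        (20, "Linear Algebra", "Dr. Kim"),
    ])
    conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", [
        (1, 10, 3.7), (2, 20, 3.2), (3, 10, 3.9), (1, 20, 3.5),
    ])
    conn.commit()
    return conn

# One query task: per-course headcount and average grade for a given major.
JOIN_AGG_FILTER = """
SELECT c.title, COUNT(*) AS n_students, AVG(e.grade) AS avg_grade
FROM enrollment AS e
JOIN students AS s ON s.student_id = e.student_id
JOIN courses  AS c ON c.course_id  = e.course_id
WHERE s.major = ?
GROUP BY c.title
ORDER BY c.title
"""

conn = build_demo_db()
rows = conn.execute(JOIN_AGG_FILTER, ("Computer Science",)).fetchall()
for title, n_students, avg_grade in rows:
    print(title, n_students, round(avg_grade, 2))
```

Keeping the schema in one script makes it easy to rebuild the same database for every evaluation run.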

Preparation

I set up the mapping between natural language and database schema.

  • Built a dictionary of schema terms (e.g., “class” → course)

  • Preprocessed input text: tokenization, lemmatization, stopword removal
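The preprocessing steps above can be sketched in a few lines. To stay runnable without downloading any corpora, this example uses small hand-written stand-ins for the tokenizer, lemmatizer, and stopword list instead of NLTK; the tables `STOPWORDS`, `LEMMAS`, and `SCHEMA_TERMS` and the function `preprocess` are hypothetical names with made-up contents.

```python
import re

# Illustrative stand-ins only: a real system would use a full stopword list
# and a proper lemmatizer rather than these tiny tables.
STOPWORDS = {"a", "an", "all", "the", "in", "of", "for", "by", "me", "who", "is", "are"}
LEMMAS = {
    "students": "student",
    "classes": "class",
    "courses": "course",
    "professors": "professor",
}
# Schema dictionary: user vocabulary mapped to schema vocabulary.
SCHEMA_TERMS = {"class": "course", "professor": "instructor", "pupil": "student"}

def preprocess(text):
    """Tokenize, lemmatize, drop stopwords, then map tokens to schema terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    content = [tok for tok in lemmas if tok not in STOPWORDS]
    return [SCHEMA_TERMS.get(tok, tok) for tok in content]

print(preprocess("Show me all the classes taught by professors"))
```

Mapping to schema terms happens after lemmatization so that a single dictionary entry such as "class" also covers "classes".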

Baseline

A simple rule-based parser was built as proof of concept.

  • Example: “List all students in Computer Science” → SELECT * FROM students WHERE major='Computer Science'

  • Worked well for direct keyword mappings
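A rule of this kind can be expressed as a regular expression paired with a SQL template. The sketch below is one possible shape, not the original parser: the pattern `STUDENTS_IN_MAJOR` and the function `parse` are hypothetical, and it passes the major as a bound parameter instead of splicing it into the SQL string, which avoids quoting problems and injection.

```python
import re
import sqlite3

# Hypothetical rule: "list all students in <major>" filters students by major.
STUDENTS_IN_MAJOR = re.compile(r"^list all students in (?P<major>.+)$", re.IGNORECASE)

def parse(question):
    """Return (sql, params) for a recognised question, or None if no rule fires."""
    cleaned = question.strip().rstrip(".?!")
    match = STUDENTS_IN_MAJOR.match(cleaned)
    if match is None:
        return None
    # The user-supplied value travels as a parameter, never as part of the SQL text.
    return "SELECT * FROM students WHERE major = ?", (match.group("major"),)

# Tiny made-up table to run the generated query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ada", "Computer Science"), ("Ben", "Mathematics")],
)

parsed = parse("List all students in Computer Science")
result = conn.execute(*parsed).fetchall() if parsed else []
print(result)
```

Questions that match no rule return `None`, which is the gap a learned model is meant to cover.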

Modeling

I expanded the system to handle more complex queries.

  • Implemented a seq2seq model trained on NL–SQL pairs

  • Used attention mechanisms for mapping natural language to SQL structure

  • Supported nested queries, conditions, and ordering
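To make the role of attention concrete, here is a single decoding step of dot-product attention in plain Python. The write-up does not say which attention variant was used, so dot-product scoring is an assumption, and the two-dimensional vectors are made-up values standing in for learned encoder and decoder states.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoding step.

    Returns the attention weights over input tokens and the context vector,
    which is the weighted sum of the encoder states.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy encoder states for four input tokens (values invented for illustration).
tokens = ["students", "in", "computer", "science"]
encoder_states = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# A decoder state such as might arise while emitting the WHERE value.
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
for tok, w in zip(tokens, weights):
    print(f"{tok:10s} {w:.3f}")
```

With these toy values most of the weight lands on "computer" and "science", which is the behaviour that lets a decoder copy the right span of the question into the SQL condition.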

Evaluation

I compared generated SQL with ground truth queries.

  • Metric: exact-match accuracy of ~85% on the test set

  • Executed queries against the database to validate correctness

  • Handled edge cases like synonyms (“professor” vs “instructor”)
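The two checks above, string comparison and execution, can disagree, so it helps to compute both. The sketch below is a minimal version under stated assumptions: `normalize`, `exact_match`, and `execution_match` are hypothetical helpers, the normalization only collapses whitespace and drops a trailing semicolon, and the three prediction/gold pairs are invented for the example.

```python
import re
import sqlite3

def normalize(sql):
    """Collapse whitespace and drop a trailing semicolon (a deliberately light normalization)."""
    return re.sub(r"\s+", " ", sql.strip().rstrip(";")).strip()

def exact_match(pred, gold):
    return normalize(pred) == normalize(gold)

def execution_match(conn, pred, gold):
    """True when both queries return the same rows, ignoring row order.

    Ignoring order is a simplification; queries with ORDER BY would need
    an order-sensitive comparison.
    """
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to run counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ada", "Computer Science"), ("Ben", "Mathematics")],
)

# (predicted SQL, ground-truth SQL) pairs, invented for illustration.
pairs = [
    ("SELECT name FROM students WHERE major = 'Computer Science'",
     "SELECT name  FROM students WHERE major = 'Computer Science';"),
    ("SELECT name FROM students WHERE major = 'Computer Science'",
     "SELECT s.name FROM students AS s WHERE s.major = 'Computer Science'"),
    ("SELECT name FROM student",
     "SELECT name FROM students"),
]

em_acc = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
ex_acc = sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
print(f"exact match: {em_acc:.2f}, execution match: {ex_acc:.2f}")
```

The second pair shows why execution checks are useful: the aliased query differs as a string but returns the same rows, so exact match undercounts it.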

Refinement

The system was optimized for usability and transparency.

  • Added a feedback loop: users could edit SQL output if parsing was wrong

  • Provided query explanation in plain English to increase trust

  • Integrated with a simple web UI for demo purposes

Conclusion

The system successfully translated natural language questions into SQL queries with ~85% exact match accuracy. It demonstrated how NLP and database knowledge can be combined to make data more accessible for non-technical users, while still allowing transparency and control for advanced users.
