GitHub - harshit2490/Multilingual-Documents-Parser-Rag-App: This repository implements a Retrieval-Augmented Generation (RAG) application that supports multilingual documents and queries.

Multilingual RAG Demo

demo.mp4

Multilingual RAG — Project Overview and Developer Guide

This repository implements a Retrieval-Augmented Generation (RAG) application that supports multilingual documents and queries. Core components:

Embeddings: BAAI/bge-m3 (shared multilingual vector space)
Vector store: ChromaDB (persistent local store)
LLM: Google Gemini via API (generation)
API: FastAPI endpoints for ingestion and querying
UI: Streamlit app for interactive ingestion and Q&A

This README summarizes the project flow, folder layout, and the role of each file to help contributors and maintainers.

Quickstart

Create and activate a virtual environment:

python -m venv .venv
.venv\Scripts\activate
# On macOS/Linux: source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Copy and edit environment variables:

copy .env.example .env
# On macOS/Linux: cp .env.example .env
# Edit .env and set GEMINI_API_KEY, GEMINI_MODEL, etc.

Run the API server:

uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Or run the Streamlit UI:

streamlit run streamlit_app.py

High-level Architecture & Data Flow

Ingest: files (PDF/TXT/MD) or URLs are uploaded or fetched.
Load & extract text: PDF/TXT/MD loaders extract readable content.
Chunking: text is split into smaller chunks using language-aware separators.
Embeddings: each chunk is encoded with BAAI/bge-m3 into a vector.
Persist: vectors and cleaned metadata are stored in ChromaDB under chroma_db/.
Query: user query -> retriever returns top-K similar chunks.
Generation: retrieved chunks are combined into a prompt and sent to Gemini to produce a final answer.
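
The chunking step above can be illustrated with a simplified sketch. This is not the project's chunk_documents (which uses RecursiveCharacterTextSplitter with language-aware separators); it is a plain sliding-window splitter with a hypothetical function name and toy sizes, meant only to show how CHUNK_SIZE and CHUNK_OVERLAP interact.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of chunk_size characters overlapping by chunk_overlap.

    Hypothetical helper for illustration; the real splitter also respects separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


for chunk in sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

A larger overlap repeats more text between neighbouring chunks, which helps when a sentence straddles a boundary but increases the number of vectors stored.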

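This flow supports cross-language retrieval because BAAI/bge-m3 embeds text from different languages into one shared vector space, so a query in one language can match chunks written in another. Normalizing the embeddings keeps similarity scores comparable across chunks.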
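
To make the retrieval step concrete, here is a minimal sketch of top-K selection by cosine similarity over normalized vectors. The 2-D vectors are made up for illustration (real embeddings have far more dimensions), the function names are hypothetical, and in the project itself the similarity search is delegated to ChromaDB rather than written by hand.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_chunks(query_vec, chunk_vecs, k):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for index, vec in enumerate(chunk_vecs):
        c = normalize(vec)
        score = sum(a * b for a, b in zip(q, c))  # cosine similarity
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": chunk 0 points almost the same way as the query.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]
best = top_k_chunks(query_vector, chunk_vectors, k=2)
print([index for index, _ in best])
```

Because language never enters the comparison, only vector direction, a chunk in one language can outrank a same-language chunk if its meaning is closer to the query.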

Repository Layout

./
├── app/                  # Core backend package (FastAPI + RAG logic)
│   ├── __init__.py
│   ├── config.py         # Settings dataclass and env var defaults
│   ├── embeddings.py     # Embedding model loader (HuggingFace wrapper)
│   ├── ingest.py         # File/URL ingestion, extraction, and chunking
│   ├── vectorstore.py    # ChromaDB wrapper, add_documents, retriever
│   ├── rag.py            # Build prompt, call Gemini, format sources
│   ├── main.py           # FastAPI endpoints: /health, /ingest, /query
│   ├── network.py        # Helper to clear broken local proxy env vars
│   ├── schemas.py        # Pydantic models for requests/responses
│   └── utils.py          # Chunker, validation, metadata cleaning
├── chroma_db/            # Persisted ChromaDB files (created automatically)
├── requirements.txt
├── Dockerfile
├── streamlit_app.py      # Streamlit UI for ingestion and Q&A
└── README.md             # This file

Key files and responsibilities

app/config.py
- Centralized configuration via a Settings dataclass.
- Environment-driven: GEMINI_API_KEY, GEMINI_MODEL, EMBEDDING_MODEL, CHROMA_PERSIST_DIR, CHUNK_SIZE, CHUNK_OVERLAP, TOP_K.
app/embeddings.py
- Wraps the embedding model (HuggingFaceEmbeddings) and normalizes embeddings.
- Caches the model instance and clears a known-broken local proxy before downloads.
app/ingest.py
- Ingests local files (ingest_local_file) and uploaded files (ingest_uploaded_file).
- Fetches webpages (ingest_url) and strips irrelevant tags.
- Uses LangChain loaders (PyPDFLoader, TextLoader) to create Document objects.
- Calls chunk_documents and then add_documents_to_vectorstore.
app/vectorstore.py
- Creates/returns a Chroma collection (persisted under CHROMA_PERSIST_DIR).
- add_documents_to_vectorstore validates and cleans metadata before storing.
- get_retriever(top_k) returns a retriever that fetches the top_k most similar chunks (TOP_K sets the default).
app/rag.py
- Handles retrieval + generation: fetches top documents, formats context, builds prompt.
- Sends generation requests to Gemini using GEMINI_API_KEY and GEMINI_MODEL.
- Returns the answer string and a list of Source objects (content + metadata).
app/main.py
- FastAPI app with endpoints:
  - GET /health — lightweight health check (does not load the embedding model).
  - POST /ingest/file — accepts file uploads and returns number of chunks added.
  - POST /ingest/url — ingest text from a URL.
  - POST /query — accepts a query and optional top_k, returns answer + sources.
app/network.py
- Small utility to clear environment proxy variables that can break model downloads (127.0.0.1:9).
app/schemas.py
- Pydantic models for request validation and response formatting.
app/utils.py
- chunk_documents uses RecursiveCharacterTextSplitter with separators tuned for multi-language text.
- validate_supported_file, ensure_directory, clean_metadata helpers.
streamlit_app.py
- Interactive UI for uploading files, ingesting, and asking questions.
- Shows ingestion progress and displays retrieved sources with metadata.
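
The metadata cleaning mentioned for app/utils.py and app/vectorstore.py is not spelled out in this README. The sketch below shows one plausible shape for it, under the assumption that the vector store accepts only scalar metadata values (str, int, float, bool), that None values are dropped, and that anything else is stringified. The function name is hypothetical and the real clean_metadata may behave differently.

```python
from typing import Any, Dict

# Assumption: only these value types are safe to store as metadata.
ALLOWED_TYPES = (str, int, float, bool)


def clean_metadata_sketch(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata containing only scalar values."""
    cleaned = {}
    for key, value in metadata.items():
        if value is None:
            continue  # drop missing values instead of storing them
        if isinstance(value, ALLOWED_TYPES):
            cleaned[str(key)] = value
        else:
            cleaned[str(key)] = str(value)  # e.g. lists become their string form
    return cleaned


raw = {"source": "a.pdf", "page": 3, "tags": ["x", "y"], "author": None}
print(clean_metadata_sketch(raw))
```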

Configuration (environment variables)

Set values in .env or your environment. Important keys:

GEMINI_API_KEY — required for generation (when using Gemini provider).
GEMINI_MODEL — default: gemini-2.5-flash (adjust as available).
EMBEDDING_MODEL — default: BAAI/bge-m3 (shared multilingual vector space).
CHROMA_PERSIST_DIR — default: ./chroma_db.
CHROMA_COLLECTION — collection name stored in Chroma.
CHUNK_SIZE, CHUNK_OVERLAP — control text splitting (tune for best context length).
TOP_K — default number of chunks to retrieve for the query pipeline.
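
As a rough illustration of how these keys could map onto the Settings dataclass in app/config.py, consider the sketch below. The class and function names are hypothetical, and the numeric defaults for CHUNK_SIZE, CHUNK_OVERLAP, and TOP_K are placeholders chosen for the example, since this README does not state them; only the Gemini model, embedding model, and persist directory defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SettingsSketch:
    gemini_api_key: str
    gemini_model: str
    embedding_model: str
    chroma_persist_dir: str
    chunk_size: int
    chunk_overlap: int
    top_k: int


def load_settings(env: Mapping[str, str] = os.environ) -> SettingsSketch:
    """Build settings from environment variables, falling back to defaults."""
    return SettingsSketch(
        gemini_api_key=env.get("GEMINI_API_KEY", ""),
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        embedding_model=env.get("EMBEDDING_MODEL", "BAAI/bge-m3"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_db"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),      # placeholder default
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),  # placeholder default
        top_k=int(env.get("TOP_K", "4")),                    # placeholder default
    )


settings = load_settings({"TOP_K": "7"})
print(settings.gemini_model, settings.top_k)
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.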

Running and testing

API (development):

uvicorn app.main:app --reload --port 8000
# Open http://localhost:8000/docs for API docs and testing

Streamlit UI:

streamlit run streamlit_app.py
# Open http://localhost:8501

Ingest via curl (file):

curl -X POST http://localhost:8000/ingest/file -F "file=@/path/to/doc.pdf"

Query via curl:

curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query":"What does the Hindi doc say about AI?","top_k":4}'
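
The same query can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names are hypothetical, the base URL assumes the development server from this section, and the response is printed as-is because the exact field names of the answer and sources are not listed in this README.

```python
import json
import urllib.request
from typing import Optional


def build_query_request(
    query: str,
    top_k: Optional[int] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST /query request; top_k is omitted when not given."""
    payload = {"query": query}
    if top_k is not None:
        payload["top_k"] = top_k
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, print(send_query(build_query_request("What does the Hindi doc say about AI?", top_k=4))) would show the returned answer and sources.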

Troubleshooting & Tips

The first run downloads the embedding model, which is large, so expect a delay. A GPU speeds up embedding afterwards, not the download itself.
If Hugging Face downloads fail with a proxy error pointing at 127.0.0.1:9, note that the app already calls clear_dead_local_proxy() to remove the offending proxy environment variables. If the error persists, unset them manually.
If a PDF is scanned (images only) you must OCR before ingesting — the current loaders expect selectable text.
To improve relevance, tune CHUNK_SIZE / CHUNK_OVERLAP depending on document structure.

Docker

Build and run the app with Docker (mount chroma_db to persist vectors):

docker build -t multilingual-rag .
docker run --env-file .env -p 8000:8000 -v "%cd%/chroma_db:/app/chroma_db" multilingual-rag
# On macOS/Linux, replace "%cd%" with "$(pwd)"

Extending the project

Add OCR (e.g., Tesseract or Azure Cognitive Services) before ingestion for scanned PDFs.
Swap the LLM provider (replace Gemini HTTP call with another provider wrapper).
Add authentication around the API for production deployments.

