This repository implements a Retrieval-Augmented Generation (RAG) application that supports multilingual documents and queries. Core components:
- Embeddings: `BAAI/bge-m3` (shared multilingual vector space)
- Vector store: ChromaDB (persistent local store)
- LLM: Google Gemini via API (generation)
- API: FastAPI endpoints for ingestion and querying
- UI: Streamlit app for interactive ingestion and Q&A
This README summarizes the project flow, folder layout, and the role of each file to help contributors and maintainers.
- Create and activate a virtual environment:

      python -m venv .venv
      .venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Copy and edit environment variables:

      copy .env.example .env
      # Edit .env and set GEMINI_API_KEY, GEMINI_MODEL, etc.

- Run the API server:

      uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

- Or run the Streamlit UI:

      streamlit run streamlit_app.py

- Ingest: files (PDF/TXT/MD) or URLs are uploaded or fetched.
- Load & extract text: PDF/TXT/MD loaders extract readable content.
- Chunking: text is split into smaller chunks using language-aware separators.
- Embeddings: each chunk is encoded with `BAAI/bge-m3` into a vector.
- Persist: vectors and cleaned metadata are stored in ChromaDB under `chroma_db/`.
- Query: user query -> retriever returns top-K similar chunks.
- Generation: retrieved chunks are combined into a prompt and sent to Gemini to produce a final answer.
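This flow supports cross-language retrieval because `BAAI/bge-m3` embeds text from different languages into one shared vector space, so a query in one language can match chunks written in another. Normalizing the embeddings keeps similarity scores comparable across chunks.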
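The query and generation steps above can be sketched in plain Python. This is a minimal illustration, not the code in `app/rag.py`: the helper names (`normalize`, `top_k`, `build_prompt`), the prompt wording, and the toy 3-dimensional vectors are all assumptions standing in for real `BAAI/bge-m3` embeddings and the ChromaDB retriever.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, chunks, k):
    """Return the texts of the k chunks most similar to the query vector."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into one prompt (hypothetical wording)."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy vectors: the English and Hindi sentences about AI point in a similar
# direction, mimicking what a shared multilingual embedding space provides.
chunks = [
    ("AI is transforming healthcare.", [0.9, 0.1, 0.0]),
    ("कृत्रिम बुद्धिमत्ता स्वास्थ्य सेवा को बदल रही है।", [0.8, 0.2, 0.1]),
    ("The recipe needs two eggs.", [0.0, 0.1, 0.9]),
]
hits = top_k([1.0, 0.1, 0.0], chunks, k=2)
prompt = build_prompt("What does the Hindi doc say about AI?", hits)
```

In the real pipeline the prompt would then be sent to Gemini; here it is only built so the shape of the context is visible.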
./
├── app/ # Core backend package (FastAPI + RAG logic)
│ ├── __init__.py
│ ├── config.py # Settings dataclass and env var defaults
│ ├── embeddings.py # Embedding model loader (HuggingFace wrapper)
│ ├── ingest.py # File/URL ingestion, extraction, and chunking
│ ├── vectorstore.py # ChromaDB wrapper, add_documents, retriever
│ ├── rag.py # Build prompt, call Gemini, format sources
│ ├── main.py # FastAPI endpoints: /health, /ingest/file, /ingest/url, /query
│ ├── network.py # Helper to clear broken local proxy env vars
│ ├── schemas.py # Pydantic models for requests/responses
│ └── utils.py # Chunker, validation, metadata cleaning
├── chroma_db/ # Persisted ChromaDB files (created automatically)
├── requirements.txt
├── Dockerfile
├── streamlit_app.py # Streamlit UI for ingestion and Q&A
└── README.md # This file
- `app/config.py`
  - Centralized configuration via a `Settings` dataclass.
  - Environment-driven: `GEMINI_API_KEY`, `GEMINI_MODEL`, `EMBEDDING_MODEL`, `CHROMA_PERSIST_DIR`, `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`.
- `app/embeddings.py`
  - Wraps the embedding model (`HuggingFaceEmbeddings`) and normalizes embeddings.
  - Caches the model instance and clears a known-broken local proxy before downloads.
- `app/ingest.py`
  - Ingests local files (`ingest_local_file`) and uploaded files (`ingest_uploaded_file`).
  - Fetches webpages (`ingest_url`) and strips irrelevant tags.
  - Uses `langchain` loaders (`PyPDFLoader`, `TextLoader`) to create `Document` objects.
  - Calls `chunk_documents` and then `add_documents_to_vectorstore`.
- `app/vectorstore.py`
  - Creates/returns a `Chroma` collection (persisted under `CHROMA_PERSIST_DIR`).
  - `add_documents_to_vectorstore` validates and cleans metadata before storing.
  - `get_retriever(top_k)` returns a retriever configured by `TOP_K`.
- `app/rag.py`
  - Handles retrieval + generation: fetches top documents, formats context, builds prompt.
  - Sends generation requests to Gemini using `GEMINI_API_KEY` and `GEMINI_MODEL`.
  - Returns the answer string and a list of `Source` objects (content + metadata).
- `app/main.py`
  - FastAPI app with endpoints:
    - `GET /health`: lightweight health check (does not load the embedding model).
    - `POST /ingest/file`: accepts file uploads and returns the number of chunks added.
    - `POST /ingest/url`: ingests text from a URL.
    - `POST /query`: accepts a `query` and optional `top_k`, returns answer + sources.
- `app/network.py`
  - Small utility to clear environment proxy variables that can break model downloads (`127.0.0.1:9`).
- `app/schemas.py`
  - Pydantic models for request validation and response formatting.
- `app/utils.py`
  - `chunk_documents` uses `RecursiveCharacterTextSplitter` with separators tuned for multi-language text.
  - `validate_supported_file`, `ensure_directory`, `clean_metadata` helpers.
- `streamlit_app.py`
  - Interactive UI for uploading files, ingesting, and asking questions.
  - Shows ingestion progress and displays retrieved sources with metadata.
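As an illustration of the proxy helper described for `app/network.py`, the sketch below shows one way such a function might work. Only the name `clear_dead_local_proxy` and the address `127.0.0.1:9` come from this README; the list of variable names, the `env` parameter, and the return value are assumptions, and the actual implementation may differ.

```python
import os
from urllib.parse import urlsplit

# Assumed set of proxy variables to inspect (both common spellings).
PROXY_VARS = (
    "HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
    "http_proxy", "https_proxy", "all_proxy",
)


def _points_at_dead_proxy(value):
    """True if the proxy URL targets host 127.0.0.1 on port 9."""
    parts = urlsplit(value if "//" in value else "//" + value)
    try:
        return parts.hostname == "127.0.0.1" and parts.port == 9
    except ValueError:  # malformed port in the URL
        return False


def clear_dead_local_proxy(env=None):
    """Remove proxy variables that point at 127.0.0.1:9; return the names removed."""
    env = os.environ if env is None else env
    removed = []
    for name in PROXY_VARS:
        value = env.get(name)
        if value and _points_at_dead_proxy(value):
            del env[name]
            removed.append(name)
    return removed
```

Parsing the URL rather than matching a substring avoids removing a working proxy that merely shares a prefix, such as `127.0.0.1:9000`.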
Set values in `.env` or your environment. Important keys:
- `GEMINI_API_KEY`: required for generation (when using the Gemini provider).
- `GEMINI_MODEL`: default `gemini-2.5-flash` (adjust as available).
- `EMBEDDING_MODEL`: default `BAAI/bge-m3` (shared multilingual vector space).
- `CHROMA_PERSIST_DIR`: default `./chroma_db`.
- `CHROMA_COLLECTION`: collection name stored in Chroma.
- `CHUNK_SIZE`, `CHUNK_OVERLAP`: control text splitting (tune for best context length).
- `TOP_K`: default number of chunks to retrieve for the query pipeline.
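A settings dataclass that reads these keys might look roughly like the sketch below. The field names and the defaults for `CHROMA_COLLECTION`, `CHUNK_SIZE`, `CHUNK_OVERLAP`, and `TOP_K` are assumptions for illustration only (this README does not state them); check `app/config.py` for the real values.

```python
import os
from dataclasses import dataclass, field


def _env_str(name, default):
    """Build a default_factory that reads a string env var at instantiation time."""
    return field(default_factory=lambda: os.getenv(name, default))


def _env_int(name, default):
    """Build a default_factory that reads an integer env var at instantiation time."""
    return field(default_factory=lambda: int(os.getenv(name, str(default))))


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str = _env_str("GEMINI_API_KEY", "")
    gemini_model: str = _env_str("GEMINI_MODEL", "gemini-2.5-flash")
    embedding_model: str = _env_str("EMBEDDING_MODEL", "BAAI/bge-m3")
    chroma_persist_dir: str = _env_str("CHROMA_PERSIST_DIR", "./chroma_db")
    chroma_collection: str = _env_str("CHROMA_COLLECTION", "documents")  # assumed default
    chunk_size: int = _env_int("CHUNK_SIZE", 1000)  # assumed default
    chunk_overlap: int = _env_int("CHUNK_OVERLAP", 200)  # assumed default
    top_k: int = _env_int("TOP_K", 4)  # assumed default
```

Reading the environment in a `default_factory` means each `Settings()` call picks up the current values, which is convenient in tests that override a single variable.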
- API (development):

      uvicorn app.main:app --reload --port 8000
      # Open http://localhost:8000/docs for API docs and testing

- Streamlit UI:

      streamlit run streamlit_app.py
      # Open http://localhost:8501

- Ingest via curl (file):

      curl -X POST http://localhost:8000/ingest/file -F "file=@/path/to/doc.pdf"

- Query via curl:

      curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query":"What does the Hindi doc say about AI?","top_k":4}'

- First run may download large embedding models; be patient or run on a machine with a GPU.
- If Hugging Face downloads fail due to a local proxy error (`127.0.0.1:9`), the app calls `clear_dead_local_proxy()` to remove those env vars. Remove them manually if needed.
- If a PDF is scanned (images only), you must OCR it before ingesting; the current loaders expect selectable text.
- To improve relevance, tune `CHUNK_SIZE`/`CHUNK_OVERLAP` depending on document structure.
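To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a simplified fixed-width sliding-window splitter. It is not the `RecursiveCharacterTextSplitter` the project uses (which also respects separators); the function name `chunk_text` is hypothetical and the sketch only shows how overlap repeats text across neighbouring chunks.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


no_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

A larger overlap yields more chunks and more repeated context, which can help when answers straddle chunk boundaries but increases storage and retrieval redundancy.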
Build and run the app with Docker (mount `chroma_db` to persist vectors):

    docker build -t multilingual-rag .
    docker run --env-file .env -p 8000:8000 -v "%cd%/chroma_db:/app/chroma_db" multilingual-rag

- Add OCR (e.g., Tesseract or Azure Cognitive Services) before ingestion for scanned PDFs.
- Swap the LLM provider (replace Gemini HTTP call with another provider wrapper).
- Add authentication around the API for production deployments.
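One way to make the LLM provider swappable is to have the generation step depend on a small interface rather than on Gemini directly. The sketch below is only a suggestion: `LLMProvider`, `EchoProvider`, and `answer` are hypothetical names that do not exist in this repository, and the prompt format is an assumption.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Anything with a generate(prompt) -> str method can act as a provider."""

    def generate(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider for local testing; echoes the start of the prompt."""

    def generate(self, prompt: str) -> str:
        return f"(echo) {prompt[:40]}"


def answer(question: str, context: str, llm: LLMProvider) -> str:
    """Build a prompt from retrieved context and delegate generation to the provider."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)


reply = answer(
    "What is RAG?",
    "RAG combines retrieval with generation.",
    EchoProvider(),
)
```

A Gemini-backed class exposing the same `generate` method could then be passed in without touching the retrieval code.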


