harshit2490/Multilingual-Documents-Parser-Rag-App

Multilingual RAG Demo

demo.mp4


Multilingual RAG — Project Overview and Developer Guide

This repository implements a Retrieval-Augmented Generation (RAG) application that supports multilingual documents and queries. Core components:

  • Embeddings: BAAI/bge-m3 (shared multilingual vector space)
  • Vector store: ChromaDB (persistent local store)
  • LLM: Google Gemini via API (generation)
  • API: FastAPI endpoints for ingestion and querying
  • UI: Streamlit app for interactive ingestion and Q&A

This README summarizes the project flow, folder layout, and the role of each file to help contributors and maintainers.

Quickstart

  1. Create and activate a virtual environment (Windows syntax shown; on macOS/Linux, activate with source .venv/bin/activate):
     python -m venv .venv
     .venv\Scripts\activate
  2. Install dependencies:
     pip install -r requirements.txt
  3. Copy and edit environment variables (on macOS/Linux, use cp instead of copy):
     copy .env.example .env
     # Edit .env and set GEMINI_API_KEY, GEMINI_MODEL, etc.
  4. Run the API server:
     uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
  5. Or run the Streamlit UI:
     streamlit run streamlit_app.py

High-level Architecture & Data Flow

  1. Ingest: files (PDF/TXT/MD) or URLs are uploaded or fetched.
  2. Load & extract text: PDF/TXT/MD loaders extract readable content.
  3. Chunking: text is split into smaller chunks using language-aware separators.
  4. Embeddings: each chunk is encoded with BAAI/bge-m3 into a vector.
  5. Persist: vectors and cleaned metadata are stored in ChromaDB under chroma_db/.
  6. Query: user query -> retriever returns top-K similar chunks.
  7. Generation: retrieved chunks are combined into a prompt and sent to Gemini to produce a final answer.

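This flow supports cross-language retrieval because BAAI/bge-m3 maps text from different languages into one shared vector space, so a query in one language can match chunks written in another. Normalizing the embeddings keeps similarity scores comparable across chunks.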
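
Steps 6 and 7 can be illustrated with a small, dependency-free sketch. It is not the code in app/rag.py or app/vectorstore.py: the function names, the prompt wording, and the toy 3-dimensional vectors (standing in for bge-m3 embeddings) are all assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question, context_chunks):
    """Join retrieved chunks into a numbered context block ahead of the question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical data: in the real app the texts could be in any language
# and the vectors would come from the embedding model.
chunks = [
    ("Hindi chunk about AI policy", [0.9, 0.1, 0.0]),
    ("Chunk about cooking recipes", [0.0, 0.2, 0.9]),
    ("English chunk about AI policy", [0.8, 0.3, 0.1]),
]
hits = top_k([1.0, 0.2, 0.0], chunks, k=2)
prompt = build_prompt("What does the doc say about AI?", hits)
```

Because the two AI-related vectors point in nearly the same direction as the query, they rank above the unrelated chunk regardless of the language of the underlying text.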

Repository Layout

./
├── app/                  # Core backend package (FastAPI + RAG logic)
│   ├── __init__.py
│   ├── config.py         # Settings dataclass and env var defaults
│   ├── embeddings.py     # Embedding model loader (HuggingFace wrapper)
│   ├── ingest.py         # File/URL ingestion, extraction, and chunking
│   ├── vectorstore.py    # ChromaDB wrapper, add_documents, retriever
│   ├── rag.py            # Build prompt, call Gemini, format sources
│   ├── main.py           # FastAPI endpoints: /health, /ingest, /query
│   ├── network.py        # Helper to clear broken local proxy env vars
│   ├── schemas.py        # Pydantic models for requests/responses
│   └── utils.py          # Chunker, validation, metadata cleaning
├── chroma_db/            # Persisted ChromaDB files (created automatically)
├── requirements.txt
├── Dockerfile
├── streamlit_app.py      # Streamlit UI for ingestion and Q&A
└── README.md             # This file

Key files and responsibilities

  • app/config.py

    • Centralized configuration via a Settings dataclass.
    • Environment-driven: GEMINI_API_KEY, GEMINI_MODEL, EMBEDDING_MODEL, CHROMA_PERSIST_DIR, CHUNK_SIZE, CHUNK_OVERLAP, TOP_K.
  • app/embeddings.py

    • Wraps the embedding model (HuggingFaceEmbeddings) and normalizes embeddings.
    • Caches the model instance and clears a known-broken local proxy before downloads.
  • app/ingest.py

    • Ingests local files (ingest_local_file) and uploaded files (ingest_uploaded_file).
    • Fetches webpages (ingest_url) and strips irrelevant tags.
    • Uses LangChain loaders (PyPDFLoader, TextLoader) to create Document objects.
    • Calls chunk_documents and then add_documents_to_vectorstore.
  • app/vectorstore.py

    • Creates/returns a Chroma collection (persisted under CHROMA_PERSIST_DIR).
    • add_documents_to_vectorstore validates and cleans metadata before storing.
    • get_retriever(top_k) returns a retriever that fetches the top_k most similar chunks; TOP_K supplies the default.
  • app/rag.py

    • Handles retrieval + generation: fetches top documents, formats context, builds prompt.
    • Sends generation requests to Gemini using GEMINI_API_KEY and GEMINI_MODEL.
    • Returns the answer string and a list of Source objects (content + metadata).
  • app/main.py

    • FastAPI app with endpoints:
      • GET /health — lightweight health check (does not load the embedding model).
      • POST /ingest/file — accepts file uploads and returns number of chunks added.
      • POST /ingest/url — ingest text from a URL.
      • POST /query — accepts a query and optional top_k, returns answer + sources.
  • app/network.py

    • Small utility to clear environment proxy variables that can break model downloads (127.0.0.1:9).
  • app/schemas.py

    • Pydantic models for request validation and response formatting.
  • app/utils.py

    • chunk_documents uses RecursiveCharacterTextSplitter with separators tuned for multi-language text.
    • validate_supported_file, ensure_directory, clean_metadata helpers.
  • streamlit_app.py

    • Interactive UI for uploading files, ingesting, and asking questions.
    • Shows ingestion progress and displays retrieved sources with metadata.
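
To make the chunking behaviour described for app/utils.py concrete, here is a simplified sketch of separator-aware splitting with overlap. It is not LangChain's RecursiveCharacterTextSplitter and not the project's chunk_documents: the function name chunk_text, the separator list (including the Devanagari danda), and the sizes are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=50, overlap=10,
               separators=("\n\n", "\n", "\u0964 ", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Each cut prefers the coarsest separator found in the second half of the
    window, and consecutive chunks share roughly `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Try separators from coarsest (paragraph) to finest (space).
            for sep in separators:
                cut = text.rfind(sep, start + chunk_size // 2, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        # Step back by `overlap`, but always move forward at least one character.
        start = max(end - overlap, start + 1)
    return chunks


sample = "First sentence here. Second sentence follows. Third one ends."
pieces = chunk_text(sample, chunk_size=50, overlap=10)
```

With these settings the first chunk ends at a sentence boundary, and the second chunk repeats the tail of the first, which is the effect CHUNK_OVERLAP is meant to produce.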

Configuration (environment variables)

Set values in .env or your environment. Important keys:

  • GEMINI_API_KEY — required for generation (when using Gemini provider).
  • GEMINI_MODEL — default: gemini-2.5-flash (adjust as available).
  • EMBEDDING_MODEL — default: BAAI/bge-m3 (shared multilingual vector space).
  • CHROMA_PERSIST_DIR — default: ./chroma_db.
  • CHROMA_COLLECTION — collection name stored in Chroma.
  • CHUNK_SIZE, CHUNK_OVERLAP — control text splitting (tune for best context length).
  • TOP_K — default number of chunks to retrieve for the query pipeline.
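
A minimal sketch of how an environment-driven Settings dataclass might look. The actual app/config.py may differ: the field names and the load_settings helper are hypothetical, and the numeric defaults for CHUNK_SIZE, CHUNK_OVERLAP, and TOP_K are placeholders because this README does not state them. CHROMA_COLLECTION is omitted for the same reason.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    gemini_model: str
    embedding_model: str
    chroma_persist_dir: str
    chunk_size: int
    chunk_overlap: int
    top_k: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    return Settings(
        gemini_api_key=env.get("GEMINI_API_KEY", ""),
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        embedding_model=env.get("EMBEDDING_MODEL", "BAAI/bge-m3"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_db"),
        chunk_size=int(env.get("CHUNK_SIZE", "800")),        # placeholder default (assumption)
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "100")),  # placeholder default (assumption)
        top_k=int(env.get("TOP_K", "4")),                    # placeholder default (assumption)
    )


# Passing a plain dict makes the loader easy to exercise without touching os.environ.
settings = load_settings({"TOP_K": "7"})
```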

Running and testing

  • API (development):
    uvicorn app.main:app --reload --port 8000
    # Open http://localhost:8000/docs for API docs and testing
  • Streamlit UI:
    streamlit run streamlit_app.py
    # Open http://localhost:8501
  • Ingest via curl (file):
    curl -X POST http://localhost:8000/ingest/file -F "file=@/path/to/doc.pdf"
  • Query via curl:
    curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query":"What does the Hindi doc say about AI?","top_k":4}'
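
The same query can be sent from Python using only the standard library. This is a sketch, assuming the API is running locally on port 8000 as above; build_query_request and ask are hypothetical helper names, and the response is returned as parsed JSON without assuming its exact field names.

```python
import json
import urllib.request


def build_query_request(query, top_k=None, base_url="http://localhost:8000"):
    """Build a POST request for /query; top_k is only sent when provided."""
    payload = {"query": query}
    if top_k is not None:
        payload["top_k"] = top_k
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, top_k=None):
    """Send the query to a running server and return the decoded JSON body."""
    request = build_query_request(query, top_k)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Building the request does not need a running server; calling ask() does.
request = build_query_request("What does the Hindi doc say about AI?", top_k=4)
```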

Troubleshooting & Tips

  • The first run downloads the embedding model, which is large, so expect a wait. A GPU speeds up embedding but not the download itself.
  • If HF downloads fail due to a local proxy error (127.0.0.1:9), the app calls clear_dead_local_proxy() to remove those env vars. Remove them manually if needed.
  • If a PDF is scanned (images only) you must OCR before ingesting — the current loaders expect selectable text.
  • To improve relevance, tune CHUNK_SIZE / CHUNK_OVERLAP depending on document structure.
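
For reference, the proxy cleanup mentioned above could look roughly like the sketch below. The function name matches the one this README mentions, but the body is an assumption, not the code in app/network.py; the list of variable names and the matching rule are illustrative.

```python
import os

# Proxy variables commonly read by HTTP clients (assumed list).
PROXY_VARS = (
    "HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
    "http_proxy", "https_proxy", "all_proxy",
)
DEAD_PROXY = "127.0.0.1:9"


def clear_dead_local_proxy(env=None):
    """Remove proxy variables that point at 127.0.0.1:9 and return their names."""
    env = os.environ if env is None else env
    removed = []
    for name in PROXY_VARS:
        value = env.get(name, "")
        # Compare the end of the value so that ports such as 9000 are left alone.
        if value.rstrip("/").endswith(DEAD_PROXY):
            del env[name]
            removed.append(name)
    return removed


# Demonstration on a plain dict rather than the real environment.
demo_env = {
    "HTTP_PROXY": "http://127.0.0.1:9",
    "HTTPS_PROXY": "http://proxy.example:8080",
}
removed = clear_dead_local_proxy(demo_env)
```

Only the variable pointing at the dead local port is removed; a working proxy setting is left untouched.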

Docker

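Build and run the app with Docker, mounting chroma_db so vectors persist across container restarts. The volume path below uses Windows cmd syntax (%cd%); on macOS/Linux, use "$(pwd)/chroma_db:/app/chroma_db" instead: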

docker build -t multilingual-rag .
docker run --env-file .env -p 8000:8000 -v "%cd%/chroma_db:/app/chroma_db" multilingual-rag

Extending the project

  • Add OCR (e.g., Tesseract or Azure Cognitive Services) before ingestion for scanned PDFs.
  • Swap the LLM provider (replace Gemini HTTP call with another provider wrapper).
  • Add authentication around the API for production deployments.
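
One way to make the LLM provider swappable is to have the generation step depend on a small interface rather than on Gemini directly. The sketch below is hypothetical: LLMProvider, EchoProvider, and answer are illustrative names, not part of this repository, and the prompt format is an assumption.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Anything with a generate(prompt) -> str method can serve as a provider."""

    def generate(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider for local tests; it returns the last line of the prompt."""

    def generate(self, prompt: str) -> str:
        return "echo: " + prompt.splitlines()[-1]


def answer(question: str, context: list[str], provider: LLMProvider) -> str:
    """Build a prompt from retrieved context and delegate generation to the provider."""
    prompt = "Context:\n" + "\n".join(context) + "\nQuestion: " + question
    return provider.generate(prompt)


result = answer(
    "What is RAG?",
    ["RAG combines retrieval with generation."],
    EchoProvider(),
)
```

A Gemini-backed class, or one for another vendor, would implement the same generate method, so the retrieval and prompt-building code would not need to change.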
