BMO

Be More

Ollama Powered
Python Based
Documentation in SolidJS

1. Introduction

This project implements BMO, a fully voice-controlled personal AI assistant designed to operate primarily on a local system, with cloud integrations used only where necessary. The assistant continuously listens to the user’s voice, converts speech to text, interprets commands, generates responses that are spoken back via text-to-speech, executes system or web-based actions, and optionally logs conversations to a database.

BMO is designed to support the following core functionalities:

  • Real-time voice interaction
  • Local Large Language Model (LLM) inference
  • Custom voice command creation
  • System and application automation
  • Persistent conversation storage

The overall architecture follows a modular and event-driven design, optimized for real-time responsiveness, extensibility, and efficient resource usage. This structure allows the assistant to remain responsive while handling concurrent audio processing, command execution, and background tasks.

2. Project Expectation

The primary goal of the BMO project is to deliver a hands-free, voice-controlled AI assistant that operates efficiently on local hardware while maintaining natural, high-quality user interactions. The system is designed to minimize reliance on cloud-based Large Language Models, thereby ensuring reduced latency, improved performance, and enhanced data privacy.

The key expectations of this project include:

  • Hands-free interaction using voice commands
  • Low-latency local inference without relying on cloud LLMs
  • Extensibility, allowing new commands to be created dynamically
  • Persistent memory for storing conversations and custom commands
  • Natural interaction using high-quality text-to-speech responses
  • System-level control such as opening applications and websites

These expectations are fulfilled through the integration of offline speech recognition, local LLM inference, database persistence, and operating system–level automation into a unified and efficient workflow.

3. Project Implementation

The BMO AI assistant is implemented using a Python-based backend powered by a locally hosted language model optimized for on-device performance. The system follows a modular design that enables real-time voice processing, intelligent response generation, and seamless execution of system-level actions.

Environment Configuration

Environment variables are loaded using dotenv to securely manage sensitive credentials and system configuration. The application validates every critical value at startup and exits gracefully if any mandatory variable is missing, ensuring predictable runtime behavior. The following values are required:

  • MongoDB connection URI
  • Vosk speech recognition model path
  • ElevenLabs API key for text-to-speech
  • Chrome executable path for browser automation
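
A minimal sketch of this startup validation, assuming python-dotenv and using illustrative variable names (the project's actual keys may differ):

```python
import os
import sys
from dotenv import load_dotenv

# Load variables from a local .env file into the process environment.
load_dotenv()

# Illustrative variable names; the real configuration keys may differ.
REQUIRED_VARS = ["MONGODB_URI", "VOSK_MODEL_PATH", "ELEVENLABS_API_KEY", "CHROME_PATH"]

missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    # Exit gracefully instead of failing later at an unpredictable point.
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")

MONGODB_URI = os.getenv("MONGODB_URI")
VOSK_MODEL_PATH = os.getenv("VOSK_MODEL_PATH")
```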

Voice Input Pipeline

The voice input pipeline is responsible for continuous microphone monitoring and real-time speech recognition. Audio input is captured using sounddevice and streamed in small chunks for processing.

Each audio chunk is placed into a thread-safe queue, where it is processed by the Vosk speech recognition engine. The recognized text is then forwarded to a separate queue for command interpretation and response generation. This decoupled architecture prevents audio capture from being blocked by downstream processing.
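
A condensed sketch of the capture-and-recognize loop described above, assuming a 16 kHz Vosk model; the model path, sample rate, and chunk size are illustrative:

```python
import json
import queue
import sounddevice as sd
from vosk import Model, KaldiRecognizer

audio_queue = queue.Queue()   # raw audio chunks from the microphone
text_queue = queue.Queue()    # recognized utterances for the command layer

def audio_callback(indata, frames, time_info, status):
    # Called by sounddevice for every captured block; never block here.
    audio_queue.put(bytes(indata))

model = Model("models/vosk-model-small-en-us-0.15")  # illustrative path
recognizer = KaldiRecognizer(model, 16000)

with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                       channels=1, callback=audio_callback):
    while True:
        chunk = audio_queue.get()
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                text_queue.put(text)  # hand off to command interpretation
```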

Concurrency Model

The system employs a multi-threaded concurrency model to maintain responsiveness during continuous operation. A dedicated background thread is responsible for audio listening and speech recognition, while the main thread processes recognized text and executes commands.

Python’s queue.Queue is used to ensure safe and reliable communication between threads. This design allows the assistant to handle long-running tasks, such as LLM inference or system automation, without interrupting real-time voice input.
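
The thread layout can be sketched as follows; listen_loop and handle_text are illustrative stand-ins for the capture/recognition loop above and the command layer:

```python
import queue
import threading

text_queue = queue.Queue()

def listen_loop(out_queue):
    # Placeholder for the sounddevice/Vosk loop shown earlier;
    # it pushes each recognized utterance onto out_queue.
    ...

def handle_text(text):
    # Placeholder for command matching, LLM fallback, and TTS playback.
    ...

# A daemon thread keeps listening in the background without blocking shutdown.
listener = threading.Thread(target=listen_loop, args=(text_queue,), daemon=True)
listener.start()

# The main thread consumes recognized text and executes commands.
while True:
    utterance = text_queue.get()   # blocks until speech is recognized
    handle_text(utterance)
```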

4. Ollama LLM

Ollama serves as the core engine responsible for all local language model inference within the BMO system. It provides a dedicated runtime designed to load, manage, and optimize large language models directly on the host machine. This enables the assistant to perform natural language processing entirely offline, ensuring data privacy and independence from cloud-based services.

LLM Integration

The Python backend communicates with Ollama using a lightweight subprocess-based execution model. User prompts are passed directly to the bmo language model, and generated responses are captured from standard output in real time. Timeouts and structured error handling mechanisms are implemented to prevent blocking behavior and ensure overall system stability.
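
A minimal sketch of such a subprocess call, assuming the ollama CLI is on the PATH and a local model named bmo; the timeout and fallback messages are illustrative:

```python
import subprocess

def ask_llm(prompt: str, timeout: int = 60) -> str:
    """Send a prompt to the local Ollama model and return its reply."""
    try:
        result = subprocess.run(
            ["ollama", "run", "bmo", prompt],   # model name taken from the project
            capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode != 0:
            return "Sorry, the local model returned an error."
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        # Prevent a hung inference call from blocking the voice loop.
        return "Sorry, that took too long to answer."
```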

All responses generated by the Ollama runtime are processed entirely offline. When no predefined or custom command matches a user request, the assistant automatically falls back to the local LLM to generate an intelligent, context-aware response.

Key Advantages

  • Complete data privacy through local inference
  • Zero dependency on cloud-based language models
  • Fast response times with minimal latency
  • Predictable performance and hardware-level control

Ollama also manages essential model lifecycle operations such as model loading, unloading, caching, and request handling. These optimizations reduce inference latency and memory overhead, allowing continuous and responsive interaction between the speech recognition pipeline and the response generation layer. This architecture enables future model upgrades or replacements without requiring changes to higher-level application logic.

5. Python Libraries

The backend of BMO is built using a carefully selected set of Python libraries, each responsible for a specific subsystem within the assistant. This modular selection ensures clean separation of concerns, maintainability, and reliable real-time performance across audio processing, inference, automation, and data persistence.

The speech recognition pipeline relies on offline-capable libraries that enable continuous voice input processing without network dependency. These libraries handle microphone capture, audio buffering, and real-time speech-to-text conversion.
Libraries: Vosk (offline speech recognition), sounddevice (real-time microphone input)

Spoken responses are generated using a high-quality text-to-speech engine that converts assistant output into natural-sounding voice audio. The integration ensures low-latency playback and smooth user interaction.
Libraries: ElevenLabs (text-to-speech synthesis)
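
One possible integration path is the public ElevenLabs REST endpoint, sketched below; the voice ID, model ID, and file-based playback are placeholders, and the project may instead use the official Python SDK:

```python
import os
import requests

def speak(text: str) -> None:
    """Fetch synthesized speech for `text` and save it for playback."""
    voice_id = "YOUR_VOICE_ID"  # placeholder; choose a voice in the ElevenLabs dashboard
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=30,
    )
    response.raise_for_status()
    with open("response.mp3", "wb") as f:
        f.write(response.content)   # play back with any audio player or library
```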

Local language model inference is executed through a controlled process invocation layer. This enables the assistant to communicate with the Ollama runtime, pass user prompts, and retrieve generated responses in a safe and non-blocking manner.
Libraries: subprocess (local LLM execution)

System-level automation is achieved through libraries that allow programmatic control over keyboard input, application execution, and browser interaction. These capabilities enable the assistant to perform real-world actions in response to voice commands.
Libraries: PyAutoGUI (keyboard and OS automation), webbrowser (browser control)
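
A short sketch of voice-triggered automation with these libraries; the phrases and target applications are illustrative, and the Notepad example assumes a Windows host:

```python
import subprocess
import webbrowser
import pyautogui

def run_action(command: str) -> None:
    # Illustrative phrase matching; BMO's real command table is data-driven.
    if "open youtube" in command:
        webbrowser.open("https://www.youtube.com")
    elif "take screenshot" in command:
        pyautogui.screenshot("screenshot.png")
    elif "open notepad" in command:
        subprocess.Popen(["notepad.exe"])   # assumes a Windows host
    elif "volume up" in command:
        pyautogui.press("volumeup")         # simulate a media key press
```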

Persistent storage is handled through a database client that enables structured logging of conversations and custom commands. This ensures long-term memory and reliable retrieval of interaction history.
Libraries: PyMongo (MongoDB client)
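
A minimal persistence sketch with PyMongo; the database and collection names are assumptions:

```python
import os
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
db = client["bmo"]                      # illustrative database name

def log_exchange(request: str, response: str) -> None:
    """Store a single request-response pair with a timestamp."""
    db["conversations"].insert_one({
        "request": request,
        "response": response,
        "timestamp": datetime.now(timezone.utc),
    })
```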

Concurrency and internal communication are managed using thread-safe data structures and execution models. These libraries allow the system to process audio input, LLM inference, and automation tasks concurrently without blocking the main execution flow.
Libraries: threading, queue (standard library)

6. Contributions

This project welcomes contributions from developers interested in advancing conversational AI systems. Areas for contribution include model fine-tuning, extending the voice input/output pipeline, enhancing the UI/UX design, optimizing inference performance, and adding support for additional language models. Contributors can also help improve documentation, write tests, and suggest architectural improvements. All contributions should follow the project's coding standards and include appropriate documentation and test coverage.

7. Chat Section Explanation

The chat handling mechanism in BMO is designed to manage structured conversations between the user and the assistant in a reliable and persistent manner. Each interaction is captured as a request–response exchange, enabling the system to maintain conversational context and store meaningful interaction history.

During runtime, conversations are first stored in an in-memory conversation buffer. Each exchange consists of the user’s spoken request, the assistant’s generated response, and a corresponding timestamp. This buffering approach allows the assistant to operate efficiently without performing frequent database writes.

When explicitly instructed by the user, the buffered conversation data is uploaded to a MongoDB database for long-term storage. Each record is stored in a structured format, ensuring consistency and enabling future retrieval, review, or analysis of past interactions.
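
A sketch of this buffer-then-upload pattern; each record mirrors the structure described above, and the collection name is an assumption:

```python
from datetime import datetime, timezone

conversation_buffer = []   # in-memory exchanges for the current session

def remember(request: str, response: str) -> None:
    """Append one exchange to the in-memory buffer."""
    conversation_buffer.append({
        "request": request,
        "response": response,
        "timestamp": datetime.now(timezone.utc),
    })

def upload_conversation(collection) -> None:
    """Flush the buffered exchanges to MongoDB when the user asks for it."""
    if conversation_buffer:
        collection.insert_many(conversation_buffer)   # e.g. db["conversations"]
        conversation_buffer.clear()
```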

In addition to standard conversations, the chat system supports dynamic custom command creation. Users can define new request–response pairs during runtime, which are immediately stored in the database and loaded into memory. This allows the assistant to expand its capabilities over time without requiring code modifications or restarts.
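
A sketch of runtime command creation; the collection name and the in-memory mapping are illustrative:

```python
custom_commands = {}   # request phrase -> stored response, loaded at startup

def add_custom_command(phrase: str, reply: str, collection) -> None:
    """Persist a new request-response pair and make it usable immediately."""
    collection.insert_one({"request": phrase, "response": reply})  # e.g. db["commands"]
    custom_commands[phrase.lower()] = reply   # available without a restart

def match_custom_command(text: str):
    """Return a stored reply if the spoken text matches a custom command."""
    return custom_commands.get(text.lower())
```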

This chat architecture balances responsiveness, reliability, and extensibility. By separating real-time interaction from persistent storage, the system ensures smooth conversational flow while maintaining a scalable and maintainable backend design.