Private AI for Security & Compliance | Deploy Ollama in Your Xano Instance

If you're building AI-powered applications but want to keep your data private and reduce costs, deploying Ollama directly inside your Xano instance is the answer. Instead of sending data to external providers like OpenAI or Anthropic, you can run large language models entirely within your own infrastructure. Here's how to set it all up.

Setting Up a Persistent Volume

Start by navigating to your instance settings and opening the Microservices panel. The first thing you'll configure is a persistent volume — this is where your downloaded models will be stored. Naming it ollama helps keep things organized. Set the size to 100GB to give yourself room to load multiple models, and keep the storage class as SSD. This persistent volume ensures your models survive any instance restarts.

Creating the Ollama Deployment

Next, head to the Deployments section and add a new deployment. Name it ollama, set replicas to 1, and use the public Docker image ollama/ollama. You'll set the container port to 11434, which is Ollama's default.

Attach the persistent volume you just created and set the mount path to /root/.ollama/models so downloaded models save to the right location.

For resources, configure the CPU with a minimum of 500m (half a core) and a maximum of 2000m (two cores), and set RAM with a minimum of 4,096 MB and a maximum of 8,192 MB. Once everything looks good, click Add, then Update and Deploy.
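Once the deployment is running, it's worth confirming the container is reachable before pulling any models. A minimal check, assuming the deployment is addressable at the hostname ollama (substitute the internal host Xano shows for your deployment):

```shell
# Quick health check against the Ollama container.
# The hostname "ollama" is an assumption -- use your deployment's
# actual internal address.
curl http://ollama:11434/
# A healthy instance responds with the plain text: Ollama is running

# The version endpoint returns JSON, handy for debugging:
curl http://ollama:11434/api/version
```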

Pulling a Model with a Microservice Function

Once your microservice is provisioned, navigate to a function or endpoint in Xano. Add a Microservice function from the function stack — it works much like an external API request. Set the host to your ollama deployment's internal address on port 11434, then import the provided curl command to pull your chosen model (in this example, the Phi-3 Mini model). Set the timeout to 60 seconds and run it. A 200 status response confirms the model pulled successfully.
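The imported request hits Ollama's pull endpoint. A sketch of what that curl command looks like, assuming the hostname ollama and the model tag phi3:mini (match whatever tag you actually want to pull):

```shell
# Pull the Phi-3 Mini model through Ollama's REST API.
# Hostname "ollama" and tag "phi3:mini" are assumptions for this example.
# "stream": false returns a single JSON status object instead of
# streaming progress updates, which is simpler to handle in Xano.
curl http://ollama:11434/api/pull \
  -d '{"model": "phi3:mini", "stream": false}'
```

The first pull downloads the model weights into the persistent volume, so it can take a while; subsequent pulls of the same tag return almost immediately.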

Sending Prompts to Your Local Model

With the model ready, clone your microservice function and overwrite it with the curl command for sending chat prompts. You can customize the prompt to anything you like — for example, "Explain why the sky is blue." Because local inference is slower than a cloud API, bump the timeout up to 300 seconds so longer responses have room to complete.
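The chat request follows the same shape. A sketch, again assuming the hostname ollama and the tag phi3:mini:

```shell
# Send a chat prompt to the locally hosted model.
# Hostname "ollama" and model tag "phi3:mini" are assumptions --
# use your deployment's host and the tag you pulled.
curl http://ollama:11434/api/chat \
  -d '{
    "model": "phi3:mini",
    "messages": [
      {"role": "user", "content": "Explain why the sky is blue."}
    ],
    "stream": false
  }'
# With "stream": false, the reply is a single JSON object whose
# message.content field holds the model's full answer.
```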

After running, you'll receive a full response generated entirely within your Xano instance — no data leaves your environment.

Why This Matters

Running models locally gives you complete control over your data, which is critical for compliance-sensitive applications. It also eliminates per-token API costs that can add up quickly at scale. With Xano's microservices support, the whole setup requires no server management — just configuration through the UI.
