Run your favorite Gen AI model with control and privacy
Oracle Cloud Infrastructure (OCI) Compute provides industry-leading scalability and performance for companies of all sizes. OCI Compute bare metal (BM) and virtual machine (VM) instances are accelerated by NVIDIA GPUs for mainstream graphics, AI inference, AI training, digital twins, and HPC.
Today, many organizations find generative AI offerings intrusive or rigid: intrusive because these applications often run through third-party APIs with opaque privacy policies, and rigid because they lock you into a particular large language model (LLM) with no flexibility to optimize performance.
From a single GPU VM to a zettascale supercluster, OCI lets you choose your LLM and get the performance you require while keeping total control of your data for use cases such as fine-tuning, retrieval-augmented generation (RAG), and AI agent development. To find an updated list of OCI GPU instances (BM and VM), follow this link.
Below is an illustrative test, with insights summarized at the end of this post.
Architecture deployed
A simple solution was deployed in which the user sends the same prompt to different LLMs running on different hardware. The compute instances run different LLMs (Mistral, Phi-4, Gemma 3, Llama 3.3), with Ollama serving the models over the network. The same prompt was used in all cases, with ten iterations per model. The metric reported is the "eval rate" from Ollama's verbose option.
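For reference, a test like this can be reproduced with a short script against Ollama's REST API. The following is a minimal sketch, assuming each instance exposes Ollama on its default port (11434); the host IPs, model tags, and prompt are illustrative placeholders. The eval rate is derived from the `eval_count` and `eval_duration` fields Ollama returns, the same numbers behind the CLI's `--verbose` output.

```python
import requests  # third-party: pip install requests

# Hypothetical endpoints: one Ollama server per compute shape.
HOSTS = {
    "VM.GPU.A10.2": "http://10.0.0.11:11434",
    "BM.GPU.H100.8": "http://10.0.0.12:11434",
}
MODELS = ["mistral:7b", "phi4:14b", "gemma3:27b", "llama3.3:70b"]
PROMPT = "Summarize the benefits of retrieval-augmented generation."  # illustrative
ITERATIONS = 10

for shape, base_url in HOSTS.items():
    for model in MODELS:
        rates = []
        for _ in range(ITERATIONS):
            resp = requests.post(
                f"{base_url}/api/generate",
                json={"model": model, "prompt": PROMPT, "stream": False},
                timeout=600,
            )
            body = resp.json()
            # Ollama reports eval_count (generated tokens) and
            # eval_duration (nanoseconds): eval rate = tokens per second.
            rates.append(body["eval_count"] / body["eval_duration"] * 1e9)
        print(f"{shape:>15} {model:>12}: {sum(rates) / len(rates):6.2f} tokens/s")
```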
Performance results
All values are Ollama eval rates in tokens per second.

| Architecture | Mistral 7B | Phi-4 14B | Gemma 3 12B | Gemma 3 27B | Llama 3.3 70B |
| --- | --- | --- | --- | --- | --- |
| No GPU infrastructure | 16.42 | 8.40 | 9.49 | 4.77 | 1.87 |
| VM.GPU.A10.1 (Ampere, 1x NVIDIA A10 GPU) | 90.12 | 43.07 | 45.24 | 22.89 | 3.40 |
| VM.GPU.A10.2 (Ampere, 2x NVIDIA A10 GPU) | 91.70 | 43.71 | 46.16 | 23.18 | 10.07 |
| BM.GPU.L40S.4 (Ada Lovelace, 4x NVIDIA L40S GPU) | 129.62 | 62.74 | 66.03 | 34.00 | 15.26 |
| BM.GPU.H100.8 (Hopper, 8x NVIDIA H100 GPU) | 212.22 | 114.19 | 91.07 | 56.55 | 35.09 |
These tests are not intended to benchmark the models themselves. Rather, they illustrate a small selection of the GPU choices available to customers. Inference performance increases from top to bottom, most notably for the largest model, Llama 3.3 70B. Depending on your SLAs, preferred model, and GPU requirements, you can treat the results above as a general guideline for inference performance.
Takeaway
In deciding on cloud infrastructure, you should consider:
- Data privacy and sovereignty: Most well-known LLMs are available through internet APIs; however, data access and sovereignty can be a concern when your company wants to deploy an AI solution. Instead, you can deploy your AI models on your own OCI tenancy in the region you know. Oracle has AI infrastructure available in several regions around the world.
- The right accelerated computing platform: OCI is a leading hyperscaler in AI infrastructure. You can train, fine-tune, or serve your AI model, and you can run an AI solution with RAG to access your data. Oracle offers more than fifteen NVIDIA GPU shapes, including zettascale superclusters, plus optimized inference software such as NVIDIA TensorRT-LLM and NVIDIA NIM inference microservices, included as part of NVIDIA AI Enterprise on OCI. Measure your GPU requirements so you can use this hardware and software flexibility to right-size your solution and avoid extra costs (see the sizing sketch after this list).
- Choose the right AI model: Not every application needs the largest or most bleeding-edge frontier model. In many cases a smaller model is enough, so consider exploring these to reduce compute usage and cloud spending.
- Choose the right AI deployment: Oracle has different platforms to deploy your AI solution. You can choose an IaaS solution, where you have full control and elasticity, as presented above, or a fully managed service for fine-tuning and hosting industry-leading generative AI models with OCI Generative AI. Visit our AI Solutions Hub for more information about featured AI solutions, typical scenarios, and OCI AI Blueprints to deploy, scale, and monitor GenAI workloads in minutes.
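To make the right-sizing point concrete, here is a back-of-the-envelope sketch of a common VRAM estimate: model weights take roughly parameters times bytes per parameter, plus headroom for the KV cache and activations. The 20% overhead factor is an assumption for illustration, not an official sizing rule.

```python
# Rough VRAM estimate for serving an LLM (heuristic, not a sizing tool):
# weights = parameters x bits-per-parameter / 8, plus ~20% headroom
# (assumed) for KV cache and activations.
def estimate_vram_gb(params_b: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    weights_gb = params_b * bits_per_param / 8  # e.g., 7B params at 16-bit ~ 14 GB
    return weights_gb * overhead

for name, params_b in [("Mistral 7B", 7), ("Phi-4 14B", 14),
                       ("Gemma 3 27B", 27), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{estimate_vram_gb(params_b, 16):.0f} GB at FP16, "
          f"~{estimate_vram_gb(params_b, 4):.0f} GB at 4-bit")
```

Under this heuristic, Llama 3.3 70B needs roughly 42 GB even at 4-bit quantization, more than the 24 GB of a single NVIDIA A10, which is consistent with the steep drop for that model on VM.GPU.A10.1 in the table above and its recovery on the two-GPU VM.GPU.A10.2.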
Once you decide to use AI in your company, consider data access and privacy, model size, the right deployment model, and overall fit for your use case. OCI offers highly customizable AI infrastructure and fully managed generative AI services that complement Oracle's strengths in data management, integration, and SaaS applications.