Scalable Model Deployment and Serving on TensorOpera AI

TensorOpera AI - Your GenAI Platform at Scale - provides a full stack of GenAI model service capabilities, including model deployment, inference, training, AI agents, GPU job scheduling, multi-cloud/decentralized GPU management, experiment tracking, and model service monitoring. In this article, we introduce how TensorOpera model deployment and serving work.

Five-layer Model Serving Architecture

Figure 1: Overview of TensorOpera model deployment and serving

The figure above shows how TensorOpera AI views the model inference service. It is organized into a five-layer architecture:

  • Layer 0 (L0) - Deployment and Inference Endpoint. This layer provides HTTPS APIs, model customization (training/fine-tuning), scalability, scheduling, ops management, logging, monitoring, security (e.g., a trust layer for LLMs), compliance (SOC 2), and on-prem deployment.
  • Layer 1 (L1) - TensorOpera Launch Scheduler. It collaborates with the L0 MLOps platform to handle the deployment workflow, dispatching serving code and configuration to GPU devices.
  • Layer 2 (L2) - TensorOpera Serving Framework. It is a managed framework for serving scalability and observability that loads the serving engine and the user-level serving code.
  • Layer 3 (L3) - Model Definition and Inference APIs. Developers define the model architecture, the inference engine that runs the model, and the schema of the model inference APIs (a sketch of such user-level code follows this list).
  • Layer 4 (L4) - Inference Engine and Hardware. This is the layer where many machine learning systems researchers and hardware accelerator companies work to optimize inference latency and throughput. At TensorOpera, we have developed our in-house inference library ScaleLLM (https://scalellm.ai).
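
To make Layer 3 concrete, here is a minimal, hypothetical Python sketch of user-level serving code. The class and method names are illustrative assumptions, not the actual TensorOpera interface; please refer to the platform documentation for the real one.

# Hypothetical sketch of user-level serving code (L3): model definition + inference API schema.
# Names are illustrative assumptions, not the actual TensorOpera interface.
from transformers import pipeline

class ChatPredictor:
    def __init__(self, model_name: str = "EleutherAI/pythia-70m"):
        # Which engine/hardware actually runs the model is an L4 concern.
        self.generator = pipeline("text-generation", model=model_name)

    def predict(self, request: dict) -> dict:
        # The request/response fields define the inference API schema (L3).
        prompt = request.get("text", "")
        completion = self.generator(prompt, max_new_tokens=64)[0]["generated_text"]
        return {"generated_text": completion}

In this picture, the L2 serving framework loads such a class, the L1 scheduler places it on GPU devices, and L0 exposes it behind a managed HTTPS endpoint.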

Distributed/Cross-cloud Model Endpoint

A unique feature of TensorOpera AI model serving is that a single endpoint can run across geo-distributed GPU clouds. It further scales AI agents by enabling a distributed vector DB. Such a system provides a range of benefits, from improved reliability to better performance:

  • Increased Reliability and Availability. By spreading resources across multiple clouds, developers can ensure that if one provider experiences downtime, the model service continues to run on the other providers. Automated failover redirects traffic to operational instances in the event of a failure (a simplified client-side illustration follows this list).
  • Scalability. Cloud services typically offer the ability to scale resources up or down. Multiple providers can offer even greater flexibility and capacity. Distributing the load across different clouds can help manage traffic spikes and maintain performance.
  • Performance. Proximity to users can reduce latency. By running endpoints on different clouds, developers can optimize for geographic distribution. Different clouds might offer specialized GPU types that are better suited for particular types of workloads.
  • Cost Efficiency. Prices for cloud services vary. Because TensorOpera AI uses multiple providers, developers can take advantage of the best pricing models and can bid for unused capacity at a lower price, which may be available from different providers at different times.
  • Risk Management and Data Sovereignty. Distributing across different regions can mitigate risks associated with policy regulations affecting service availability. For compliance reasons, developers might need to store and process data in specific jurisdictions. Multiple clouds can help meet these requirements.
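
The sketch below illustrates the failover idea from a client's point of view: it tries a list of regional endpoint replicas in order and falls back to the next one on error. The URLs and payload shape are placeholders, not real TensorOpera endpoints; in practice the platform performs this redirection on the server side.

# Illustrative client-side failover across geo-distributed endpoint replicas.
# URLs and payload are placeholders; the platform normally handles this redirection automatically.
import requests

ENDPOINT_REPLICAS = [
    "https://us-west.example-gpu-cloud.com/inference/my_endpoint",
    "https://eu-central.example-gpu-cloud.com/inference/my_endpoint",
]

def robust_predict(payload: dict, timeout: float = 10.0) -> dict:
    last_error = None
    for url in ENDPOINT_REPLICAS:
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # this replica is unavailable; try the next one
    raise RuntimeError(f"All endpoint replicas failed: {last_error}")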

ScaleLLM: Serverless and Memory-efficient Model Serving Engine for Large Language Models

Figure 2: Overview of ScaleLLM

ScaleLLM is a serverless and memory-efficient model serving engine for large language models (LLMs). With an industrial-grade design and optimization of model inference techniques, including weight quantization, KV cache quantization, fast attention, and fast decoding, ScaleLLM has achieved the following remarkable results:

  • ScaleLLM can now host one LLaMA-2-13B-chat inference service on a single NVIDIA RTX 4090 GPU. The inference latency is up to 1.88 times lower than that of a single service using vLLM on a single A100 GPU.
  • ScaleLLM can now host three LLaMA-2-13B-chat inference services on a single A100 GPU. The average inference latency for these three services is 1.21 times lower than that of a single service using vLLM on a single A100 GPU.
  • In response to the demand for generating the first token after a prompt within 1 second, ScaleLLM has successfully migrated the inference service for LLaMA-2-13B-chat to a single L4 or T4 GPU. Such a fast response time can significantly improve the quality of experience for the end users.

These achievements demonstrate ScaleLLM's ability to cut the cost of LLM inference to 20x less than serving on A100 GPUs on AWS.
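
To give a feel for one of the techniques behind these results, the toy sketch below shows symmetric int8 weight quantization. It is a conceptual illustration only, not ScaleLLM's actual implementation.

# Toy illustration of symmetric int8 weight quantization (conceptual only;
# not ScaleLLM's actual implementation).
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # approximate reconstruction

w = np.random.randn(4096, 4096).astype(np.float32)  # one fp32 weight matrix
q, scale = quantize_int8(w)
print("memory: %.0f MB fp32 -> %.0f MB int8" % (w.nbytes / 1e6, q.nbytes / 1e6))

Storing weights (and, analogously, the KV cache) in lower precision is what helps a 13B-parameter model fit and serve on a single consumer GPU such as the RTX 4090.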

Thanks to the memory optimization offered by ScaleLLM, developers can now smoothly deploy AI models across a decentralized network of consumer-grade GPUs. To facilitate this, we further introduce the TensorOpera AI platform (https://TensorOpera.ai). TensorOpera AI delivers comprehensive APIs, CLIs, and a user-friendly operational UI, empowering scientists and engineers to scale their model deployments on decentralized on-demand instances. Notably, these deployed endpoints can span multiple GPU instances, complete with failover and fault-tolerance support.

ScaleLLM, in conjunction with the TensorOpera AI platform, marks a paradigm shift in how AI models are deployed. It equips developers with the ability to seamlessly deploy AI models across a decentralized network of consumer-grade GPUs, such as the RTX 4090 and RTX 3090. Beyond the computational efficiency this approach offers, it also provides significant cost advantages, being 20 times more economical than comparable services on AWS. This blend of functionality and affordability positions it as a revolutionary force in the AI deployment arena.

For more details, please read our blog: https://blog.tensoropera.ai/scalellm-unlocking-llama2-13b-llm-inference-on-consumer-gpu-rtx-4090-powered-by-fedml-nexus-ai/

Zero-code Serving: Pre-Built Models by Clicking "Deploy" Twice

  1. Find the Model and Click "Deploy"
  2. Select GPU Type and Click "Deploy"

Quick Start: Serve Pre-built LLMs by Command Line and Dashboard

  1. Create an LLM model from Hugging Face

Use the fedml model create command to create a model. Use -n to indicate the model card name and -m to indicate a pre-built model. In this quick start example, we will deploy the EleutherAI/pythia-70m model from Hugging Face. To use a Hugging Face model, add the hf: prefix to the model name, so the full model name is hf:EleutherAI/pythia-70m.

fedml model create -n hf_model -m hf:EleutherAI/pythia-70m

  2. Test the model on the local machine

Use the fedml model deploy --local command to deploy the model on the current machine. Use -n to indicate the model card name.

fedml model deploy -n hf_model --local

  3. Push the model to TensorOpera AI

Use the fedml model push command to push the model. Use -n to indicate the model card name and -k to pass your API key (replace $api_key with your own key).

fedml model push -n hf_model -k $api_key

  4. Find the Model on the TensorOpera AI Dashboard
  5. Edit the Model Information
  6. Deploy the Model to the Serverless GPU Cloud
  7. Check the Endpoint Status (a sample client request is sketched below)
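
Once the endpoint shows a healthy status, it can be queried over HTTPS. The snippet below is a generic Python sketch; the actual URL, authentication header, and request schema for your endpoint are shown on its detail page in the dashboard, so treat the values here as placeholders.

# Generic sketch of querying a deployed endpoint over HTTPS.
# Replace the URL, API key, and payload with the values shown on your endpoint page.
import requests

API_KEY = "YOUR_TENSOROPERA_API_KEY"          # placeholder
ENDPOINT_URL = "https://<your-endpoint-url>"  # copy from the endpoint detail page

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "What is federated learning?"},
    timeout=30,
)
response.raise_for_status()
print(response.json())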

Further Advances: Customize, Deploy, and Scale

The TensorOpera AI platform provides many further capabilities to customize, deploy, and scale any model; a specific deployment workflow is shown in the diagram above. Developers can deploy a custom Docker image or simply upload their serving Python scripts to the TensorOpera AI platform for deployment on the distributed/decentralized GPU cloud. For detailed steps, please refer to https://docs.tensoropera.ai

  1. Create Your Customized Model

For more details, please see the TensorOpera documentation: https://docs.tensoropera.ai/deploy/create_model
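
As a rough sketch of what a customized model's serving script might contain, the hypothetical example below wraps a small PyTorch model with its own pre- and post-processing. The class names, file names, and entry-point interface are assumptions for illustration; the documentation linked above describes the actual packaging format.

# Hypothetical serving script for a customized model (names and interface are
# illustrative assumptions; see the documentation for the actual packaging format).
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_features: int = 16, num_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_features, 32), nn.ReLU(), nn.Linear(32, num_classes))

    def forward(self, x):
        return self.net(x)

class CustomPredictor:
    def __init__(self, checkpoint_path: str = "model.pt"):  # placeholder checkpoint name
        self.model = TinyClassifier()
        self.model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
        self.model.eval()

    def predict(self, request: dict) -> dict:
        features = torch.tensor(request["features"], dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            label = int(self.model(features).argmax(dim=-1).item())
        return {"label": label}

A script like this, together with its dependencies and weights, would be packaged into the model card created with fedml model create; pushing and deploying it then follow the same steps as in the quick start above.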

  2. Bring your own GPUs: Running a Single Model Endpoint across Geo-distributed GPUs

When allocating on-premise resources for a specific endpoint, developers can select GPU workers across geo-distributed machines.

The geo-distributed GPUs can be bound under Compute -> My Servers with the one-line command "fedml login <API key>". For more details, please see the TensorOpera documentation: https://docs.tensoropera.ai/deploy/deploy_on_premise

  3. Manage & Monitor Deployed Models

About TensorOpera, Inc.

TensorOpera is your generative AI platform at scale, enabling developers and enterprises to build and commercialize their own generative AI applications easily, scalably, and economically. Its flagship product, TensorOpera AI, provides unique features spanning enterprise AI platforms, model deployment, model serving, AI agent APIs, launching training/inference jobs on serverless/decentralized GPU clouds, experiment tracking for distributed training, federated learning, security, and privacy.