Fine-tuned models: vLLM framework
About vLLM
vLLM is an open-source library for fast LLM inference and serving. It is built around PagedAttention for efficient KV-cache management, supports continuous batching, and exposes an OpenAI-compatible API server.
How to get started with vLLM and Local Deployment
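Before packaging fine-tuned weights for a remote deployment, you can smoke-test them with a plain local vLLM install. The commands below are a minimal sketch, assuming a CUDA-capable machine and a local copy of the weights; the weights folder path is a placeholder.
# Install vLLM into the current Python environment
pip install vllm
# Serve the local weights via vLLM's OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server --model /path/to/fine-tuned-weights/main --port 8000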
Retrieving, Packaging Fine-tuned Model Weights and Saving to S3 Object Store
The key idea for packaging fine-tuned model weights is that they must be in HuggingFace Hub format, with a specific set of required files present.
[Figure: fine-tuned model artifacts from ADB (left) vs. original model weights hosted on HuggingFace Hub (right)]
On the left are the model artifacts produced by ADB after fine-tuning; on the right is the original model weight layout as hosted on the HuggingFace Hub. As the comparison shows, the following files should always be present to ensure that inference via vLLM works as expected:
1. config.json
2. generation_config.json
3. weights (.safetensors or .bin format)
4. model.{weight_format}.index.json
5. special_tokens_map.json
6. tokenizer.json
7. tokenizer.model
8. tokenizer_config.json
If you used a model downloaded from the HuggingFace Hub for fine-tuning, the layout produced after fine-tuning should usually work as-is. An example of a valid layout:
.
└── main
    ├── config.json
    ├── generation_config.json
    ├── model-00001-of-00003.safetensors
    ├── model-00002-of-00003.safetensors
    ├── model-00003-of-00003.safetensors
    ├── model.safetensors.index.json
    ├── special_tokens_map.json
    ├── tokenizer.json
    ├── tokenizer.model
    └── tokenizer_config.json
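Before packaging, it can help to confirm that the required files listed earlier are actually present. The snippet below is a minimal sketch, assuming the weights live in the main folder from the tree above; adjust the folder name to your setup.
# Hypothetical sanity check: verify required files before packaging
WEIGHTS_DIR=main  # adjust to your weights folder
for f in config.json generation_config.json model.safetensors.index.json \
         special_tokens_map.json tokenizer.json tokenizer.model tokenizer_config.json; do
  [ -f "${WEIGHTS_DIR}/${f}" ] || echo "Missing: ${f}"
done
# Weight shards can be in .safetensors or .bin format
ls "${WEIGHTS_DIR}"/*.safetensors "${WEIGHTS_DIR}"/*.bin 2>/dev/null | grep -q . \
  || echo "Missing: model weight shards (.safetensors or .bin)"
Note that model.{weight_format}.index.json is only produced for sharded checkpoints; single-file checkpoints will not have it.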
For instance, if you leave the model weights laid out as in the tree above and compress the weights folder into a tarfile as described in Chat models: vLLM framework, your path to the weights should be: /main .
As another example, if you were to put the model weights in the following structure:
.
├── snapshots
│   └── main
│       ├── config.json
│       ├── generation_config.json
│       ├── model-00001-of-00003.safetensors
│       ├── model-00002-of-00003.safetensors
│       ├── model-00003-of-00003.safetensors
│       ├── model.safetensors.index.json
│       ├── special_tokens_map.json
│       ├── tokenizer.json
│       ├── tokenizer.model
│       └── tokenizer_config.json
then you can compress the weights folder into a tarfile as shown below. Note that the path to the weights inside the tarfile depends on which folder you compress:
# tar -czf {tarfile-name}.tar.gz {model-weights-folder-name}
tar -czf mistralai-Mistral-7B-Instruct-v0_2-ft.tar.gz models--mistralai--Mistral-7B-Instruct-v0_2
# or
tar -czf model-weights.tar.gz snapshots # based on the above examples
# or
tar -czf model-weights.tar.gz main # based on the above examples
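The tarfile then needs to be uploaded to the S3 object store that your deployment reads from. The commands below are a sketch using the AWS CLI; the bucket name and prefix are placeholders, and you should use whichever object-store credentials and tooling your setup prescribes.
# List the tarfile contents first; the top-level folder inside becomes the path to the weights
tar -tf model-weights.tar.gz | head
# Upload the tarfile to the S3 object store (bucket and prefix are placeholders)
aws s3 cp model-weights.tar.gz s3://<your-bucket>/<path-to-fine-tuned-model>/model-weights.tar.gz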
How to get started with vLLM and Docker Deployment
Building vLLM Inference Server Docker Image
Because the vLLM Inference Server Docker image fundamentally only contains the vLLM setup, individual developers are NOT required to build a new vLLM image for every deployment. Instead, a standardised repository path hosts the latest image, which can be re-used across different deployments.
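If the existing standardised image is sufficient, pulling it is all you need; the command below is a sketch, with the repository path and tag following the naming convention used further down.
# Re-use the standardised vLLM inference server image (repository and tag are placeholders)
docker pull <dockerRepository>/aicore-vllm:v<ddmmyy>-<minorVersion>
If you do need to build and publish a new image yourself, follow the steps below.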
# From dir where you want to clone the repository:
git clone https://github.tools.sap/Artificial-Intelligence-CoE/AICore-vLLM.git
# From the AICore-vLLM directory
cd inference-server
# Build docker image
docker build -t <dockerRepository>/aicore-vllm:v<ddmmyy>-<minorVersion> .
# e.g.,
# Push docker image to the registered repository
docker push <dockerRepository>/aicore-vllm:v<ddmmyy>-<minorVersion>
# e.g.,
Run vLLM Inference via Docker Image
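The image can be run locally for a quick check before an AICore deployment. The sketch below assumes the image's entrypoint is the standard vLLM OpenAI-compatible API server and that the unpacked weights are mounted into the container; the paths, port, and flags are assumptions to adapt to your build.
# Run the inference server locally (entrypoint flags are assumptions; adjust to the image's actual setup)
docker run --gpus all -p 8000:8000 \
  -v /path/to/unpacked/weights:/models \
  <dockerRepository>/aicore-vllm:v<ddmmyy>-<minorVersion> \
  --model /models/main --port 8000
# Smoke-test the OpenAI-compatible completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/main", "prompt": "Hello", "max_tokens": 16}'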
AICore deployment
Source Code
Complete source code: Source Code {:target="_blank"}.