Model Deployment¶
Overview¶
Deploy FRAMEWORM models to production with export, serving, and containerization.
Export Models¶
TorchScript¶
```python
from frameworm.deployment import ModelExporter

exporter = ModelExporter(model, example_input)
exporter.to_torchscript('model.pt', method='trace')
```
ONNX¶
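The ONNX export call is missing from this page. A sketch, assuming the exporter exposes a `to_onnx` method parallel to `to_torchscript` above (the method name and `opset_version` parameter are guesses, not confirmed API):

```python
from frameworm.deployment import ModelExporter

exporter = ModelExporter(model, example_input)
# Hypothetical method mirroring to_torchscript; opset_version is assumed
exporter.to_onnx('model.onnx', opset_version=17)
```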
Quantization¶
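The quantization snippet is also missing. One common approach, which FRAMEWORM may or may not wrap, is PyTorch's dynamic quantization: weights of `Linear` layers are stored as int8 and activations are quantized on the fly at inference time. A plain-PyTorch sketch with a stand-in model:

```python
import torch
import torch.nn as nn

# Stand-in model for illustration
model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()

# Dynamic quantization: int8 weights, activations quantized at runtime
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

output = quantized(torch.randn(1, 5))
```

Dynamic quantization needs no calibration data, which makes it the easiest variant to apply before serving.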
Serve Models¶
FastAPI Server¶
```python
from frameworm.deployment import ModelServer

server = ModelServer('model.pt')
server.run(host='0.0.0.0', port=8000)
```
Or via CLI:
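The command itself is not shown here; a sketch, assuming the CLI mirrors the `ModelServer.run` arguments above (the `serve` subcommand and flag names are guesses):

```shell
frameworm serve model.pt --host 0.0.0.0 --port 8000
```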
API Endpoints¶
- `POST /predict` - JSON prediction
- `POST /predict/image` - Image prediction
- `POST /predict/batch` - Batch prediction
- `GET /health` - Health check
- `GET /docs` - API documentation
Example Request¶
```shell
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"data": [[1.0, 2.0, 3.0, 4.0, 5.0]]}'
```
Docker Deployment¶
Build Image¶
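The Dockerfile and build command are missing from this page. A minimal sketch, assuming a pip-installable `frameworm` package and the CLI server (file names, package name, and CLI flags are illustrative):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Assumes frameworm is published on PyPI under this name
RUN pip install --no-cache-dir frameworm
COPY model.pt .
EXPOSE 8000
CMD ["frameworm", "serve", "model.pt", "--host", "0.0.0.0", "--port", "8000"]
```

Then build and tag the image:

```shell
docker build -t frameworm-server .
```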
Run Container¶
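The run command is not shown; a typical invocation, assuming an image tagged `frameworm-server` and the server listening on port 8000:

```shell
docker run -d -p 8000:8000 --name frameworm-server frameworm-server
```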
Docker Compose¶
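The compose file itself is missing; a minimal sketch, again assuming an image tagged `frameworm-server` serving on port 8000:

```yaml
services:
  server:
    image: frameworm-server
    ports:
      - "8000:8000"
    restart: unless-stopped
```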
Kubernetes Deployment¶
Deploy¶
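The manifests are missing here. A minimal Deployment-plus-Service sketch (resource names, image tag, and replica count are illustrative, not FRAMEWORM-prescribed):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frameworm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frameworm-server
  template:
    metadata:
      labels:
        app: frameworm-server
    spec:
      containers:
        - name: server
          image: frameworm-server:latest
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: frameworm-server
spec:
  selector:
    app: frameworm-server
  ports:
    - port: 80
      targetPort: 8000
```

Apply it with:

```shell
kubectl apply -f deployment.yaml
```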
Check Status¶
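Standard kubectl checks, assuming the deployment is named `frameworm-server` and labeled `app=frameworm-server`:

```shell
kubectl get deployment frameworm-server
kubectl get pods -l app=frameworm-server
kubectl logs deployment/frameworm-server
```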
Scale¶
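Manual scaling, again assuming a deployment named `frameworm-server`; the autoscale line is an optional alternative that adds a CPU-based HorizontalPodAutoscaler:

```shell
kubectl scale deployment frameworm-server --replicas=5
# Or autoscale on CPU utilization instead of a fixed count
kubectl autoscale deployment frameworm-server --min=2 --max=10 --cpu-percent=80
```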
Production Best Practices¶
- Use quantization for faster inference
- Enable caching for repeated requests
- Set up monitoring (Prometheus/Grafana)
- Use load balancing (nginx/K8s service)
- Implement rate limiting
- Add authentication for sensitive models
Performance Optimization¶
Batch Processing¶
Process multiple samples together:
```python
# Better: one forward pass over a batch of 32
output = model(batch_of_32)

# Slower: one sample at a time
for sample in samples:
    output = model(sample)
```
ONNX Runtime¶
ONNX Runtime typically gives 2-5x faster inference:

```python
from frameworm.deployment import ONNXInferenceSession

session = ONNXInferenceSession('model.onnx')
output = session.run(input_data)
```
GPU Inference¶
Enable CUDA for faster serving:
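The CUDA snippet is missing here. A plain-PyTorch sketch that moves model and inputs to the GPU and falls back to CPU when none is available (the model is a stand-in; FRAMEWORM's server may handle device placement itself):

```python
import torch
import torch.nn as nn

# Pick the GPU when available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in model for illustration
model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 2))
model = model.to(device).eval()

# Inputs must live on the same device as the model
batch = torch.randn(32, 5, device=device)
with torch.inference_mode():
    output = model(batch)
```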