Service
Pipeline
Outsource compute. Keep the understanding.
We are scoping a bring-your-own-model compression service. The principle is simple: you keep the weights, the evals, and the reasoning; we return a smaller model that decides the same way.
Currently taking conversations. We will publish a fixed-scope offering once we have run it end-to-end with three teams.
How it works
Three stages. Two weeks of calendar time, mostly compute. You keep everything at every step.
- 01
Intake
You send the FP32 checkpoint and the workload that matters. We read the architecture, profile per-layer sensitivity, and propose a compression schedule. NDA first, weights second.
- 02
Compression
Pruning, quantization, and selective distillation run against your eval suite. Every cut is logged. Every restore path is preserved. We do not delete sources.
- 03
Handoff
You get a quantized checkpoint, a signed eval suite, a model card, and the calibration set used. Everything reproducible from one release hash. If you walk away, you walk away with everything.
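The per-layer sensitivity pass in the intake stage can be sketched roughly like this: quantize one layer at a time, re-run the eval, and record the delta against the FP32 baseline. This is a minimal illustration, not our actual tooling; the layer names, the toy network, and the stand-in eval metric are all assumptions for the sketch.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: round-trip through 8-bit levels."""
    scale = max(float(np.abs(w).max()), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

def layer_sensitivity(weights, eval_fn):
    """Quantize one layer at a time; report the eval delta vs the FP32 baseline."""
    baseline = eval_fn(weights)
    deltas = {}
    for name in weights:
        trial = dict(weights)
        trial[name] = quantize_int8(weights[name])
        deltas[name] = baseline - eval_fn(trial)
    return deltas

# Toy two-layer network and a stand-in eval metric (illustrative only).
rng = np.random.default_rng(0)
weights = {"fc1": rng.normal(size=(8, 8)), "fc2": rng.normal(size=(8, 4))}
calib = rng.normal(size=(16, 8))

def eval_fn(w):
    h = np.maximum(calib @ w["fc1"], 0.0)        # ReLU hidden layer
    return -float(((h @ w["fc2"]) ** 2).mean())  # higher is "better"

deltas = layer_sensitivity(weights, eval_fn)
```

Layers with a large delta stay in higher precision; layers with a negligible delta are safe to cut harder. That ranking is what drives the proposed compression schedule.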
What you receive
- Quantized weights (INT8 or per-layer mixed precision).
- The frozen eval suite we ran against, plus the seed.
- A model card with per-layer sensitivity and accuracy delta vs your FP32 baseline.
- A deployment guide for the target hardware you specified.
- A release hash that reproduces all of the above from the source you sent.
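A release hash of this kind can be sketched as a hash over a manifest of artifact digests: digest each file, then hash the sorted manifest so the result is independent of file order. The file names below are hypothetical placeholders, not the actual bundle layout.

```python
import hashlib
import json

def release_hash(artifacts):
    """Digest each artifact, then hash the sorted manifest: one hash, whole bundle."""
    manifest = {name: hashlib.sha256(blob).hexdigest()
                for name, blob in artifacts.items()}
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# Hypothetical release bundle; real artifacts would be the deliverables above.
bundle = {
    "weights.int8": b"<quantized checkpoint bytes>",
    "eval_suite.json": b"<frozen eval suite + seed>",
    "model_card.md": b"<per-layer sensitivity report>",
    "calibration.tar": b"<calibration set>",
}
tag = release_hash(bundle)
```

Because the manifest is canonicalized before hashing, the same inputs always produce the same tag, and changing any single artifact changes it.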
Reach out
If you are shipping a model into production and the inference cost is becoming a problem, write us. Lead with the model and the target hardware.
[email protected]