* Than a (cluster of) H100 GPUs with 8-bit quantisation and FasterTransformer, on Llama2 70B
This means the biggest LLMs in the world running faster than you can read. It also means a universe of completely new capabilities and possibilities for how we work, unlocked by near-instant inference of models with superhuman intelligence.
When a trained language model is run for a user, over 99% of the total compute time is spent not on arithmetic but on moving model weights from memory to the processor.
* Llama2 70B at 8-bit quantisation, on an 80 GB A100 GPU
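A back-of-envelope calculation illustrates why decoding is memory-bound. The bandwidth and throughput figures below are assumed round numbers for an 80 GB A100 (roughly 2 TB/s HBM bandwidth, roughly 624 TOPS peak INT8), not measured results:

```python
# Sketch: per-token time for Llama2 70B at 8-bit on an 80 GB A100.
# All hardware figures are assumptions for illustration, not measurements.

PARAMS = 70e9              # Llama2 70B parameter count
BYTES_PER_PARAM = 1        # 8-bit quantisation: one byte per weight
HBM_BANDWIDTH = 2.0e12     # ~2 TB/s memory bandwidth (assumed)
INT8_THROUGHPUT = 624e12   # ~624 TOPS peak INT8 compute (assumed)

# Generating one token streams every weight from memory once, and needs
# roughly 2 ops per parameter (one multiply, one add).
t_memory = PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH   # ~35 ms
t_compute = 2 * PARAMS / INT8_THROUGHPUT              # ~0.22 ms

fraction = t_memory / (t_memory + t_compute)
print(f"memory: {t_memory * 1e3:.1f} ms, compute: {t_compute * 1e3:.2f} ms")
print(f"memory share of per-token time: {fraction:.1%}")
```

Under these assumptions, moving weights accounts for over 99% of the per-token time, consistent with the claim above.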
The time taken to run the arithmetic for generating a single word
The time taken moving parameters from memory to the processor for each word
The total time a Fractile processor takes to generate the same word
We are a team of scientists, engineers and hardware designers who are committed to building the solutions that the AI revolution requires to keep scaling. We believe that the most important breakthroughs will come from trying solutions that others are not, to serious problems we actually face.
Walter and Yuhang met during their PhDs at the University of Oxford, where they were doing AI research in different labs. Seeing a convergence across multiple domains in AI towards the use of a single type of model, the transformer, they started working on a new approach to accelerated computing, designed from the ground up to run these models at the fastest possible speeds. Having satisfied themselves that these principles promised to allow model inference orders of magnitude faster than the existing state of the art, they founded Fractile in summer 2022, shortly before the explosion in deployment of large language models made the need for a better way to run these networks yet more urgent.