Inference of LLaMA model in pure C/C++

Harry Potter | 04-14-2023, 03:21 PM | 4801 Views | 0 Replies
The main goal is to run the model using 4-bit quantization on a MacBook.

- Plain C/C++ implementation without dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON and the Accelerate framework
- AVX2 support for x86 architectures
- Mixed F16/F32 precision
- 4-bit quantization support
- Runs on the CPU
I have no idea if it works correctly. Please do not draw conclusions about the models based on the results from this implementation. For all I know, it could be completely wrong. This project is for educational purposes.

https://github.com/ggerganov/llama.cpp
