Know what uses less? No LLMs
Yay, I’m doing my part!
Try using a 1-bit LLM to test the article’s claim.
The perplexity loss is staggering: something like 75% of the accuracy is lost, or more. It effectively turns a 30-billion-parameter model into a 7-billion-parameter one.
I highly recommend trying to replicate their results.
But since it takes 10% of the space (VRAM, etc.), it sounds like they could just start with a larger model and still come out ahead.
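A rough back-of-envelope check of that 10% figure, assuming an FP16 baseline (16 bits per weight) and counting each ternary weight at log2(3) ≈ 1.58 bits. It only counts weight storage; activations, KV cache, and real packing overhead are ignored, and the function name is made up for illustration.

```python
import math

def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage only, in GB (1e9 bytes). Illustrative helper."""
    return n_params * bits_per_weight / 8 / 1e9

# A ternary weight {-1, 0, +1} carries log2(3) bits of information.
TERNARY_BITS = math.log2(3)

fp16_gb = weight_memory_gb(30e9, 16)               # 30B model, FP16 (assumed baseline)
ternary_gb = weight_memory_gb(30e9, TERNARY_BITS)  # same model, ideally packed ternary
ratio = ternary_gb / fp16_gb

# How many ternary parameters fit in the memory of a 7B FP16 model?
equal_budget_params = 7e9 * 16 / TERNARY_BITS

print(f"FP16: {fp16_gb:.1f} GB, ternary: {ternary_gb:.1f} GB, ratio: {ratio:.3f}")
print(f"Ternary params in a 7B FP16 budget: {equal_budget_params / 1e9:.1f}B")
```

Under those assumptions the ratio comes out just under 10%, so the memory of a 7B FP16 model could hold roughly ten times as many ternary weights.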
There’s actually a parameter-to-parameter perplexity improvement for BitNet-1.58, and it increases as the model scales up.
So yes, the perplexity issues with post-training quantization are apparent, but if you train the quantization in from the start, it is better than full precision (FP).
Which makes sense through the lens of the superposition hypothesis, where the weights are actually representing a hyperdimensional virtual vector space. If the weights have too much precision, competing features might compromise on fuzzier representations instead of restructuring the virtual network around better-matching nodes.
Constrained weight precision is probably going to be the future of pretraining within a generation or two looking at the data so far.
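For anyone wondering what "1.58-bit" weights look like in practice, here is a minimal sketch of absmean ternary rounding as I understand it from the BitNet b1.58 paper: scale by the mean absolute weight, round, and clip to {-1, 0, +1}. The function name and the `eps` default are my own assumptions. In quantization-aware training something like this would sit in the forward pass while higher-precision weights get updated, and that part isn't shown.

```python
def absmean_ternary(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} using the mean absolute value as the scale."""
    gamma = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    # Round to the nearest integer, then clip into [-1, 1].
    quantized = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return quantized, gamma

weights = [0.9, -0.05, 0.4, -1.2, 0.0]  # made-up example values
q, gamma = absmean_ternary(weights)
print(q, gamma)
```

Small weights collapse to 0 and everything else keeps only its sign, which is why the scale `gamma` has to be stored alongside the ternary values.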
We invented multi-bit models so we could get more accuracy, since neural networks are based off human brains, which are 1-bit models themselves. A 2-bit neuron is 4 times as capable as a 1-bit neuron but only double the size and power requirements. This whole thing sounds like BS to me. But then again, maybe complexity is more efficient than per-unit capability, since that’s the tradeoff.
Human brains aren’t 1-bit models. Far from it, actually. I’m not an expert, but I do know that neurons in the brain encode different signal strengths in their firing frequency.
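A toy illustration of that rate-coding idea, assuming a very simplified model where a neuron fires at each time step with a probability equal to the signal strength. This is not a biophysical model, and the function names are made up. Each spike is binary, but the firing rate over time still carries a graded value.

```python
import random

def spike_train(intensity, n_steps, rng):
    """Binary spikes: fire (1) at each step with probability `intensity`."""
    return [1 if rng.random() < intensity else 0 for _ in range(n_steps)]

def firing_rate(spikes):
    """Fraction of time steps in which the neuron fired."""
    return sum(spikes) / len(spikes)

rng = random.Random(42)  # fixed seed so the run is repeatable
weak = spike_train(0.2, 5000, rng)
strong = spike_train(0.8, 5000, rng)

print(f"weak signal rate:   {firing_rate(weak):.2f}")
print(f"strong signal rate: {firing_rate(strong):.2f}")
```

Both trains contain only 0s and 1s, yet the two rates are clearly distinguishable, so "on/off" at the spike level doesn't imply 1 bit of information per neuron.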
Firing on and off.
We really don’t know jack shit, but we know more than enough to know firing rate is hugely important.
The network architecture seems to create a virtualized hyperdimensional network on top of the actual network nodes, so the node precision really doesn’t matter much as long as quantization occurs in pretraining.
If it’s post-training, it’s degrading the precision of the already encoded network, which is sometimes acceptable but always lossy. But when it’s done during pretraining, it actually seems to be a net improvement over higher-precision weights, even if you throw efficiency concerns out the window.
You can see this in the perplexity graphs in the BitNet-1.58 paper.
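To make the "always lossy" part of post-training quantization concrete, here is a small sketch that rounds an already "trained" set of weights onto coarser and coarser grids and measures the error. The weights are random stand-ins, the symmetric uniform quantizer is just one common scheme picked for illustration, and weight error is not the same thing as perplexity. It says nothing about the trained-from-scratch case.

```python
import random

def quantize_uniform(weights, bits):
    """Symmetric uniform rounding after the fact (illustrative scheme)."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 1 for 2 bits (ternary)
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Stand-in for pretrained weights: small Gaussian values (an assumption).
weights = [random.gauss(0, 0.02) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, quantize_uniform(weights, bits))
          for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

The error is never zero and grows as the bit width shrinks, because the network was never given a chance to adapt to the coarser grid.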
None of those words are in the bible
No, but some alarmingly similar ideas are in the heretical stuff actually.
We need to scale fusion
Making AI more efficient will just mean more AI
Smaller and speedier means larger token windows and greater variety of models.
Not less energy.