Not to go full "Dropbox in a weekend", but if you're technical enough to self-ho...

AnthonyMouse · on April 2, 2023

Is that (i.e. GPT) not still a service?

What people want is something they can run on their own hardware without sending their queries to some third party service which is doing who knows what with them.

This is already possible if you want to mess around with green code that isn't in system repositories yet and buy expensive hardware to make it fast, but you can imagine why some people don't have the time or money for that.

I'm waiting for Intel or AMD to realize there would be a line out the door if they'd make a CPU with an iGPU that could use system memory and run these models at even a quarter of the speed of typical discrete GPUs.

BoorishBears · on April 2, 2023

I mean you don't need to use GPT, it's just if you wanted to build the product in OP (ChatGPT tuned for your site) you would.

Question answering can be tackled by smaller models that run on CPUs: https://huggingface.co/tasks/question-answering

And if it's strictly for personal use there's always the chat-tuned stuff being built on top of LLaMA like Alpaca

> waiting for Intel or AMD to realize

Intel and AMD just got their lunch eaten by Apple Silicon which did exactly that, so I'm sure they're working on it

AnthonyMouse · on April 2, 2023

> I mean you don't need to use GPT, it's just if you wanted to build the product in OP (ChatGPT tuned for your site) you would.

Hence the demand for something else.

> Intel and AMD just got their lunch eaten by Apple Silicon which did exactly that, so I'm sure they're working on it

Apple's GPU doesn't benchmark much different than competing iGPUs for gaming. It may be that the only thing stopping anyone from running this stuff on existing iGPUs is software support.

BoorishBears · on April 2, 2023

> Hence the demand for something else.

Something else would not be ChatGPT tuned for your site. Like I said there are other models, but a lot of people want ChatGPT as they currently interact with it but with additional knowledge about their website. This is that.

> Apple's GPU doesn't benchmark much different than competing iGPUs for gaming.

GPGPU is not gaming. Unified memory means that Apple Silicon's "RAM" can be compared to VRAM for inference.

AnthonyMouse · on April 2, 2023

> Something else would not be ChatGPT tuned for your site.

I suspect a lot of people would be satisfied with anything functionally equivalent regardless of whether it is ChatGPT(TM)-brand.

> GPGPU is not gaming. Unified memory means that Apple Silicon's "RAM" can be compared to VRAM for inference.

The M1 and M2 have a 128-bit memory bus, the same as ordinary dual-channel systems. Only the Pro and Max have more (by 2x and 4x), and it's not obvious that's even the bottleneck here, because the reason they have more is to have enough for the GPU and CPU at the same time, not because a GPU of that size needs that much memory bandwidth when the CPU is idle.

For example, the RTX 4070 Ti is about twice as fast at inference as the RTX 3070 Ti, even though it has slightly less memory bandwidth. And the 4070 Ti has only ~25% more memory bandwidth than the M2 Max GPU but is many times faster.

There is presumably a point at which inference becomes bottlenecked by memory bandwidth rather than compute hardware, but the garden variety x86_64 iGPU may not even be past it, and if it is it's not by much.

The interesting things are a) getting the code written to make existing hardware easy to use, and b) maybe introducing some hefty iGPUs into the server systems with 12 channels per socket and wouldn't run out of memory bandwidth even with significantly more compute hardware, and could then be supplied with hundreds of GB worth of RDIMMs.