draglord

Forerunner
I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have 3x 3090s and a 4080 Super 16GB.

I read on Reddit that people were able to run those models on a single 3090 with 4-bit quantization. I am trying to run the same models using 4-bit quantization but have been unsuccessful so far.

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.

I was successfully able to run Llama 3.1 8B Instruct and Mistral 7B Instruct locally (without 4-bit quantization).

Does anyone have any idea how I can run those models?
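For reference, "passing 4-bit quantization as a parameter" means something like the following transformers + bitsandbytes sketch (the model id and generation settings below are assumptions, not a verified recipe):

# Sketch: loading Llama 3.1 70B with on-the-fly 4-bit (NF4) quantization via bitsandbytes.
# The model id is an assumption (the repo is gated on HF); point it at whatever checkpoint you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bit
    bnb_4bit_quant_type="nf4",              # NF4 generally beats plain FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across all visible GPUs
)

inputs = tokenizer("Say hello in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))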
 
python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-70b-instruct-awq \
  --dtype auto \
  --tensor-parallel-size 3 \
  --engine-use-ray \
  --gpu-memory-utilization 0.93

Try with the three 3090s only.
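Once the server is up it speaks the OpenAI-compatible API on port 8000 by default; a quick smoke test looks something like this (assuming the default host/port):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/llama-3-70b-instruct-awq",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'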
 

Alright, I'll give vLLM a try.
How many tokens/s are you getting?
Sorry, I did not see that.
Hi @tr27

I ran the command and vllm serve, but nothing happens. I tried to hit the localhost:8000 endpoint, but it could never establish a connection. Can you tell me what I could be doing wrong?
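One possible culprit: Llama 3 70B has 64 attention heads, and vLLM requires that count to be divisible by the tensor-parallel size, so a TP size of 2 or 4 works where 3 errors out before the port ever opens. A sketch of the newer vllm serve form plus a quick endpoint check (flag names assume a recent vLLM release; --max-model-len is an assumption to keep the KV cache small):

# Sketch: vllm serve with an even tensor-parallel size; the port only opens once loading finishes.
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 4096 \
  --port 8000

# In another shell, check whether the server is actually listening:
curl http://localhost:8000/v1/models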
 
Did you try ollama.com? There's an option --all-gpus, if I'm not mistaken, and if it overflows GPU VRAM it spills some into regular RAM...
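A sketch of that route, assuming the library's llama3.1:70b tag (I can't vouch for an --all-gpus flag; pinning GPUs via CUDA_VISIBLE_DEVICES is the part I'd rely on, and the device indices are assumptions):

# Sketch: Ollama with the server pinned to the three 3090s (check indices with nvidia-smi).
# The library's llama3.1:70b tag is already a 4-bit quant by default.
CUDA_VISIBLE_DEVICES=0,1,2 ollama serve &
ollama pull llama3.1:70b
ollama run llama3.1:70b "Say hello in one sentence."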
 
@draglord - how are you running all 4 cards? A single machine or eGPU?
Hi,

Yes, a single machine. Risers and PCIe cables.

Yeah, I tried Ollama. It has an easy-to-use interface, but I have 3x 3090s and a 4080, so it only uses 16 GB on each card.

Plus you cannot download models from HF.
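One workaround sketch for the HF point: download a GGUF quant from HF yourself and import it with a Modelfile (the repo and file names below are placeholders, not real identifiers):

# Sketch: importing an HF-hosted GGUF into Ollama via a Modelfile.
# <org>/<repo> and <file> are placeholders -- substitute a real Q4 GGUF quant.
huggingface-cli download <org>/<repo> <file>.Q4_K_M.gguf --local-dir ./models

cat > Modelfile <<'EOF'
FROM ./models/<file>.Q4_K_M.gguf
EOF

ollama create llama31-70b-q4 -f Modelfile
ollama run llama31-70b-q4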
 

What other tools have you tried apart from ollama.com? Have you tried the https://lmstudio.ai app? It can run as a server/service too and is cross-platform. It supports HF model downloads as well.
Or even just the bare llama.cpp backend, but I don't know the feasibility of that...
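For the bare llama.cpp route, a sketch (recent builds name the binary llama-server; the GGUF path and device indices are placeholders/assumptions):

# Sketch: llama.cpp's built-in OpenAI-compatible HTTP server on the three 3090s.
# -ngl 99 offloads all layers to GPU; layers get split across the visible CUDA devices.
CUDA_VISIBLE_DEVICES=0,1,2 ./llama-server \
  -m ./models/<your-70b-q4>.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080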
 
No, I hadn't heard about LM Studio.

I just tried vLLM. I figured out how to use it; it supports HF downloads, but it can only use an even number of GPUs.
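For reference, the kind of invocation that works with an even GPU count, pinned to two matched 3090s (device indices and --max-model-len are assumptions; 2x 24 GB is tight for a 70B AWQ model):

# Sketch: tensor parallelism across two of the 3090s only.
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 4096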
 
What's the best way to run LLMs on a Mac? I have an M2 Max with 64 GB of RAM.