draglord

Forerunner
I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have 3x 3090s and a 4080 Super 16GB.

I read on Reddit that people were able to run those models on a single 3090 with 4-bit quantization. I am trying to run the same models using 4-bit quantization but have been unsuccessful so far.

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.

I was successfully able to run Llama 3.1 8B Instruct and Mistral 7B Instruct locally (without 4-bit quantization).

Does anyone have any idea how I can run those models?
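For reference, "passing 4-bit quantization as a parameter" means something like the following transformers + bitsandbytes sketch (the model id and generation settings below are assumptions, not a verified recipe):

# Sketch: loading Llama 3.1 70B with on-the-fly 4-bit (NF4) quantization via bitsandbytes.
# The model id is an assumption (the repo is gated on HF); point it at whatever checkpoint you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bit
    bnb_4bit_quant_type="nf4",              # NF4 generally beats plain FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across all visible GPUs
)

inputs = tokenizer("Say hello in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))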
 
python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-70b-instruct-awq \
  --dtype auto \
  --tensor-parallel-size 3 \
  --engine-use-ray \
  --gpu-memory-utilization 0.93

Try with the three 3090s only.
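Once the server is up it speaks the OpenAI-compatible API on port 8000 by default; a quick smoke test looks something like this (assuming the default host/port):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/llama-3-70b-instruct-awq",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'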
 

Alright, I'll give vLLM a try.
How many tokens/s are you getting?
Sorry, I did not see that.
Hi @tr27

I ran the command and vllm serve, but nothing happens. I tried to hit the localhost:8000 endpoint, but it could never establish a connection. Can you tell me what I could be doing wrong?
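One possible culprit: Llama 3 70B has 64 attention heads, and vLLM requires that count to be divisible by the tensor-parallel size, so a TP size of 2 or 4 works where 3 errors out before the port ever opens. A sketch of the newer vllm serve form plus a quick endpoint check (flag names assume a recent vLLM release; --max-model-len is an assumption to keep the KV cache small):

# Sketch: vllm serve with an even tensor-parallel size; the port only opens once loading finishes.
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 4096 \
  --port 8000

# In another shell, check whether the server is actually listening:
curl http://localhost:8000/v1/models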
 
Did you try ollama.com? There's an option --all-gpus, if I'm not mistaken, and if it overflows GPU VRAM it spills some into regular RAM...
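A sketch of that route, assuming the library's llama3.1:70b tag (I can't vouch for an --all-gpus flag; pinning GPUs via CUDA_VISIBLE_DEVICES is the part I'd rely on, and the device indices are assumptions):

# Sketch: Ollama with the server pinned to the three 3090s (check indices with nvidia-smi).
# The library's llama3.1:70b tag is already a 4-bit quant by default.
CUDA_VISIBLE_DEVICES=0,1,2 ollama serve &
ollama pull llama3.1:70b
ollama run llama3.1:70b "Say hello in one sentence."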
 
@draglord - how are you running all 4 cards? A single machine or eGPU?
Hi,

Yes, a single machine. Risers and PCIe cables.

Yeah, I tried Ollama. It has an easy-to-use interface, but I have 3x 3090s and a 4080, so it only uses 16 GB on each card.

Plus you cannot download models from HF.
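One workaround sketch for the HF point: download a GGUF quant from HF yourself and import it with a Modelfile (the repo and file names below are placeholders, not real identifiers):

# Sketch: importing an HF-hosted GGUF into Ollama via a Modelfile.
# <org>/<repo> and <file> are placeholders -- substitute a real Q4 GGUF quant.
huggingface-cli download <org>/<repo> <file>.Q4_K_M.gguf --local-dir ./models

cat > Modelfile <<'EOF'
FROM ./models/<file>.Q4_K_M.gguf
EOF

ollama create llama31-70b-q4 -f Modelfile
ollama run llama31-70b-q4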
 

What other tools have you tried apart from ollama.com? Have you tried the https://lmstudio.ai app? It can run as a server/service too and is cross-platform. It supports HF model downloads as well.
Or even just the bare llama.cpp backend, but I don't know the feasibility of that...
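For the bare llama.cpp route, a sketch (recent builds name the binary llama-server; the GGUF path and device indices are placeholders/assumptions):

# Sketch: llama.cpp's built-in OpenAI-compatible HTTP server on the three 3090s.
# -ngl 99 offloads all layers to GPU; layers get split across the visible CUDA devices.
CUDA_VISIBLE_DEVICES=0,1,2 ./llama-server \
  -m ./models/<your-70b-q4>.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080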
 
No, I hadn't heard about LM Studio.

I just tried vLLM. I figured out how to use it; it supports HF downloads, but it can only use an even number of GPUs.
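For reference, the kind of invocation that works with an even GPU count, pinned to two matched 3090s (device indices and --max-model-len are assumptions; 2x 24 GB is tight for a 70B AWQ model):

# Sketch: tensor parallelism across two of the 3090s only.
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 4096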
 
What's the best way to run LLMs on a Mac? I have an M2 Max with 64 GB of RAM.