EXAONE-Deep Deployment Guide
Deployment Overview
EXAONE-Deep models support multiple deployment methods, from local workstations to cloud servers and from high-performance GPUs to resource-constrained devices. You can choose the deployment option that matches your hardware and use case. This guide provides detailed instructions on how to deploy and use EXAONE-Deep models in each of these scenarios.
Supported Model Versions
- EXAONE-Deep 32B
Full-size version, delivers the highest performance
- EXAONE-Deep 7.8B
Mid-size version, balances performance and resource requirements
- EXAONE-Deep 2.4B
Lightweight version, suitable for resource-constrained environments
Supported Quantization Formats
- FP16/BF16
Full-precision version, provides the highest accuracy
- AWQ (4-bit)
Activation-aware weight quantization, significantly reduces memory requirements
- GGUF
Portable format for llama.cpp and similar inference frameworks (see the sketch after this list)
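For the GGUF path, the sketch below shows one way to run a local GGUF build of EXAONE-Deep with the llama-cpp-python bindings (pip install llama-cpp-python). The file name and the GPU-offload setting are illustrative placeholders; point model_path at whatever EXAONE-Deep GGUF conversion you have downloaded and tune n_gpu_layers to your available VRAM.

# Minimal sketch, assuming llama-cpp-python is installed and an EXAONE-Deep GGUF file
# exists locally; the path below is a placeholder, not an official artifact name.
from llama_cpp import Llama

llm = Llama(
    model_path="./EXAONE-Deep-7.8B-Q4_K_M.gguf",  # placeholder path to your GGUF file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to the GPU; set to 0 for CPU-only inference
)

output = llm(
    "Solve the following math problem:\n"
    "For the function f(x) = 3x^2 + 2x - 5, find the value of f(2).\nAnswer:",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])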
Deployment with HuggingFace Transformers
The Hugging Face Transformers library provides the simplest way to load and use EXAONE-Deep models. The basic installation and usage steps are as follows:
Install Dependencies
pip install torch transformers accelerate
Load and Use the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_id = "LGAI-EXAONE/EXAONE-Deep-32B" # Can also be 7.8B or 2.4B version
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True  # the EXAONE-Deep repositories ship custom model code on the Hub
)
# Prepare input
prompt = """Solve the following math problem:
For the function f(x) = 3x^2 + 2x - 5, find the value of f(2).
Answer:"""
# Generate response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
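EXAONE-Deep is tuned as a reasoning/chat model, so results are generally better when the input is formatted with the tokenizer's chat template rather than a raw completion prompt. A minimal sketch, reusing the model and tokenizer loaded above and assuming the checkpoint ships a chat template (the official EXAONE-Deep repositories do):

# Format the request with the chat template instead of a plain prompt
messages = [
    {"role": "user", "content": "For the function f(x) = 3x^2 + 2x - 5, find the value of f(2)."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn so the model starts answering
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))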
Using Quantized Versions
For memory-constrained environments, you can use the AWQ quantized version:
pip install autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load AWQ quantized model
model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
Hardware Recommendations
| Model | Full Precision | With AWQ (4-bit) | With GGUF (Q4_K_M) |
|---|---|---|---|
| EXAONE-Deep 32B | 48GB+ VRAM (A100, H100) | 12GB+ VRAM (RTX 4090, A10) | 16GB RAM for CPU, 8GB+ VRAM |
| EXAONE-Deep 7.8B | 12GB+ VRAM (RTX 4090, A10) | 4GB+ VRAM (RTX 3060, T4) | 8GB RAM for CPU, 4GB+ VRAM |
| EXAONE-Deep 2.4B | 4GB+ VRAM (RTX 3060, T4) | 2GB+ VRAM (GTX 1660) | 4GB RAM for CPU, 2GB+ VRAM |
Note: These are minimum requirements. For optimal performance, having more memory than the specified minimums is recommended.
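If you are not sure which tier your machine falls into, a quick check of total GPU memory from Python (assuming PyTorch with CUDA support is installed) can help you pick a model size and quantization level:

# Print the name and total VRAM of each visible CUDA device
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; consider the 2.4B model or a GGUF build for CPU inference.")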
Integration with LangChain and LlamaIndex
EXAONE-Deep can be easily integrated with popular frameworks like LangChain and LlamaIndex for advanced applications:
LangChain Integration
pip install langchain langchain-huggingface
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model
model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
# Create a Transformers text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1
)
# Wrap the pipeline as a LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe)
# Build a simple prompt-to-LLM chain
template = """Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)
llm_chain = prompt | llm
# Run the chain
question = "What are the key differences between supervised and unsupervised learning?"
print(llm_chain.invoke({"question": question}))
LlamaIndex Integration
pip install llama-index llama-index-llms-huggingface
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.huggingface import HuggingFaceLLM

# Load documents from a local folder
documents = SimpleDirectoryReader("./data").load_data()

# Initialize the LLM and register it as the default for LlamaIndex
Settings.llm = HuggingFaceLLM(
    model_name="LGAI-EXAONE/EXAONE-Deep-7.8B",
    tokenizer_name="LGAI-EXAONE/EXAONE-Deep-7.8B",
    device_map="auto",
    context_window=2048,
    max_new_tokens=512,
    generate_kwargs={"temperature": 0.7, "top_p": 0.9},
    model_kwargs={"trust_remote_code": True}
)

# Build the index
# Note: the default embedding model is OpenAI's and requires an API key;
# set Settings.embed_model to a local embedding model to run fully offline.
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the key points about EXAONE-Deep?")
print(response)