EXAONE-Deep Deployment Guide
Deployment Overview
EXAONE-Deep models support multiple deployment methods, from local workstations to cloud servers and from high-performance GPUs to resource-constrained devices. You can choose the deployment option that matches your hardware and use case. This guide provides detailed instructions on how to deploy and use EXAONE-Deep models in each of these scenarios.
Supported Model Versions
- EXAONE-Deep 32B
Full-size version, delivers the highest performance
- EXAONE-Deep 7.8B
Mid-size version, balances performance and resource requirements
- EXAONE-Deep 2.4B
Lightweight version, suitable for resource-constrained environments
Supported Quantization Formats
- FP16/BF16
Full-precision version, provides the highest accuracy
- AWQ (4-bit)
Activation-aware weight quantization, significantly reduces memory requirements
- GGUF
Portable format for llama.cpp and similar inference frameworks (see the sketch after this list)
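For the GGUF path, the sketch below shows one way to run a local GGUF build of EXAONE-Deep with the llama-cpp-python bindings (pip install llama-cpp-python). The file name and the GPU-offload setting are illustrative placeholders; point model_path at whatever EXAONE-Deep GGUF conversion you have downloaded and tune n_gpu_layers to your available VRAM.

# Minimal sketch, assuming llama-cpp-python is installed and an EXAONE-Deep GGUF file
# exists locally; the path below is a placeholder, not an official artifact name.
from llama_cpp import Llama

llm = Llama(
    model_path="./EXAONE-Deep-7.8B-Q4_K_M.gguf",  # placeholder path to your GGUF file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to the GPU; set to 0 for CPU-only inference
)

output = llm(
    "Solve the following math problem:\n"
    "For the function f(x) = 3x^2 + 2x - 5, find the value of f(2).\nAnswer:",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])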
Deployment with HuggingFace Transformers
The Hugging Face Transformers library provides the simplest way to load and use EXAONE-Deep models. The basic installation and usage steps are as follows:
Install Dependencies
pip install torch transformers accelerate
Load and Use the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_id = "LGAI-EXAONE/EXAONE-Deep-32B" # Can also be 7.8B or 2.4B version
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True  # the EXAONE-Deep repositories ship custom model code on the Hub
)
# Prepare input
prompt = """Solve the following math problem:
For the function f(x) = 3x^2 + 2x - 5, find the value of f(2).
Answer:"""
# Generate response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
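EXAONE-Deep is tuned as a reasoning/chat model, so results are generally better when the input is formatted with the tokenizer's chat template rather than a raw completion prompt. A minimal sketch, reusing the model and tokenizer loaded above and assuming the checkpoint ships a chat template (the official EXAONE-Deep repositories do):

# Format the request with the chat template instead of a plain prompt
messages = [
    {"role": "user", "content": "For the function f(x) = 3x^2 + 2x - 5, find the value of f(2)."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn so the model starts answering
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))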
Using Quantized Versions
For memory-constrained environments, you can use the AWQ quantized version:
pip install autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load AWQ quantized model
model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
Hardware Recommendations
| Model | Full Precision | With AWQ (4-bit) | With GGUF (Q4_K_M) |
|---|---|---|---|
| EXAONE-Deep 32B | 48GB+ VRAM (A100, H100) | 12GB+ VRAM (RTX 4090, A10) | 16GB RAM for CPU, 8GB+ VRAM |
| EXAONE-Deep 7.8B | 12GB+ VRAM (RTX 4090, A10) | 4GB+ VRAM (RTX 3060, T4) | 8GB RAM for CPU, 4GB+ VRAM |
| EXAONE-Deep 2.4B | 4GB+ VRAM (RTX 3060, T4) | 2GB+ VRAM (GTX 1660) | 4GB RAM for CPU, 2GB+ VRAM |
Note: These are minimum requirements. For optimal performance, having more memory than the specified minimums is recommended.
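If you are not sure which tier your machine falls into, a quick check of total GPU memory from Python (assuming PyTorch with CUDA support is installed) can help you pick a model size and quantization level:

# Print the name and total VRAM of each visible CUDA device
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; consider the 2.4B model or a GGUF build for CPU inference.")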
Integration with LangChain and LlamaIndex
EXAONE-Deep can be easily integrated with popular frameworks like LangChain and LlamaIndex for advanced applications:
LangChain Integration
pip install langchain langchain-huggingface
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model
model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
# Create a Transformers text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1
)
# Wrap the pipeline as a LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe)
# Build a simple prompt-to-LLM chain
template = """Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)
llm_chain = prompt | llm
# Run the chain
question = "What are the key differences between supervised and unsupervised learning?"
print(llm_chain.invoke({"question": question}))
LlamaIndex Integration
pip install llama-index llama-index-llms-huggingface
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.huggingface import HuggingFaceLLM

# Load documents from a local folder
documents = SimpleDirectoryReader("./data").load_data()

# Initialize the LLM and register it as the default for LlamaIndex
Settings.llm = HuggingFaceLLM(
    model_name="LGAI-EXAONE/EXAONE-Deep-7.8B",
    tokenizer_name="LGAI-EXAONE/EXAONE-Deep-7.8B",
    device_map="auto",
    context_window=2048,
    max_new_tokens=512,
    generate_kwargs={"temperature": 0.7, "top_p": 0.9},
    model_kwargs={"trust_remote_code": True}
)

# Build the index
# Note: the default embedding model is OpenAI's and requires an API key;
# set Settings.embed_model to a local embedding model to run fully offline.
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the key points about EXAONE-Deep?")
print(response)