EXAONE-Deep

EXAONE-Deep Deployment Guide

Deployment Overview

EXAONE-Deep models can be deployed in a wide range of environments, from local machines to cloud servers and from high-performance GPUs to resource-constrained devices, so you can choose the solution that fits your needs. This guide explains how to deploy and use EXAONE-Deep models in each of these scenarios.

Supported Model Versions

  • EXAONE-Deep 32B

    Full version, provides highest performance

  • EXAONE-Deep 7.8B

    Medium size, balances performance and resource requirements

  • EXAONE-Deep 2.4B

    Lightweight version, suitable for resource-constrained environments

Supported Quantization Formats

  • FP16/BF16

    Full precision version, provides highest accuracy

  • AWQ (4-bit)

    Activation-aware weight quantization, significantly reduces memory requirements

  • GGUF

    Universal format for llama.cpp-based inference runtimes (see the sketch below this list)
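
GGUF builds are intended for llama.cpp-based runtimes rather than Transformers. The sketch below uses the llama-cpp-python bindings as one way to run such a build; the model path is an assumed local file, and the exact filename and quantization level depend on which GGUF release you download.

python
# Minimal sketch: run a GGUF build of EXAONE-Deep via llama-cpp-python
# (install with: pip install llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="./EXAONE-Deep-2.4B-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

output = llm(
    "Solve the following math problem:\nFor f(x) = 3x^2 + 2x - 5, find f(2).\n\nAnswer:",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
)
print(output["choices"][0]["text"])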

Deployment with HuggingFace Transformers

The Hugging Face Transformers library provides the simplest way to load and use EXAONE-Deep models. The basic installation and usage steps are as follows:

Install Dependencies

bash
pip install torch transformers accelerate

Load and Use the Model

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "LGAI-EXAONE/EXAONE-Deep-32B" # Can also be 7.8B or 2.4B version
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True  # EXAONE checkpoints ship a custom model class on the Hub
)

# Prepare input
prompt = """Solve the following math problem:
For the function f(x) = 3x^2 + 2x - 5, find the value of f(2).

Answer:"""

# Generate response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # enable sampling so temperature/top_p take effect
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
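
If the checkpoint ships a chat template with its tokenizer, you can format the input with apply_chat_template instead of a raw prompt string. The sketch below is a minimal variant of the example above and reuses the model and tokenizer already loaded; the exact formatting depends on the template bundled with the checkpoint.

python
# Minimal sketch: chat-style prompting via the tokenizer's chat template
messages = [
    {"role": "user", "content": "For the function f(x) = 3x^2 + 2x - 5, find f(2)."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker before generating
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))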

Using Quantized Versions

For memory-constrained environments, you can use the AWQ quantized version:

bash
pip install autoawq

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the AWQ-quantized model
model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
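
To verify how much memory the quantized weights actually occupy after loading, you can use Transformers' built-in get_memory_footprint helper; the printed number will vary with the checkpoint and any CPU offloading.

python
# Report the loaded model's weight footprint (bytes converted to GiB)
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Model weight footprint: {footprint_gib:.1f} GiB")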

Hardware Recommendations

Model            | Full Precision             | With AWQ (4-bit)           | With GGUF (Q4_K_M)
EXAONE-Deep 32B  | 48GB+ VRAM (A100, H100)    | 12GB+ VRAM (RTX 4090, A10) | 16GB RAM for CPU, 8GB+ VRAM
EXAONE-Deep 7.8B | 12GB+ VRAM (RTX 4090, A10) | 4GB+ VRAM (RTX 3060, T4)   | 8GB RAM for CPU, 4GB+ VRAM
EXAONE-Deep 2.4B | 4GB+ VRAM (RTX 3060, T4)   | 2GB+ VRAM (GTX 1660)       | 4GB RAM for CPU, 2GB+ VRAM

Note: These are minimum requirements. For optimal performance, having more memory than the specified minimums is recommended.
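
To map this table onto your own machine, the sketch below reads the GPU's total VRAM with PyTorch and suggests a model size. The thresholds mirror the table above, and the helper itself is purely illustrative.

python
import torch

# Illustrative helper: suggest an EXAONE-Deep variant from available VRAM,
# using the minimums from the table above.
def suggest_model_size() -> str:
    if not torch.cuda.is_available():
        return "CPU only: use a GGUF build (2.4B or 7.8B) with llama.cpp"
    vram_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gib >= 48:
        return "EXAONE-Deep 32B (full precision)"
    if vram_gib >= 12:
        return "EXAONE-Deep 7.8B (full precision) or 32B with AWQ"
    if vram_gib >= 4:
        return "EXAONE-Deep 2.4B (full precision) or 7.8B with AWQ"
    return "EXAONE-Deep 2.4B with AWQ or a GGUF build"

print(suggest_model_size())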

Integration with LangChain and LlamaIndex

EXAONE-Deep can be easily integrated with popular frameworks like LangChain and LlamaIndex for advanced applications:

LangChain Integration

bash
pip install langchain langchain-community

python
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model
model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1
)

# Create LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe)

# Use in LangChain
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

# Run the chain
question = "What are the key differences between supervised and unsupervised learning?"
print(llm_chain.run(question))
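
On recent LangChain releases, LLMChain is deprecated in favour of the pipe (LCEL) composition syntax; the equivalent chain looks like the sketch below, reusing the prompt and llm objects defined above.

python
# LCEL-style composition: pipe the prompt template into the LLM
chain = prompt | llm

question = "What are the key differences between supervised and unsupervised learning?"
print(chain.invoke({"question": question}))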

LlamaIndex Integration

bash
pip install "llama-index<0.10"

python
# This example uses the legacy (pre-0.10) llama-index API with ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Initialize LLM
llm = HuggingFaceLLM(
    model_name="LGAI-EXAONE/EXAONE-Deep-7.8B",
    tokenizer_name="LGAI-EXAONE/EXAONE-Deep-7.8B",
    device_map="auto",
    context_window=2048,
    max_new_tokens=512,
    generate_kwargs={"temperature": 0.7, "top_p": 0.9}
)

# Create service context (embed_model="local" selects a local HuggingFace embedding
# model instead of the OpenAI default, which would require an API key)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Build index
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the key points about EXAONE-Deep?")
print(response)
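
To avoid re-embedding the documents on every run, the index can be persisted to disk and reloaded later. This is a minimal sketch using the same legacy (pre-0.10) llama-index API as above; "./storage" is an arbitrary directory.

python
# Persist the vector index and reload it without re-embedding the documents
from llama_index import StorageContext, load_index_from_storage

index.storage_context.persist(persist_dir="./storage")

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)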