RAG proof of concept with Llama 3 and ChromaDB

yundream
2024-06-16
2024-06-16

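### Introduction
**RAG (Retrieval Augmented Generation)** is a technique that combines the strengths of generation-based and retrieval-based models to produce high-quality responses. Its main goal is to ground responses in specialized domain information, producing high-quality text that a stock LLM would struggle to generate on its own.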

Here we implement RAG using Ollama, Meta's Llama 3, and the ChromaDB vector database. For background on Ollama, vector databases, and RAG, see the documents below.
1. [LLM with Joinc - Setting up an LLM environment on a personal PC](https://www.joinc.co.kr/w/ollama_setting)
2. [LLM with Joinc - Vector embedding basics](https://www.joinc.co.kr/w/vector_embedding_basic)
3. [LLM with Joinc - LangChain and RAG](https://www.joinc.co.kr/w/LangChain-intro)

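### Reviewing the Llama 3 environment
This post assumes that you have already installed Ollama and the Llama 3 model on a local PC with Docker, following [LLM with Joinc - Setting up an LLM environment on a personal PC](https://www.joinc.co.kr/w/ollama_setting). The Ollama container is named ollama. Let's check that the Llama 3 model is installed.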
```
$ docker exec -it ollama ollama list                  
NAME                    	ID          	SIZE  	MODIFIED    
llama3:latest           	365c0bd3c000	4.7 GB	12 days ago	
```
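### RAG process
The figure below depicts the RAG process.
![RAG process](https://docs.google.com/drawings/d/e/2PACX-1vR5rgkzl-X5nzEIdvsjanxTkkpiy80fcpBXAucTb08kCmR5QfmUCf1b6jV3KG5dol6VZrdg0I-ZDY-4/pub?w=1152&h=528)
1. Prepare and load documents such as PDF, text, or Word files.
2. Split the documents into appropriately sized pieces.
3. Vectorize the pieces with an embedding model.
4. Store the vectorized pieces in a vector database.
5. When the user asks a question, vectorize the question.
6. Query the vector database.
7. The vector database returns the relevant documents.
8. The LLM collects the relevant documents and
9. generates a response using a prompt template.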
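Steps 1 through 7 can be sketched in a few lines of plain Python. This is only a toy model of the flow: the `embed` function below counts words instead of calling a real embedding model, the "vector database" is a Python list, and the sample document and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(document: str) -> list[tuple[str, Counter]]:
    """Steps 1-4: split on blank lines, embed each chunk, store (chunk, vector) pairs."""
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 1) -> list[str]:
    """Steps 5-7: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical two-paragraph document used only for this sketch.
document = (
    "Pinecone is a managed vector database for similarity search.\n\n"
    "A pine cone is the seed-bearing organ of a pine tree."
)
index = build_index(document)
top = retrieve(index, "Which vector database is managed?")
print(top[0])
```

A real embedding model replaces the word counts with dense vectors that capture meaning rather than exact word overlap, but the load, split, embed, store, and query order stays the same.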

### Preparing the embedding model
With Ollama you can run an embedding model locally. As of 2024, Ollama provides four embedding models; this post uses **nomic-embed-text**.

| Model                    | Pull                                      | Ollama Registry Link                                                        |
| ------------------------ | ----------------------------------------- | --------------------------------------------------------------------------- |
| `nomic-embed-text`       | `ollama pull nomic-embed-text`            | [nomic-embed-text](https://ollama.com/library/nomic-embed-text)             |
| `mxbai-embed-large`      | `ollama pull mxbai-embed-large`           | [mxbai-embed-large](https://ollama.com/library/mxbai-embed-large)           |
| `snowflake-arctic-embed` | `ollama pull snowflake-arctic-embed`      | [snowflake-arctic-embed](https://ollama.com/library/snowflake-arctic-embed) |
| `all-minilm-l6-v2`       | `ollama pull chroma/all-minilm-l6-v2-f32` | [all-minilm-l6-v2-f32](https://ollama.com/chroma/all-minilm-l6-v2-f32)      |
```
$ docker exec -it ollama ollama pull nomic-embed-text 
$ docker exec -it ollama ollama list                 
NAME                    	ID          	SIZE  	MODIFIED       
llama3:latest           	365c0bd3c000	4.7 GB	12 days ago   	
mxbai-embed-large:latest	468836162de7	669 MB	9 days ago    	
nomic-embed-text:latest 	0a109f422b47	274 MB	26 seconds ago	
```
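To check that the pulled model actually returns vectors, you can call Ollama's HTTP API directly. The sketch below assumes the server listens on `http://localhost:11434` (the address used by the code in this post) and that the `/api/embeddings` endpoint accepts a `model` and a `prompt` field and answers with an `embedding` array; the helper names are hypothetical, and newer Ollama versions may prefer a different endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed address of the local Ollama server


def build_embedding_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build the POST request for Ollama's embeddings endpoint without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list[float]:
    """Send the request and return the embedding vector (needs a running Ollama server)."""
    with urllib.request.urlopen(build_embedding_request(text)) as resp:
        return json.loads(resp.read())["embedding"]
```

With the container running, `len(embed("What is pinecone"))` should print the dimension of the vector that the model produces.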

### Vectorize
Now let's store a document in the vector database. This process is called **document vectorization** and consists of four broad steps.
![Vectorize](https://docs.google.com/drawings/d/e/2PACX-1vSWFPfv5zVyiBjSjqXnnGXeJtEjM3MsIkKfxGjdOiZH0OFTireQhEC9WRjpWdTRcatDbD7JSyVCNjOy/pub?w=1162&h=348)
**Load** Load the document that will be stored in the vector database. This post uses [Vector Database: What is it and why you should know it?](https://medium.com/@EjiroOnose/vector-database-what-is-it-and-why-you-should-know-it-ae7e7dca82a4#:~:text=Vector%20databases%20can%20be%20used,not%20all%20vectors%20are%20embeddings) as the test document. It can be downloaded from the [joinc github](https://github.com/joinc-channel/llm-with-joinc/tree/main/ollama_rag_with_cormadb).

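**Split** Break the document into smaller units (chunks) of a fixed size. The reasons are as follows.
* Memory and performance: LLMs generally need a lot of memory. Processing a long document in one pass can sharply increase memory usage. Splitting it into chunks lets each chunk be processed individually, which uses memory more efficiently.
* Preserving context: LLMs limit the length of their input. GPT-4, for example, is limited to 8K or 32K tokens, so it cannot process a long document in one pass. Splitting the document into chunks lets the model process it while keeping the context within each chunk.
* Parallelism: Chunks can be processed in parallel, which can greatly improve processing speed.
* Accuracy: As a document grows, the model may find it harder to follow the topic consistently. Splitting it into chunks allows more consistent processing.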

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

if __name__ == "__main__":
    print("Ingesting...")
    # Load the plain-text test document.
    loader = TextLoader("./medium_blog.txt")
    document = loader.load()

    print("splitting...")
    # Target chunks of about 1000 characters with no overlap between chunks.
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(document)
    print("splitting completing", len(texts))
```
* The **langchain_community.document_loaders** package is a module for loading and processing documents in various formats such as PDF, Word, HTML, and text.
* TextLoader: loads plain text documents.
* CharacterTextSplitter: The chunk size is set to 1000, so the splitter targets chunks of up to 1000 characters. The chunk overlap is set to 0. Chunk overlap sets the number of characters shared by adjacent chunks; it preserves continuity between chunks so that context is not lost.
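To make the effect of `chunk_overlap` concrete, here is a minimal sketch that slices a string into fixed-width windows. It is not how CharacterTextSplitter works internally (that class splits on a separator first); the `chunk_text` function is a hypothetical helper that only illustrates how overlap makes neighbouring chunks share characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slice text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap of 2, every chunk begins with the last two characters of the previous one, so a sentence cut at a chunk boundary still appears in full in at least one chunk when the overlap is large enough.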

Let's run the code.
```
$ python app.py
Ingesting...
splitting...
Created a chunk of size 1180, which is longer than the specified 1000
Created a chunk of size 1058, which is longer than the specified 1000
splitting completing 16
```
The document was split into 16 chunks in total, and 2 of them exceeded the chunk size of 1000.
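The oversized chunks most likely appear because CharacterTextSplitter cuts only at its separator (a blank line by default) and then merges neighbouring pieces up to the chunk size, so a single paragraph longer than 1000 characters is kept whole. The function below is a simplified model of that behaviour written for this explanation, not the library's actual implementation.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on the separator, then greedily merge pieces while they fit in chunk_size."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
        # A piece longer than chunk_size is never cut; it becomes its own chunk.
        current = piece
    if current:
        chunks.append(current)
    return chunks


# Four paragraphs of 30, 120, 40 and 40 characters, with a chunk size of 100.
sample = "\n\n".join(["a" * 30, "b" * 120, "c" * 40, "d" * 40])
print([len(chunk) for chunk in split_on_separator(sample, chunk_size=100)])
```

If hard size limits matter, the `RecursiveCharacterTextSplitter` class in `langchain_text_splitters` is one option to consider, since it falls back to smaller separators when a piece is too long.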

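### Embedding and storing in the vector database
The 16 chunks are now embedded with the embedding model and stored in the vector database. Vector embedding and the vector database are easy to confuse, so a brief explanation is in order.

A **vector database** only acts as a database that performs CRUD on vector data. In other words, you have to vectorize the documents yourself, and this vectorization of documents is called **vector embedding**.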

![Vector embedding](https://docs.google.com/drawings/d/e/2PACX-1vQv1KbOMvu0qOrbuLgG4MdqgrJ7e8hLUg3XwvppeUNTSDQXGJra40lheFF221k3kkiXf1KswES8Kfhe/pub?w=723&h=319)

Embedding returns an array of numbers, and that array is what gets stored in the vector database. Let's embed the chunks and store the resulting vectors in ChromaDB.

```python
from langchain_community.document_loaders import TextLoader
# OllamaEmbeddings and Chroma are imported from langchain_community,
# where recent LangChain releases keep these integrations.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import CharacterTextSplitter

if __name__ == "__main__":
    print("Ingesting...")
    loader = TextLoader("./medium_blog.txt")
    document = loader.load()

    print("splitting...")
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(document)

    # Embedding client backed by the local Ollama server.
    oembed = OllamaEmbeddings(
        base_url="http://localhost:11434",
        model="nomic-embed-text",
    )

    # Embed every chunk and persist the vectors under ./chromadb.
    vectorstore = Chroma.from_documents(
        persist_directory="chromadb", documents=documents, embedding=oembed
    )

    # Sanity check: run a similarity search against the new store.
    question = "What is pinecone"
    docs = vectorstore.similarity_search(question)
    print(len(docs))
```
Chroma.from_documents: embeds the list of chunked documents. Running the code creates a **chromadb directory** in the current directory and stores the vector files there.

### Retrieval
![Retrieval](https://docs.google.com/drawings/d/e/2PACX-1vQp3kDJjGtV4ctV700VNyWCoBn5vKx-nC0c_8b6eochUqBVnix6LkvVwIbyk5KQhmjAQDZmVbRXK93W/pub?w=1543&h=330)
Everything is ready. Now let's test RAG. First, let's look at the response without RAG.
```
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is pinecone",
  "stream": false
}'

A pinecone!\n\nA pinecone, also known as a pinyon or piñon, is the reproductive structure of certain species of pine trees (genus Pinus). It's a specialized organ that produces seeds, also known as cones. Here are some fun facts about pinecones:\n\n1. **Seed production**: Pinecones produce seeds, which are an important food source for many animals, such as birds, squirrels, and deer.\n2. **Shape and size**: Pinecones can vary in shape (e.g., cylindrical, oval) and size (from 2 to 10 inches or 5 to 25 cm long). The shape and size depend on the pine species.\n3. **Seed structure**: Each seed is attached to a modified branch called a scale, which is covered with a tough, waxy coating. The scales are often in a spiral pattern, helping to spread seeds when they open.\n4
```
As you can see, the response describes pine cones. What we want is not pine cones but information about the Pinecone vector database. Let's run the code below.

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Use the same embedding model that was used to build the store.
embedding = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="nomic-embed-text",
)
# Open the persisted Chroma database and search it.
db = Chroma(persist_directory="chromadb", embedding_function=embedding)
docs = db.similarity_search("What is pinecone")
print(docs)
```
* OllamaEmbeddings: loads the nomic-embed-text embedding model.
* Chroma: creates a Chroma database instance. The database directory is chromadb.
* similarity_search: performs a similarity search based on vector embeddings.

The result is shown below.
```
[Document(page_content='Pinecone is designed to be fast and scalable, allowing for efficient retrieval of similar data points based on their vector representations.\nIt can handle large-scale ML applications with millions or billions of data points.\nPinecone provides infrastructure management or maintenance to its users.\nPinecone can handle high query throughput and low latency search.\nPinecone is a secure platform that meets the security needs of businesses and organizations.\nPinecone is designed to be user-friendly and accessible via its simple API for storing and retrieving vector data, making it easy to integrate into existing ML workflows... // omitted
... in the cloud.\n\n\nFeatures:', metadata={'source': './medium_blog.txt'})]
```
You can see that the search correctly returns content about the Pinecone vector database. It also returns the source of the document.
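The remaining steps of the RAG process, passing the retrieved chunks to the LLM through a prompt template, can be sketched as follows. This is one possible shape rather than the method used in this post: the template wording and the `build_prompt` and `generate` helpers are assumptions, while the `/api/generate` endpoint and its `model`, `prompt`, and `stream` fields are the ones used in the curl example above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same local Ollama server as the curl example

# Hypothetical prompt template; adjust the wording to your own needs.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that you do not know.

Context:
{context}

Question: {question}
"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with the retrieved chunks and the user's question."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to Ollama and return the generated text (needs a running server)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]


prompt = build_prompt(
    "What is pinecone",
    [
        "Pinecone is designed to be fast and scalable, allowing for efficient "
        "retrieval of similar data points based on their vector representations."
    ],
)
print(prompt)
```

With the `docs` list returned by `similarity_search`, the call would look like `generate(build_prompt("What is pinecone", [d.page_content for d in docs]))`, which gives Llama 3 the Pinecone passages as context instead of leaving it to guess about pine cones.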
### Summary
This post put together a simple RAG setup using Ollama, Llama 3, ChromaDB, and LangChain. The next post will look at sLLMs and at prompt engineering for getting responses in the desired form.
