Leveraging Clustering and Vector Selection for Cost-Effective LLM Summarisation Tasks
Are you grappling with the limitations of context window lengths in Large Language Models (LLMs)? Struggling to condense volumes of text within token limits, and facing soaring costs when summarising lengthy documents like books, scripts, reviews or research papers?
Problem
The issue at hand is the limitation of context window length in Large Language Models (LLMs). Each LLM is restricted by a predetermined context length, meaning prompts must stay within this limit. Additionally, the use of longer contexts incurs greater cost.
Let’s understand this with an example
Say you are given a summarisation task. It could be summarising a book, a long research paper, or a bunch of reviews. In this example we will consider a use case where we summarise reviews of a movie called Interstellar (one of my all-time favourites). The summarised review should fully encompass what is said across all the reviews.
Setup
I have prepared this JSON file (with help from ChatGPT) which contains 206 reviews of Interstellar. To help readers understand the structure of the JSON, here is what the file looks like:
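(A minimal sketch rather than the actual file; the loader below only relies on a top-level "reviews" array whose items carry a "review" key, and any other fields are illustrative assumptions.)
{
  "movie": "Interstellar",
  "reviews": [
    {"review": "A mind-bending journey through the cosmos ..."},
    {"review": "Visually stunning and emotionally gripping ..."}
  ]
}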
Let’s load up the JSON file and see how many tokens we are talking about if we want to summarise all the reviews.
from langchain import OpenAI
from google.colab import userdata
from langchain_community.document_loaders import JSONLoader

llm = OpenAI(temperature=0, openai_api_key=userdata.get('OPENAI_KEY'))

# Picking all reviews from the list of reviews
loader = JSONLoader(file_path="movie_review.json", jq_schema=".reviews[].review", text_content=False)
docs = loader.load()

# Concatenate every review into one string and count its tokens
text = ""
for doc in docs:
    text += doc.page_content

print(llm.get_num_tokens(text))  # 10329
That’s a whopping ~10K tokens. First of all, I won’t be able to fit 10K tokens into GPT-4’s 8K context window models (e.g. gpt-4 and gpt-4-0314). Secondly, it would cost us about $0.60 with GPT-4’s 32K context window models (e.g. gpt-4-32k and gpt-4-32k-0314). And that is just for prompt tokens; we are not even talking about sampled tokens yet.
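A quick back-of-the-envelope check of that figure (assuming gpt-4-32k’s prompt pricing of $0.06 per 1K tokens at the time of writing):
prompt_tokens = 10329          # tokens in all 206 reviews combined
price_per_1k_prompt = 0.06     # USD per 1K prompt tokens for gpt-4-32k (pricing assumption)
print(f"${prompt_tokens / 1000 * price_per_1k_prompt:.2f}")  # $0.62, prompt tokens only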
You could imagine how expensive it can get when summarising long texts. Given this situation, summarising extended documents like books or research papers, which may exceed 100K tokens, could become prohibitively costly.
What can be done to reduce cost?
Solution
Without much deliberation, we can assume that a method which skips a few tokens yet still represents a holistic but diverse view of all reviews would be helpful.
Objective: Get vector embeddings of each review, then pick a subset of vectors which represents a holistic but diverse view of all reviews. In other words, is there a way to pick the top 10 reviews that describe all the reviews best?
Once we have a subset of vectors which represents all reviews, we can summarise the documents behind those vectors to get a pretty good summary of all reviews.
Keep in mind that this is not a deterministic but a probabilistic solution. There can still be some information loss, but then that’s the essence of summarisation: you lose some information.
This can be done. Let me explain how.
We can follow this simple algorithm. Let’s call it Centroid Based Vector Selection (CBVS). Here are the steps of this algorithm:
- Embedding — Embed each review to get vectors.
- Clustering — Cluster the vectors to see which are similar to each other. These are the reviews that likely are similar.
- Vector selection — Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)
- Summarising selected docs — Summarise the documents that these embeddings represent. You will end up with a subset of summarised reviews.
- Summarisation of summaries — Create a summary of summaries to get a final summary.
Let’s go through each step of the CBVS algorithm and see how we can implement it. We will be using langchain.
1. Embedding — Embed each review to get vectors.
We will be using langchain’s OpenAIEmbeddings. This will use OpenAI’s text-embedding-ada-002 model to create embeddings with 1536 dimensions.
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key=userdata.get('OPENAI_KEY'))
vectors = embeddings.embed_documents([x.page_content for x in docs])
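A quick sanity check on the shapes (we expect one 1536-dimensional vector per review):
print(len(vectors), len(vectors[0]))  # 206 1536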
2. Clustering — Cluster the vectors to see which are similar to each other. These are the reviews that likely are similar.
Now let’s cluster our embeddings. There are a ton of clustering algorithms you can choose from. Please try a few out to see what works best for you!
import numpy as np
from sklearn.cluster import KMeans

# With some trial and error, I found ~8 clusters to be optimal for my use case
num_clusters = 8

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
print(kmeans.labels_)
Output-
array([0, 5, 4, 5, 5, 4, 3, 5, 0, 5, 3, 7, 5, 5, 5, 1, 2, 5, 1, 5, 0, 0,
0, 0, 5, 0, 0, 0, 7, 1, 0, 0, 2, 0, 5, 2, 0, 1, 0, 2, 6, 3, 3, 0,
4, 3, 3, 4, 2, 3, 4, 2, 6, 5, 0, 7, 6, 4, 4, 2, 7, 3, 4, 4, 6, 4,
2, 6, 4, 2, 3, 4, 4, 6, 4, 2, 6, 4, 2, 6, 4, 4, 6, 4, 2, 6, 4, 2,
7, 5, 0, 0, 5, 0, 7, 1, 7, 0, 4, 2, 3, 1, 7, 7, 0, 3, 7, 1, 0, 3,
7, 4, 1, 7, 3, 2, 7, 0, 1, 2, 0, 1, 0, 7, 4, 2, 7, 7, 2, 0, 0, 3,
7, 4, 1, 7, 3, 2, 7, 0, 1, 2, 0, 1, 0, 7, 4, 1, 2, 6, 0, 1, 5, 0,
0, 5, 3, 1, 7, 4, 1, 4, 1, 7, 2, 7, 1, 4, 1, 2, 1, 7, 1, 7, 7, 1,
7, 1, 2, 1, 4, 7, 7, 1, 7, 1, 7, 7, 1, 4, 1, 7, 7, 7, 1, 7, 1, 7,
1, 7, 1, 1, 1, 1, 1, 1], dtype=int32)
Let’s see what the distribution of labels looks like:
arr = kmeans.labels_
unique, counts = np.unique(arr, return_counts=True)
distribution = dict(zip(unique, counts))
print(distribution)
Output-
{0: 33, 1: 38, 2: 24, 3: 16, 4: 29, 5: 17, 6: 11, 7: 38}
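The choice of ~8 clusters above came from trial and error. If you’d like something more systematic, here is a minimal sketch (the candidate range of k is an arbitrary assumption) that scans a few values of k and scores each with scikit-learn’s silhouette_score:
from sklearn.metrics import silhouette_score

# A higher silhouette score means tighter, better-separated clusters
for k in range(4, 13):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(vectors)
    print(k, round(silhouette_score(vectors, labels), 3))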
Let’s also plot these on a graph. Since these are vectors of 1536 dimensions, we first need to reduce the dimensionality to 2.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
vectors = np.array(vectors)
# Perform t-SNE and reduce to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
reduced_data_tsne = tsne.fit_transform(vectors)
# Plot the reduced data
plt.scatter(reduced_data_tsne[:, 0], reduced_data_tsne[:, 1], c=kmeans.labels_)
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Interstellar reviews embedding Clustered')
plt.show()
Output: a 2-D scatter plot of the clustered review embeddings.
This is pretty neat. We are heading in the right direction. Let’s move on to the next step.
3. Vector selection — Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)
We will pick the embedding closest to the centroid of each cluster, since I assume that represents the most average meaning of that cluster. Based on your use case, you can tweak which embeddings you pick from each cluster.
# Find the closest embeddings to the centroids
closest_indices = []

# Loop through the number of clusters you have
for i in range(num_clusters):
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)
    # Append that position to your closest indices list
    closest_indices.append(closest_index)
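If a single review per cluster feels too lossy for your use case, a small variant (a sketch; taking two per cluster is an illustrative assumption, not part of the original method) picks the k nearest vectors to each centroid instead:
# Take the 2 nearest reviews to each centroid instead of only the closest
top_k = 2
closest_indices_k = []
for i in range(num_clusters):
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    closest_indices_k.extend(np.argsort(distances)[:top_k].tolist())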
4. Summarising selected docs — Summarise the documents that these embeddings represent. You will end up with a subset of summarised reviews.
I’m going to initialize two models, GPT-3.5 and GPT-4. I’ll use GPT-3.5 for the first set of summaries to reduce cost, and then GPT-4 for the final pass, which should increase the quality.
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

llm3 = ChatOpenAI(temperature=0,
                  openai_api_key=userdata.get('OPENAI_KEY'),
                  max_tokens=1000,
                  model='gpt-3.5-turbo'
                  )
map_prompt = """
You will be given a single review of a movie. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this review so that a reader will have a full understanding of what the review is talking about.
Your response should be at least 50 words and fully encompass what was said in the passage.
```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

map_chain = load_summarize_chain(llm=llm3,
                                 chain_type="stuff",
                                 prompt=map_prompt_template)
# Then go get the docs which the top vectors represented
closest_docs = [docs[doc] for doc in closest_indices]

# Loop through our selected docs and get a good summary for each one. We'll store the summaries in a list.
summary_list = []

for i, doc in enumerate(closest_docs):
    # Get a summary of the review
    chunk_summary = map_chain.run([doc])
    # Append that summary to the list
    summary_list.append(chunk_summary)
    print(f"Summary #{i} (chunk #{closest_indices[i]}) - Preview: {chunk_summary[:250]} \n")
Output —
Summary #0 (chunk #36) - Preview: The review of "Interstellar" praises the movie for its mind-bending journey through the cosmos, challenging human perception. It highlights the intricate storytelling and breathtaking visuals that create an immersive experience, transporting audience
Summary #1 (chunk #202) - Preview: The review praises "Interstellar" as a visually stunning and emotionally gripping sci-fi epic that takes audiences on a breathtaking journey through space and time. It highlights Christopher Nolan's visionary direction, stellar performances, and the
Summary #2 (chunk #199) - Preview: The review praises "Interstellar" as a thought-provoking and visually mesmerizing sci-fi epic that delves into space, time, and the human condition. It highlights the film's stunning visuals, emotional depth, and philosophical pondering, creating a p
Summary #3 (chunk #137) - Preview: The review describes "Interstellar" as an ambitious and visually stunning film that takes viewers on a journey through space and time. It explores themes of love, sacrifice, and the resilience of the human spirit. The emotional resonance and stunning
Summary #4 (chunk #77) - Preview: The review praises "Interstellar" for its visually stunning and intellectually engaging qualities that immerse viewers in a world of wonder and mystery. It highlights the film's exploration of time, space, and the human condition as profound and deep
Summary #5 (chunk #78) - Preview: The review praises the movie "Interstellar" for being a powerful and emotionally gripping film that delves into themes of love, sacrifice, and the human spirit. It commends the visually stunning and intellectually stimulating aspects of the film. Add
Summary #6 (chunk #136) - Preview: The review of "Interstellar" praises the movie for its captivating mix of science fiction elements and emotional storytelling. It highlights the film's ability to engage viewers in contemplating the mysteries of the universe and the human spirit. The
Summary #7 (chunk #43) - Preview: The review praises "Interstellar" as a cinematic masterpiece that explores the mysteries of the universe while maintaining a strong human connection. It highlights Hans Zimmer's score as a perfect complement to the visuals, enhancing the immersive ex
Summary #8 (chunk #67) - Preview: The review praises "Interstellar" as a visually stunning and emotionally impactful film that breaks new ground in storytelling. It highlights the outstanding performances, particularly Anne Hathaway's powerful portrayal. Overall, the review suggests
Summary #9 (chunk #193) - Preview: The review praises "Interstellar" as a visually stunning and emotionally resonant sci-fi epic that explores themes of time, space, and human connection. It is described as intellectually stimulating and deeply moving, transcending the boundaries of c
Summary #10 (chunk #17) - Preview: The review of Christopher Nolan's Interstellar praises the movie for its breathtaking exploration of the universe and its blend of cosmic wonder with intimate human drama. The reviewer highlights the film's existential themes, stunning visuals, and p
5. Summarisation of summaries — Create a summary of summaries to get a final summary.
Let’s now create a final summary of summaries
from langchain.schema import Document

summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print(f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")
Output-
Your total summary has 755 tokens
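At 755 tokens, the combined summaries now fit comfortably in the standard 8K gpt-4 context window, so the pricier 32K model is no longer needed for the final pass.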
llm4 = ChatOpenAI(temperature=0,
                  openai_api_key=userdata.get('OPENAI_KEY'),
                  max_tokens=3000,
                  model='gpt-4',
                  request_timeout=120
                  )
combine_prompt = """
You will be given a series of summaries of a movie. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of all reviews.
The reader should be able to grasp all intricacies of the reviews, and the summary should represent a holistic but diverse view of all reviews.
```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

reduce_chain = load_summarize_chain(llm=llm4,
                                    chain_type="stuff",
                                    prompt=combine_prompt_template,
                                    verbose=True  # Set this to True if you want to see the inner workings
                                    )
output = reduce_chain.run([summaries])
print(output)
Output- Final summary
The reviews of "Interstellar" universally laud the film for its ambitious and groundbreaking exploration of space, time, and the human condition. The movie is consistently praised for its visually stunning and emotionally gripping narrative that takes audiences on a breathtaking journey through the cosmos. The intricate storytelling, coupled with the film's breathtaking visuals, creates an immersive experience that transports viewers into a world of wonder and mystery.
The film's director, Christopher Nolan, is commended for his visionary direction, which is seen as instrumental in creating a deeply moving and thought-provoking cinematic experience. The performances, particularly those of Matthew McConaughey and Anne Hathaway, are highlighted as stellar, contributing significantly to the emotional depth and resonance of the film.
The exploration of profound themes such as love, sacrifice, and the resilience of the human spirit is a recurring point of praise in the reviews. These themes, combined with the film's intellectual engagement, are seen as contributing to a cinematic experience that is both intellectually satisfying and emotionally impactful.
The film's ability to engage viewers in contemplating the mysteries of the universe and the human spirit is also highly praised. The reviews suggest that the movie leaves viewers in a state of wonder and contemplation even after it ends, resonating with them on a profound level.
The film's score, composed by Hans Zimmer, is highlighted as a perfect complement to the visuals, enhancing the immersive experience for viewers. The film's blend of science fiction elements with emotional storytelling is seen as a captivating mix that makes "Interstellar" a noteworthy addition to the genre.
Overall, "Interstellar" is described as a cinematic masterpiece that successfully combines stunning visuals with emotional depth, breaking new ground in storytelling. The film is seen as transcending the boundaries of conventional science fiction, making it a true cinematic gem that resonates long after it ends. The reviews suggest that the movie is a must-watch for its daring and imaginative storytelling, exceptional acting, and its ability to challenge human perception.
Conclusion
- Looking at the final summary generated, it seems pretty good.
- Cost should be reduced roughly by a factor of the number of clusters; approximately a factor of 11 in this case (see the quick arithmetic after this list).
- This algorithm/technique can be used to summarise books, scripts, research papers and a lot more.
- There is some loss of information, but then show me a summary which doesn’t lose any information.
- Choosing the vector closest to each cluster’s centroid was a personal choice. You can choose different vectors depending on your specific use case.
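Using the token counts from this post, here is a minimal sketch of that cost comparison (it only compares GPT-4 prompt tokens and ignores the much cheaper GPT-3.5 first pass):
full_prompt_tokens = 10329    # all 206 reviews stuffed into a single prompt
reduced_prompt_tokens = 755   # the summary-of-summaries prompt after CBVS
print(f"{full_prompt_tokens / reduced_prompt_tokens:.1f}x fewer GPT-4 prompt tokens")  # 13.7x
This lands in the same ballpark as the factor-of-11 estimate above.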
Reference
Here is the link to Google colab notebook containing all code described above — https://colab.research.google.com/drive/1vk2RFMZJjCtXHOnXVRwfPgC9yKQYHrsZ?usp=sharing
Here is the Github link to notebook file — https://github.com/iamvishalkhare/cbvs/blob/main/Leveraging_Clustering_and_Vector_Selection_for_Cost_Effective_LLM_Summarisation_Tasks%20copy.ipynb