
Step-by-Step Guide to Deploying a Large Language Model (LLM) on Your Computer


    Large language models (LLMs) are a type of machine learning model that can be used for a wide range of natural language processing (NLP) tasks. In the past, LLMs were too large to be used on local computers, but recent advances in technology have made it possible to run them locally. This means that developers and researchers can now experiment with LLMs without having to worry about the computing resources required.

There are a number of benefits to using LLMs on local computers. First, it allows for more flexibility and control over the model. Developers can choose the specific LLM that they want to use, and they can also fine-tune the model to their specific needs. Second, it can be more cost-effective: running an LLM on a local computer can be much cheaper than paying for a cloud-based service. Third, it can be more secure, since developers can keep their data private by keeping it on their own machines instead of sending it to a third-party service.

There are also some challenges to using LLMs on local computers. First, they can require substantial computing resources: LLMs are large and complex, so they need a lot of memory and processing power. Second, they can be difficult to install and configure, and often require specialized knowledge to set up and use.

    Overall, the benefits of using LLMs on local computers outweigh the challenges. LLMs are a powerful tool for NLP, and the ability to run them locally gives developers and researchers more flexibility, control, and cost-effectiveness.

    Here are some specific examples of how LLMs can be used for NLP tasks on local computers:

    • Text generation: LLMs can be used to generate text, such as news articles, blog posts, and creative writing.
    • Question answering: LLMs can be used to answer questions about text, such as “What is the capital of France?”
    • Sentiment analysis: LLMs can be used to determine the sentiment of text, such as whether it is positive, negative, or neutral.
    • Machine translation: LLMs can be used to translate text from one language to another.
    • Information retrieval: LLMs can be used to find information in text, such as documents, websites, and social media posts.

     

    These are just a few examples of the many ways that LLMs can be used for NLP tasks on local computers. As LLMs continue to develop, they will become even more powerful and versatile tools for NLP.
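As a small illustration of running a language model locally, the sketch below uses the text-generation pipeline from the transformers library. The model name gpt2 is only an example of a small model that fits on most machines; any locally available causal language model would work.

from transformers import pipeline

# Load a small, locally runnable model; the first call downloads and caches the weights.
generator = pipeline("text-generation", model="gpt2")

# Generate a short continuation of a prompt, entirely on the local machine.
result = generator("Large language models can be used to", max_new_tokens=30)
print(result[0]["generated_text"])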

To get started with an LLM for your own use case, we will be using the quantized Llama-2 model. First, we need to install and import the necessary libraries.
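If the packages are not already installed, they can usually be added with pip, for example pip install transformers auto-gptq (exact versions, and a CUDA-enabled build of PyTorch, depend on your environment). With the packages in place, the imports are: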

import csv  # used later to read the input abstracts and write the generated text

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

Next, we specify the name of the model we are going to use. It is an open-source LLM available on Hugging Face (https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ):

model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"

We will define a variable called tokenizer and initialise it with the AutoTokenizer class. The tokenizer breaks text down into tokens (sub-word units and punctuation marks) that the model can process. The from_pretrained() method of the AutoTokenizer class loads a pre-trained tokenizer from a model name or path. The use_fast argument is set to True so that the faster, Rust-based tokenizer implementation is used where one is available.

Next, we create a variable called model and initialise it with the AutoGPTQForCausalLM class, which loads GPTQ-quantized language models that can be used for tasks such as text generation, question answering, and summarization. The from_quantized() method loads a quantized model from a model name or path. The arguments are:

• use_safetensors=True – load the weights from a safetensors file, a serialization format that avoids the security problems of pickle-based checkpoints.
• trust_remote_code=True – allow any custom model code shipped with the repository to run.
• device="cuda:0" – load the model on the GPU with index 0.
• use_triton – whether to use the Triton kernel backend; here the use_triton variable is set to False.
• quantize_config=None – the quantization settings are read from the quantize_config.json file shipped with the model rather than supplied manually.
• model_basename (optional, not passed here) – the name of a specific quantized model file to load, if the default cannot be found automatically.

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

The generate_text() function takes a prompt as input and generates text based on it. The function first wraps the prompt in a prompt template surrounded by special tokens. These tokens mark the system prompt and the user instruction, telling the model where the instruction ends and where its answer should begin.

The function then converts the prompt template into a sequence of token IDs by calling the tokenizer, which maps text to token IDs.

The token IDs are passed to model.generate(), which produces a sequence of new token IDs conditioned on the input. The temperature parameter controls the randomness of the generated text, and max_new_tokens limits how many tokens are generated. Finally, the generated token IDs are decoded back into text with tokenizer.decode().

def generate_text(prompt):
    # Wrap the user prompt in the Llama-2 chat template with a system prompt.
    prompt_template = f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]'''

    # Tokenize the prompt and move the token IDs to the GPU.
    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    # Silence library warnings and decode the generated token IDs back into text.
    logging.set_verbosity(logging.CRITICAL)
    return tokenizer.decode(output[0])
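
As a quick sanity check, the function can be called directly; the question below is just an illustrative prompt:

# Illustrative example: ask the model a simple question and print the reply.
print(generate_text("What is the capital of France?"))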

     

The code first opens the input CSV file and creates a reader over its rows. Each row is read as a dictionary keyed by the column headers, and the Abstract column holds an abstract from a research paper.

The code then opens the output CSV file and creates a writer object for it. The writer object is used to write the generated text to the output file.

The code then loops over the rows of the input file. For each row, it takes the abstract and passes it to the generate_text() function, which generates a new piece of text based on the abstract.

Finally, the code writes the abstract and the generated text as a row of the output file.

     

def main():
    # The input file is assumed to have a header row with an 'Abstract' column.
    with open('input_data.csv', 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        # output_file_path is a placeholder; set it to wherever the results should be written.
        with open(output_file_path, 'w', newline='') as csvout:
            writer = csv.writer(csvout, delimiter=',')
            for row in reader:
                prompt = row['Abstract']
                output = generate_text(prompt)
                writer.writerow([prompt, output])

if __name__ == '__main__':
    main()
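
Note that input_data.csv, the output file path, and the Abstract column header are placeholders; adjust them to match your own data. The resulting output file contains one row per abstract, with the original abstract in the first column and the model's generated text in the second.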

     
