Category: Python

  • Unstructured data to clean text for LLM

    There are many document processing libraries and services that convert various kinds of data (documents, images, audio, video, etc.) into clean text that can be used with LLMs.

    Some popular libraries are:

    • Textractor
    • unstructured.io
    • LlamaParse
    • llms.txt

    A new open-source Python library that converts documents to a clean text format for use with LLMs:

    MarkItDown – from Microsoft

    A Python library that transforms most documents into clean Markdown. It supports a huge number of data formats: PDF, PPTX, DOCX, XLSX, images, audio, HTML, text formats (CSV, JSON, XML), ZIP files, etc. The library offers smart content recognition for document structures, media processing for images and audio, OCR support, and more.

    Installing MarkItDown for your project

    $ pip install markitdown

    Basic implementation

    from markitdown import MarkItDown

    markdown = MarkItDown()                   # create a converter instance
    content = markdown.convert('demo.docx')   # the same call works for other formats
    print(content.text_content)               # the clean Markdown text
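
    Since the result is plain Markdown text, it can be written straight to a file or dropped into an LLM prompt. A minimal sketch, assuming the same API as above (demo.pdf and out.md are hypothetical file names):

    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert('demo.pdf')   # hypothetical input file

    # Persist the clean Markdown for later use, e.g. in a RAG pipeline
    with open('out.md', 'w', encoding='utf-8') as f:
        f.write(result.text_content)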
    (more…)
  • KV and KV Caching in Transformers

    Self-attention is the key technique that makes Transformers great. It allows the model to pay attention to different parts of the input sequence as it generates the next token. It transforms each input token into a context vector, which combines information from all the inputs in the given text. KV caching is only present in decoder-only models like GPT, or in the decoder part of encoder-decoder models.

    Context vector calculation involves three components: query, key, and value.

    We will look at the details of how it is computed.

    Let's first look at how we can load GPT2 on our system, and then use that example to understand KV and KV caching.

    One interesting related topic is understanding KV caching and how privacy considerations could impact it.

    Self-attention is achieved by transforming the input sequence into three vectors: queries Q, keys K, and values V. The attention mechanism calculates a weighted sum of the values based on the similarity between the query and key vectors; this result, along with the original input, is passed through a feed-forward NN to produce the final output.
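
    For reference, this is the standard scaled dot-product attention formula from the original Transformer paper, where d_k is the dimension of the key vectors:

    \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]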

    Loading GPT2 model

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')

    text = "She eats apple"
    encoded_input = tokenizer(text, return_tensors='pt')

    # Deterministic beam search (do_sample=False), returning two sequences
    output = model.generate(encoded_input['input_ids'],
                            max_length=10,
                            num_return_sequences=2,
                            num_beams=2,
                            do_sample=False,
                            temperature=0.0,
                            pad_token_id=tokenizer.eos_token_id)

    for i, sample in enumerate(output):
        decoded_text = tokenizer.decode(sample, skip_special_tokens=True)
        print(f"Sample {i + 1}: {decoded_text}\n")

    This will generate two different samples of predicted text from the GPT2 model:

    Sample 1: She eats apple pie, but he's not a
    
    Sample 2: She eats apple pie, and I'm not sure

    We can see that the next token generated after “She eats apple” is “pie”. We will look into how it is calculated.

    The important part of the generation process to focus on is temperature=0.0 together with do_sample=False, which means the model generates the same thing every time rather than a probabilistic result, i.e. the output is deterministic. There are other ways to achieve this as well, such as setting the seed to a static value, but this is easier for now.
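
    For reference, the seed-based alternative mentioned above could look like the sketch below (42 is an arbitrary choice; do_sample=True is used so the seed actually matters):

    from transformers import set_seed

    set_seed(42)   # fixes the Python, NumPy and PyTorch RNGs
    sampled = model.generate(encoded_input['input_ids'], max_length=10,
                             do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)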

    To understand how the model generates the next tokens, we have to get into the workings of the various layers in the model. To do so, we need the model to return the attention weights for the input text. Since the generate() function only uses the model to predict the next tokens and does not return the attention weights, we will instead call the model's forward pass directly.

    # Forward pass on model including attentions
    output = model(**encoded_input, output_attentions=True)
    
    # attention weight - one tensor per layer
    attentions = output.attentions
    
    # attention tensor shape: (batch_size, num_heads, seq_length, seq_length)
    # For simplicity, let's print the first head of each attention layer
    for layer_num, layer_attention in enumerate(attentions):
        print(f"Layer {layer_num + 1} Attention:")
        # layer_attention is of shape (batch_size, num_heads, seq_length, seq_length)
        # Get the attention values for the first head
        attention_head = layer_attention[0, 0, :, :].detach().numpy()
        print(attention_head)
        print()
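
    The excerpt above looks at attention weights, but the KV cache itself is exposed through the same forward pass. Below is a minimal sketch of incremental decoding with past_key_values, reusing the model and tokenizer loaded earlier. Note this is a hand-rolled greedy loop, not the internals of generate() (which uses the cache automatically), so its continuation may differ from the beam-search samples above:

    # First forward pass over the full prompt: K and V are computed and cached
    out = model(**encoded_input, use_cache=True)
    past_key_values = out.past_key_values   # one cached (K, V) pair per layer

    # Greedily pick the next token from the last position's logits
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    pieces = [tokenizer.decode(next_token[0])]

    # Later steps feed ONLY the new token; cached K/V cover all earlier ones,
    # so keys and values for past tokens are never recomputed
    for _ in range(5):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        pieces.append(tokenizer.decode(next_token[0]))

    print(text + ''.join(pieces))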
    (more…)
  • Saving trained model

    In most cases, training a model on our dataset is a very time-consuming and processing-hungry job, which makes it a costly task. What we want, both when testing in our development environment and when using the trained model in production, is to avoid training it multiple times.

    If you have done an ML project, you will have understood how time- and processor-consuming the task is, even when done on GPUs. For an application to train the model each time it runs is unacceptable, so we can save the current trained state of the model for later use without retraining it on the same dataset again and again.

    We can accomplish this in Python using packages such as:

    • Pickle (Python Object Serialization Library)
    • Joblib (the serialization utility recommended by scikit-learn; see the quick sketch after this list)
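
    As a quick preview, here is a minimal sketch of saving and restoring a scikit-learn model with joblib. The model and file name are made up for illustration:

    from joblib import dump, load
    from sklearn.linear_model import LogisticRegression

    # A tiny stand-in for your real, expensive training run
    model = LogisticRegression()
    model.fit([[0, 0], [1, 1]], [0, 1])

    dump(model, 'model.joblib')        # persist the trained state to disk

    restored = load('model.joblib')    # reload later without retraining
    print(restored.predict([[1, 1]]))  # -> [1]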

    Pickle?

    You might have heard this term somewhere when going through ML articles or doing projects. This library is popular for serialization (pickling) and deserialization (unpickling). Pickling is the process of converting any Python object into a stream of bytes following the object hierarchy. Unpickling is the process of converting the pickled stream of bytes back into the original Python object, following the same object hierarchy.

    Example:

    Serialization (Pickling)

    import pickle

    pickle_file = 'string_list_pickle.pkl'
    names = ['apple', 'ball', 'cat']

    # Write the list to disk as a pickled byte stream
    with open(pickle_file, 'wb') as store_pickle:
        pickle.dump(names, store_pickle)
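
    For completeness, the reverse step (unpickling) reads those bytes back into the original object. A minimal sketch using the file written above:

    import pickle

    with open('string_list_pickle.pkl', 'rb') as f:
        restored_names = pickle.load(f)

    print(restored_names)  # ['apple', 'ball', 'cat']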
    (more…)
  • Simple Neural Network for absolute beginners

    This is my first post regarding artificial intelligence (AI). But I promise to include as much as I can, from understanding a Simple Neural Network (NN) to deep learning, with a little theory but lots of practical implementation. I will also include simple projects where possible. Let's begin building then.

    I like Python a lot, so most of the work will be done in Python. Later on I am hoping to develop it in Java and Scala as well.

    To learn NNs we will not be using any NN libraries, only a mathematical library, i.e. numpy.

    Learn the basics of Numpy HERE.

    To begin building a NN, which is supposed to mimic how our brain works, we have to understand a little bit about our own brain.

    An average-sized brain contains about 100 billion neurons connected by synapses. Neurons are the basic units of the brain and play the major role in all the tasks the brain performs. Blah blah blah… it's better you go through this well-written article (A Basic Introduction To Neural Networks).

    In this tutorial we will build an artificial unit of this very neuron. I assume you know matrices, which we will use as the mathematical foundation for building the NN with numpy.

    Our simple ANN will have three inputs and one output (Input: 3, Output: 1). The neuron we build will solve a basic classification problem. We will use various training algorithms to train our neuron for classification.

    So our neuron will have a very small training dataset (a deep learning model would need a very large dataset for better performance), which will be enough for this problem.

    Example #   Input A   Input B   Input C   Output Y
    1           0         0         1         0
    2           1         1         1         1
    3           1         0         1         1
    4           0         1         1         0
    5           1         0         0         ??? (1 expected)

    So what will be the output for the last data row (Row #5)? A minimal numpy sketch of such a neuron follows.
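
    Here is a minimal sketch of one way to train such a neuron in numpy, assuming a sigmoid activation and a simple gradient-descent-style weight update (one choice among the training algorithms mentioned above):

    import numpy as np

    # Training data from the table above (rows 1-4)
    X = np.array([[0, 0, 1],
                  [1, 1, 1],
                  [1, 0, 1],
                  [0, 1, 1]])
    y = np.array([[0, 1, 1, 0]]).T

    np.random.seed(1)
    weights = 2 * np.random.random((3, 1)) - 1   # random weights in [-1, 1)

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    for _ in range(10000):
        output = sigmoid(X.dot(weights))   # forward pass
        error = y - output
        # adjust weights by the error scaled by the sigmoid gradient
        weights += X.T.dot(error * output * (1 - output))

    # Predict row #5: [1, 0, 0] -> close to 1, as expected
    print(sigmoid(np.array([1, 0, 0]).dot(weights)))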

    (more…)
  • Bloom filter – Python

    A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. A lookup tells us either that the element is possibly in the set or that it is definitely not: it may produce false positive matches, but never false negatives.

    The most common use of the Bloom Filter algorithm is to check whether an element is on disk before performing any operations. This reduces the I/O for lookups dramatically over large datasets.

    Consider that we have to check an email address against a large dataset of millions of emails. Searching all the emails in memory is definitely very inefficient and takes a lot of time. Instead, we can create a Bloom filter bit array, which is very small compared to the original dataset and gives almost the same required result. This is what is used in Yahoo Mail: when you log into Yahoo Mail, the browser page requests a Bloom filter representing your contact list from Yahoo's servers. The Bloom filter is compact and easily fits in your browser cache. An email address is then verified by checking whether it exists in your contact list's filter.

    In another scenario, suppose we have to get unique counts in a dataset. We can use a Bloom filter to test whether a certain pattern or element has already been seen in the dataset. Of course this creates some false positives, but it can be much more efficient than comparing everything in memory.

    Apache HBase uses Bloom filters to boost read speed by filtering out unnecessary disk reads of HFile blocks that do not contain a particular row or column.

    Quora implemented a sharded Bloom filter in the feed backend to filter out stories that people have seen before. It is much faster and more memory efficient than previous solutions (Redis, Tokyo Cabinet and DB) and saves hundreds of ms on certain types of requests. (Tao Xu, Engineer at Quora)

    Transactional Memory (TM) systems have recently applied Bloom filters to detect memory access conflicts among threads. (Mark Jeffrey, who modeled Bloom filters for concurrency conflict detection)

    Not to mention Facebook (Typeahead Search Tech Talk, 6/15/2010), LinkedIn (Cleo: the open source technology behind LinkedIn’s typeahead search | LinkedIn Engineering), and Bit.ly (dablooms – an open source, scalable, counting bloom filter library) have implemented their own Bloom filters.

    More examples? Go here: https://en.wikipedia.org/wiki/Bloom_filter#Examples

    Ok then, enough about the uses of Bloom filters. We will be developing our own for searching email addresses from a list in Python. Here is a minimal sketch of the core idea to get started.
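
    This sketch derives its k hash positions by salting a single SHA-256 hash, purely for illustration; the full implementation in the rest of the post may make different choices:

    import hashlib

    class BloomFilter:
        def __init__(self, size=10_000, num_hashes=7):
            self.size = size               # number of bit positions (m)
            self.num_hashes = num_hashes   # number of hash functions (k)
            self.bits = bytearray(size)    # one byte per bit, for simplicity

        def _positions(self, item):
            # Derive k positions by salting one hash function with a seed
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def __contains__(self, item):
            # True means "possibly present"; False means "definitely absent"
            return all(self.bits[pos] for pos in self._positions(item))

    emails = BloomFilter()
    emails.add("alice@example.com")
    print("alice@example.com" in emails)   # True
    print("carol@example.com" in emails)   # False (or a rare false positive)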

    (more…)