Tag: code

  • Unstructured data to clean text for LLM

    There are many document processing libraries and services that convert various types of documents – text documents, images, audio, video, etc. – into clean text that can be used with LLMs.

    Some popular libraries and services are:

    • Textractor
    • unstructured.io
    • LlamaParse
    • llms.txt

    A new open-source Python library that converts documents to a clean text format for use with LLMs:

    MarkItDown – from Microsoft

    A Python library that transforms any (well, most) documents into clean Markdown. It supports a huge number of data formats: PDF, PPTX, DOCX, XLSX, images, audio, HTML, text formats (CSV, JSON, XML), ZIP files, etc. The library has smart content recognition for document structures, media processing for images/audio, OCR support, etc.

    Installing MarkItDown for your project

    $ pip install markitdown

    Basic implementation

    from markitdown import MarkItDown
    markdown = MarkItDown()
    content = markdown.convert('demo.docx')
    print(content.text_content)
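    Beyond a single file, a whole folder of documents can be converted in a loop. This is just a sketch: the docs/ folder and the output file naming are placeholder assumptions, not part of MarkItDown itself.

    ```python
    from pathlib import Path

    from markitdown import MarkItDown

    md = MarkItDown()
    for path in Path("docs").glob("*.docx"):       # e.g. report.docx, notes.docx
        result = md.convert(str(path))             # same .convert() call as above
        # write the Markdown next to the script, one .md per input file
        Path(path.stem + ".md").write_text(result.text_content)
    ```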
  • Implementation of Differential Privacy (Part 1)

    There are various differentially private algorithms that support Differential Privacy:

    • Bounded Mean
    • Bounded Sum
    • Laplace Mechanism
    • Exponential Mechanism
    • Private Histogram
    • Secure Multi-Party Computation (Secure MPC)
    • Differentially Private SGD

    Of these algorithms, Secure MPC is one that I have worked with in multiple other projects, within the domains of cryptography and blockchain. In the interest of privacy within ML models, we will also go through Differentially Private SGD in detail. We will be using libraries to approach the implementation within our applications.

    Let's first go through a few DP concepts, like Function Sensitivity, Privacy Loss, Privacy Budget, etc.

    Function Sensitivity – signifies how sensitive a function is to changes in its input: the maximum change in the output of the function when a single individual's data in the input dataset is altered.

    Privacy Loss – the extent to which the addition or removal of a single individual's data in the dataset can influence the output of a query or computation.

    Privacy Budget – ϵ (epsilon) – the total privacy spending allowed while maintaining acceptable privacy guarantees.
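    To make these three concepts concrete, here is a minimal sketch of the Laplace Mechanism using only the Python standard library. The ages data and the epsilon value are made up for the example.

    ```python
    import math
    import random

    def laplace_mechanism(true_value, sensitivity, epsilon):
        """Add Laplace(0, sensitivity/epsilon) noise to a query result."""
        b = sensitivity / epsilon              # higher epsilon -> less noise
        u = random.random() - 0.5              # uniform on (-0.5, 0.5)
        # inverse-CDF sampling of the Laplace distribution
        noise = -b * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        return true_value + noise

    # A counting query has sensitivity 1: adding or removing one
    # individual changes the count by at most 1.
    ages = [23, 35, 41, 29, 50]
    noisy_count = laplace_mechanism(len(ages), sensitivity=1, epsilon=0.5)
    print(noisy_count)
    ```

    Spending more of the privacy budget (a larger epsilon) shrinks the noise scale b, so the answer gets more accurate but leaks more about any single individual.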

    Having followed OpenMined from the very early days when it was founded, I was very fond of understanding how ML and blockchain come together – Decentralized AI. Other companies like SingularityNET were around, however OpenMined was more interesting to me. OpenMined combines Federated Learning with Homomorphic Encryption (HE) and blockchain to enable collaborative machine learning applications in a decentralized fashion.

    Libraries on DP:

    We will go through multiple libraries for implementing DP, starting with IBM's differential privacy library (diffprivlib) – which I think is a very straightforward and simple implementation.

    IBM Differential Privacy

    Installation

    $ pip install diffprivlib
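    Based on diffprivlib's documented API, a first example might look like the following. The ages data, the epsilon value, and the bounds are placeholder values for illustration.

    ```python
    import numpy as np
    from diffprivlib import tools

    ages = np.array([23, 35, 41, 29, 50])

    # Differentially private mean: the bounds clip the data to a known
    # range, which fixes the sensitivity the mechanism has to account for.
    dp_mean = tools.mean(ages, epsilon=1.0, bounds=(18, 80))
    print(dp_mean)
    ```

    Running this repeatedly gives a slightly different answer each time, since noise calibrated to epsilon is added to the true mean.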
  • KV and KV Caching in Transformers

    Self-attention is the key technique that makes Transformers great. It allows the model to pay attention to different parts of the input sequence as it generates the next token. It transforms each input token into a context vector, which combines information from all the inputs in the given text. KV caching is only present in decoder-only models like GPT, or in the decoder part of encoder-decoder models.

    Context vector calculation involves three components – query, key and value.

    We will look into the details of how it is computed.

    Let's first look at how we can load GPT-2 on our system, and then use that example to understand KV and KV caching.

    One interesting topic would be understanding KV caching and how privacy considerations could impact it.

    Self-attention is achieved by transforming the input sequence into three vectors – Queries Q, Keys K, and Values V. The attention mechanism calculates a weighted sum of the values based on the similarity between the query and key vectors; this, along with the original input, is passed through a feed-forward NN to produce the final result.
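    The description above can be sketched as a toy single-head attention in NumPy, showing that caching K and V gives exactly the same result as recomputing them from scratch. The dimensions and random projection matrices here are illustrative stand-ins for the trained weights.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d = 4  # toy head dimension

    # Random matrices standing in for the learned projections W_Q, W_K, W_V
    W_q = rng.standard_normal((d, d))
    W_k = rng.standard_normal((d, d))
    W_v = rng.standard_normal((d, d))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attend(q, K, V):
        # weighted sum of values, weights from query-key similarity
        scores = (K @ q) / np.sqrt(d)      # one score per cached position
        return softmax(scores) @ V

    # Token embeddings for a toy 3-token prefix
    tokens = rng.standard_normal((3, d))

    # Without caching: recompute K and V for the whole prefix at every step
    K_full = tokens @ W_k
    V_full = tokens @ W_v
    out_no_cache = attend(tokens[-1] @ W_q, K_full, V_full)

    # With caching: each token's K and V rows are computed once and appended
    K_cache, V_cache = [], []
    for t in tokens:
        K_cache.append(t @ W_k)
        V_cache.append(t @ W_v)
    out_cached = attend(tokens[-1] @ W_q, np.array(K_cache), np.array(V_cache))

    print(np.allclose(out_no_cache, out_cached))  # True: same output, less recompute
    ```

    This is the whole point of the KV cache: K and V for past tokens never change, so a decoder-only model can store them and only compute the new token's row at each generation step.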

    Loading GPT2 model
    from transformers import GPT2Tokenizer, GPT2LMHeadModel
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    
    text = "She eats apple"
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model.generate(
        encoded_input['input_ids'],
        max_length=10,
        num_return_sequences=2,
        num_beams=2,
        do_sample=False,
        temperature=0.0,
        pad_token_id=tokenizer.eos_token_id,
    )
    
    for i, sample in enumerate(output):
        decoded_text = tokenizer.decode(sample, skip_special_tokens=True)
        print(f"Sample {i + 1}: {decoded_text}\n")

    This will generate two different samples of predicted text from the GPT-2 model:

    Sample 1: She eats apple pie, but he's not a
    
    Sample 2: She eats apple pie, and I'm not sure

    We can see that the next text generated after “She eats apple” is “pie”. We will look into how it is calculated.

    The important part of the generation process that we need to focus on is do_sample=False with temperature=0.0, which means the model will generate the same output every time instead of a probabilistic result, i.e. the result is deterministic, so to say. There are other ways to do this as well, such as setting a static random seed, however this is easier for now.

    To understand how the model generates the next tokens, we will have to get into the workings of the various layers of the model. To do so, we will have to make the model return the attention weights for the input text. Since the generate() function only focuses on using the model to predict the next tokens and does not return those attention weights, we will call the model's forward pass directly instead.

    # Forward pass on model including attentions
    output = model(**encoded_input, output_attentions=True)
    
    # attention weight - one tensor per layer
    attentions = output.attentions
    
    # attention tensor shape: (batch_size, num_heads, seq_length, seq_length)
    # For simplicity, let's print the first attention layer and the first head
    for layer_num, layer_attention in enumerate(attentions):
        print(f"Layer {layer_num + 1} Attention:")
        # layer_attention is of shape (batch_size, num_heads, seq_length, seq_length)
        # Get the attention values for the first head
        attention_head = layer_attention[0, 0, :, :].detach().numpy()
        print(attention_head)
        print()
  • Syncing bash history across sessions

    export PROMPT_COMMAND="history -a; history -n"

    I always work with multiple sessions, usually 5-10 of them. I work in one session and then switch to another, and when I try Ctrl + R there, the command is missing because the bash histories are not synced, so I have to go back to the original session and copy the command.
    The more frustrating encounter is when you remember the keywords of a command but used it long ago. With lots of un-synced sessions, you will surely lose that history.

    The command above provides just that – it syncs bash history across multiple bash sessions.

    • PROMPT_COMMAND – the commands in this variable are executed every time the bash prompt is shown, i.e. after each command you run
    • history -a – appends the new command to the bash history file
    • history -n – reads any unread lines from the bash history file into the current session
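    For completeness, a fuller ~/.bashrc setup might look like the following. The histappend option and the size values shown are common choices, not requirements.

    ```shell
    # Append to the history file on shell exit instead of overwriting it
    shopt -s histappend

    # Example sizes: how many commands to keep in memory and on disk
    export HISTSIZE=100000
    export HISTFILESIZE=200000

    # Sync: append each command as it runs, read new ones from other sessions
    export PROMPT_COMMAND="history -a; history -n"
    ```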

  • Rust : Important Concepts

    Void Functions vs Diverging Functions

    There are two kinds of functions that don't return a value:

    #1.  fn void_function() {} 
    
    #2.  fn diverging_function() -> ! {}

    #1 A void function having no return type annotation doesn't mean it returns nothing; it returns the unit type (). This is equivalent to void in C/C++.

    fn void_function() -> () {}

    #2 A diverging function is guaranteed to never return control to its caller, for example because it panics or loops forever.

    Examples of diverging functions:

    fn diverging_function() -> ! {
       panic!()
    }
    
    fn diverging_function_2() -> ! {
        loop{}
    }

    Comparing with other languages, this is similar to throwing an exception. In a language like Java, even if a method declares a return type, it may instead throw an exception; the exception is different from the declared output type, but Java doesn't have a way to mark the method itself as diverging. Rust, instead, has a special construct to specify that nothing is returned, i.e. to mark the function as diverging, which has a similar effect.

    public String getString(String val) {
        if (val == null) {
            throw new RuntimeException("val is null");
        }
        return new String(val);
    }
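    One practical consequence in Rust, sketched below: because ! coerces to any other type, a diverging expression can fill one branch of a match while another branch supplies the actual value. The function name here is made up for illustration.

    ```rust
    fn unwrap_or_die(val: Option<i32>) -> i32 {
        match val {
            Some(v) => v,
            // `panic!()` has type `!`, which coerces to i32, so the match
            // still type-checks even though this arm never produces a value
            None => panic!("no value"),
        }
    }
    ```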



    Idea for next blog: Stack Unwinding