Self-attention is the key technique that makes Transformers work so well. It allows the model to pay attention to different parts of the input sequence as it generates the next token, transforming each input token into a context vector that combines information from all the tokens in the given text. KV caching only applies to decoder-only models like GPT, or to the decoder part of encoder-decoder models.
Computing the context vector involves three components: query, key, and value.
We will look at how it is computed in detail.
Let's first look at how to load GPT2 on our system, and then use that example to understand KV and KV caching.
One interesting topic along the way is KV caching and how it could impact privacy.
Self-attention is achieved by transforming the input sequence into three sets of vectors: queries Q, keys K, and values V. The attention mechanism computes a weighted sum of the values based on the similarity between the query and key vectors; that result, together with the original input, is passed through a feed-forward neural network to produce the final output.
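As a rough illustration, here is a minimal NumPy sketch of that idea (scaled dot-product attention for a single head). The sizes and random matrices are made up purely for illustration; in a real model the projection weights W_q, W_k, W_v are learned.

import numpy as np

def softmax(x):
    # numerically stable softmax along the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
X = np.random.randn(3, 4)        # 3 input tokens, embedding size 4 (toy values)
W_q = np.random.randn(4, 4)      # projection matrices (learned in a real model, random here)
W_k = np.random.randn(4, 4)
W_v = np.random.randn(4, 4)

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values

scores = Q @ K.T / np.sqrt(K.shape[-1])    # similarity between queries and keys
weights = softmax(scores)                  # attention weights, each row sums to 1
context = weights @ V                      # weighted sum of values = context vectors
print(context.shape)                       # (3, 4): one context vector per input token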
Loading the GPT2 model
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# load the pretrained GPT2 tokenizer and language model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

text = "She eats apple"
encoded_input = tokenizer(text, return_tensors='pt')

# beam search with two beams, two returned sequences, no sampling
output = model.generate(encoded_input['input_ids'], max_length=10,
                        num_return_sequences=2, num_beams=2,
                        do_sample=False, temperature=0.0,
                        pad_token_id=tokenizer.eos_token_id)

for i, sample in enumerate(output):
    decoded_text = tokenizer.decode(sample, skip_special_tokens=True)
    print(f"Sample {i + 1}: {decoded_text}\n")
This generates two different samples of predicted text from the GPT2 model, Sample 1 and Sample 2:
Sample 1: She eats apple pie, but he's not a
Sample 2: She eats apple pie, and I'm not sure
We can see that the text generated after “She eats apple” is “pie”. We will now look into how this is calculated.
The important part of the generation call to focus on is do_sample=False together with temperature=0.0: the model generates the same output every time instead of a probabilistic result, i.e. the generation is deterministic. Another way to get repeatable results is to fix the random seed (a sketch of that follows below), but this approach is easier for now.
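For reference, the seed-based alternative could look something like this sketch; set_seed is the helper from the transformers library, and the seed value 42 is arbitrary. With the same seed, the sampled output is repeatable across runs.

from transformers import set_seed

set_seed(42)  # fixes the Python, NumPy and PyTorch random seeds
output = model.generate(encoded_input['input_ids'], max_length=10,
                        do_sample=True, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))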
To understand how the model generates the next tokens, we have to look inside its layers. To do so, we need the model to return the attention weights for the input text. Since generate() is geared towards producing the next tokens rather than exposing those attention weights, we will call the model's forward pass directly instead.
# Forward pass on the model, asking it to return attention weights
output = model(**encoded_input, output_attentions=True)

# attention weights - one tensor per layer
attentions = output.attentions

# each attention tensor has shape (batch_size, num_heads, seq_length, seq_length)
# For simplicity, let's print the first head of each layer
for layer_num, layer_attention in enumerate(attentions):
    print(f"Layer {layer_num + 1} Attention:")
    # take batch 0, head 0: a (seq_length, seq_length) matrix of attention weights
    attention_head = layer_attention[0, 0, :, :].detach().numpy()
    print(attention_head)
    print()
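To make these matrices easier to read, one option is to pair each row with the corresponding input token, for example for the first layer's first head. This is a small sketch: each row shows how strongly that token attends to the tokens before it (later positions are masked out in GPT2), and each row sums to 1 because of the softmax.

tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
first_head = attentions[0][0, 0, :, :].detach().numpy()  # layer 1, head 1
for token, row in zip(tokens, first_head):
    print(token, row.round(3))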