<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kavita Ganeshan</title>
    <description>My experiences in NLP, ML and DL</description>
    <link>https://kavlata.github.io//</link>
    <atom:link href="https://kavlata.github.io//feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 06 Apr 2023 06:55:23 +0000</pubDate>
    <lastBuildDate>Thu, 06 Apr 2023 06:55:23 +0000</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>LLM for Generative QA</title>
        <description>&lt;h3 id=&quot;llm-for-generative-qa-on-long-documents&quot;&gt;LLM for Generative QA on Long Documents&lt;/h3&gt;

&lt;p&gt;Recently I was working on a problem that required exploring LLMs for question answering. 
The task was to use an open-source LLM (not OpenAI’s GPT), hosted on a local system, for QA over custom data. 
I found a very interesting set of articles and code repositories that helped me through the process. 
This is a fast-moving topic, so the code I wrote and the articles below may be outdated within weeks. &lt;a href=&quot;https://github.com/kavlata/LLM-experiments&quot;&gt;My LLM code repo&lt;/a&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/@abonia/document-based-llm-powered-chatbot-bb316009de93&quot;&gt;This is an amazing article by Abonia Sojasingarayar&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;The author gives a basic architecture for a QA system that processes long, verbose documents, e.g. contracts, legal agreements, and clinical research papers.&lt;/li&gt;
      &lt;li&gt;The author’s codebase is a good starting point for Langchain and Chroma, though it is a little outdated given how fast the Langchain community updates the repo to keep up with the latest trends. &lt;a href=&quot;https://github.com/Abonia1/Context-Based-LLMChatbot&quot;&gt;Code&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;This article by Mick gives a good view of the constraints of using LLMs directly for a long-document QA system &lt;a href=&quot;https://medium.com/@imicknl/how-to-create-a-private-chatgpt-with-your-own-data-15754e6378a1&quot;&gt;Link&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Matt Boegner talks about a generic knowledge retrieval architecture &lt;a href=&quot;https://mattboegner.com/knowledge-retrieval-architecture-for-llms/&quot;&gt;Link&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Amazing blog post by Langchain contributors on how they updated the framework for better retrieval by providing more classes and connectors. &lt;a href=&quot;https://blog.langchain.dev/retrieval/&quot;&gt;Link&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;A couple of YouTube creators I follow for great code samples and explanations of Langchain and LLMs in general:
    &lt;ol&gt;
      &lt;li&gt;1littlecoder &lt;a href=&quot;https://www.youtube.com/playlist?list=PLpdmBGJ6ELULEfPWvvks0HtwzCvQo1zu0&quot;&gt;playlist&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Sam Witteveen &lt;a href=&quot;https://www.youtube.com/watch?v=J_0qvRt4LNk&amp;amp;list=PL8motc6AQftk1Bs42EW45kwYbyJ4jOdiZ&quot;&gt;playlist&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Interesting projects to track. There are more projects in each space, and the articles above give a broader list; the ones I have used are listed below.
    &lt;ol&gt;
      &lt;li&gt;LLM framework - Langchain&lt;/li&gt;
      &lt;li&gt;Vector stores - ChromaDB, FAISS, Elasticsearch, Milvus, Weaviate, Pinecone, Qdrant. Pinecone has a lot of traction now; I have seen a stack called OPL (OpenAI, Pinecone, and Langchain)&lt;/li&gt;
      &lt;li&gt;Connect data to LLMs - LlamaIndex&lt;/li&gt;
      &lt;li&gt;Models to try from HuggingFace - Google Flan-T5 XXL, Facebook BlenderBot 1B distill, Facebook OPT 66B, BigScience BLOOM 560m&lt;/li&gt;
      &lt;li&gt;To demo a quick version - Gradio, Streamlit&lt;/li&gt;
      &lt;li&gt;To learn more prompt templates - James Briggs &lt;a href=&quot;https://www.youtube.com/watch?v=RflBcK0oDH0&amp;amp;t=514s&quot;&gt;playlist&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;
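&lt;p&gt;A library-free sketch of the retrieve-then-read flow the articles above describe: chunk the long document, embed each chunk, retrieve the chunk closest to the question, and hand only that to the reader LLM. The bag-of-words “embedding” and fixed eight-word chunks below are toy stand-ins for the real sentence encoders and text splitters that Langchain and a vector store would provide.&lt;/p&gt;

```python
# Toy retrieve-then-read sketch: the "retrieval" half of a long-document
# QA system, with a bag-of-words stand-in for a real embedding model.
import math
import re
from collections import Counter

def embed(text):
    # Bag-of-words "embedding" - purely illustrative
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a).intersection(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunks(document, size=8):
    # Fixed-size word chunks, the stand-in for a real text splitter
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, document):
    # Return the chunk most similar to the question
    return max(chunks(document), key=lambda c: cosine(embed(question), embed(c)))

doc = ("The termination clause allows either party to exit with notice. "
       "Payment terms require invoices to be settled in thirty days.")
print(retrieve("how many days to settle an invoice", doc))
```

&lt;p&gt;In the full architecture, the retrieved chunk would be placed into a prompt template and sent to the locally hosted LLM, so the model only ever sees the small slice of the document relevant to the question.&lt;/p&gt;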
</description>
        <pubDate>Thu, 06 Apr 2023 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//LLM-for-Generative-QA</link>
        <guid isPermaLink="true">https://kavlata.github.io//LLM-for-Generative-QA</guid>
        
        
      </item>
    
      <item>
        <title>Summer School - ML and NLP</title>
        <description>
&lt;p&gt;I am so excited to share that I have been accepted to LxMLS 2021, the Lisbon Machine Learning School. I received their email today and have been on cloud nine since. 
I have been following this summer school, and a few others, for five years now, and I finally have the opportunity to attend. Since it falls during the pandemic, it actually works 
well for me: I get to attend virtually and still take care of my son and family. For a full-time working mom like me, opportunities like these are rare and a blessing. 
I thank the LxMLS folks for accepting my application. They had asked a few questions about my interests in NLP; since NLP is all I have worked on in the last few years, 
writing the application wasn’t difficult because it came from the heart. I do not have a PhD in ML/NLP, though it was always a dream, so being accepted to this school 
without one gives me immense pleasure, and a chance to meet peers in the community.&lt;/p&gt;
</description>
        <pubDate>Wed, 02 Jun 2021 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//SummerSchool-ML-NLP</link>
        <guid isPermaLink="true">https://kavlata.github.io//SummerSchool-ML-NLP</guid>
        
        
      </item>
    
      <item>
        <title>Paper Summaries on Comparison of Documents</title>
        <description>&lt;h3 id=&quot;summaries-of-literature-survey-on-comparison-of-2-documents&quot;&gt;Summaries of literature survey on “Comparison of 2 Documents”&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Paper Title : &lt;a href=&quot;https://www.ijcai.org/proceedings/2020/0613.pdf&quot;&gt;Unstructured document compliance checking&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Entity: IBM China&lt;/li&gt;
      &lt;li&gt;Publication: IJCAI 2020&lt;/li&gt;
      &lt;li&gt;Solution: multi-level deep semantic comparison
        &lt;ol&gt;
          &lt;li&gt;Sentence level -&amp;gt; GNN-based syntactic sentence encoder to capture semantic and syntactic clues of sentences
            &lt;ol&gt;
              &lt;li&gt;Dataset: banking, open domain - SNLI&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Clause level -&amp;gt; attention-based semantic relatedness detection model to find relevant legal clauses
            &lt;ol&gt;
              &lt;li&gt;Dataset: no existing data, so they annotated their own&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Paper Title: &lt;a href=&quot;https://www.aclweb.org/anthology/W19-2203.pdf&quot;&gt;Extent of repetition in Contract language&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Publication: ACL 2019&lt;/li&gt;
      &lt;li&gt;Problem: Measure the extent to which contract language in English is repetitive compared with the language of other English-language corpora&lt;/li&gt;
      &lt;li&gt;Results:
        &lt;ol&gt;
          &lt;li&gt;Contracts have far fewer rare word types and higher sentence similarity&lt;/li&gt;
          &lt;li&gt;Contracts are focused on creating arrangements between parties, similar to those created before&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Dataset: EDGAR and Google, used to extract contracts for specific categories
        &lt;ol&gt;
          &lt;li&gt;Types: Prime contracts, Subcontracts, Non-disclosure agreements, Purchase orders, Service agreements&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Paper Title: &lt;a href=&quot;http://lrec-conf.org/workshops/lrec2018/W22/pdf/9_W22.pdf&quot;&gt;Legal Document similarity using triples extracted from Unstructured text&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Problem: Find similar legal documents&lt;/li&gt;
      &lt;li&gt;Solution:
        &lt;ol&gt;
          &lt;li&gt;Created an ontology for the selected dataset&lt;/li&gt;
          &lt;li&gt;Extract triples from documents for comparison&lt;/li&gt;
          &lt;li&gt;Similarity score is based on RDF path length, following an earlier paper [LODIFIER: generates linked data from unstructured text]&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Cited paper: &lt;a href=&quot;https://www.slideshare.net/isabelleaugenstein/lodifier-generating-linked-data-from-unstructured-text&quot;&gt;LODIFIER: generating linked data from unstructured text&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Dataset:  Income Tax Act which deals with ‘Changes in constitution, succession, and dissolution of firms and partnerships’. The Sections of the act which were of interest are Section 187, 188, 188A, 189 and 189A&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Paper Title: &lt;a href=&quot;https://arxiv.org/abs/1603.03014&quot;&gt;Plagiarism - state of the art and evaluation&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Problem: Survey of methods&lt;/li&gt;
      &lt;li&gt;Solution:
        &lt;ol&gt;
          &lt;li&gt;Citation-based - characterizes a document by its citation sequence, rather than the text itself
            &lt;ol&gt;
              &lt;li&gt;language independent&lt;/li&gt;
              &lt;li&gt;reduces complexity compared to comparing full documents&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Semantic similarity - X-Sim: similarity values between texts based on statistical evaluation of word co-occurrences
            &lt;ol&gt;
              &lt;li&gt;reduce the number of documents to compare through a prior title comparison&lt;/li&gt;
              &lt;li&gt;compare within the same category&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Semantic role labelling - similarity between sentences
            &lt;ol&gt;
              &lt;li&gt;split sentences and assign their parts to different label groups&lt;/li&gt;
              &lt;li&gt;semantic annotation of terms with the help of the WordNet semantic lexicon&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Latent Semantic Indexing - probabilistic model for words based on an SVD-reduced term-document matrix
            &lt;ol&gt;
              &lt;li&gt;performance comparison of two stylometric features, namely “Average Sentence Length” and the “Honore Function”&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Cross Lingual Semantic Approach - semantically annotated graph model
            &lt;ol&gt;
              &lt;li&gt;graph characterizes a document and similarity between graphs represents similarity between documents&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Regular cross-lingual - combines lexical and syntactic features, such as character n-grams, without taking semantic annotations into consideration
            &lt;ol&gt;
              &lt;li&gt;translate a document before applying a monolingual detection approach.&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Language model - a tf-idf vector-based approach with a vector space model.
            &lt;ol&gt;
              &lt;li&gt;It also accounts for the type of plagiarism, by analyzing the length of a suspicious section with regard to summarized plagiarism&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Dataset: PAN PC competition corpora&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Paper Title: &lt;a href=&quot;https://www.researchgate.net/profile/Fawad_Hussain/publication/281775287_On_retrieving_intelligently_plagiarized_documents_using_semantic_similarity/links/5d023dc492851c874c62a60e/On-retrieving-intelligently-plagiarized-documents-using-semantic-similarity.pdf&quot;&gt;Retrieving plagiarized documents using semantic similarity&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Solution:
        &lt;ol&gt;
          &lt;li&gt;semantic similarity measure with a Nearest Neighbor (NN) search for content of document&lt;/li&gt;
          &lt;li&gt;Modified X-Sim algorithm – a supervised pruning of the similarity values, and term weighting&lt;/li&gt;
          &lt;li&gt;Multi class SVM based classifier for Titles&lt;/li&gt;
          &lt;li&gt;combine the content similarity (1) and the title classifier (3) into a plagiarism index&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Dataset: 20 Newsgroups corpus, LingSpam collection, the Reuters-21578 corpus, and the TDT corpus&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Paper Title: &lt;a href=&quot;https://arxiv.org/abs/1609.02727&quot;&gt;Detecting singleton review spammers using semantic similarity&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Publication: WWW 2015&lt;/li&gt;
      &lt;li&gt;Problem:  detecting fake reviews written by the same person using multiple names, posting each review under a different name&lt;/li&gt;
      &lt;li&gt;Solution:
        &lt;ol&gt;
          &lt;li&gt;Semantic similarity between words, aggregated to the review level
            &lt;ol&gt;
              &lt;li&gt;wordnet + cosine similarity of POS tokens&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Topic modeling that exploits the similarity of the reviews’ topic distributions, using two models:
            &lt;ol&gt;
              &lt;li&gt;bag-of-words, restricted to a limited set of POS tags&lt;/li&gt;
              &lt;li&gt;bag-of-opinion-phrases - splits a review into aspect-sentiment pairs, which are then used in the LDA model instead of the document words&lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Dataset: Yelp , Trustpilot, OTT&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Paper Title: &lt;a href=&quot;https://arxiv.org/abs/1809.01360&quot;&gt;Toward validation of Textual information Retrieval techniques for software weakness&lt;/a&gt;
    &lt;ol&gt;
      &lt;li&gt;Publication: DEXA-2018&lt;/li&gt;
      &lt;li&gt;Problem: Map unstructured software vulnerability information to distinct software weaknesses (a security-related mistake in software development)&lt;/li&gt;
      &lt;li&gt;Solution:
        &lt;ol&gt;
          &lt;li&gt;Five similarity metrics based on tf-idf variants to create a weakness matrix&lt;/li&gt;
          &lt;li&gt;Similarities between the weakness matrix and the vulnerabilities are computed with cosine similarity&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Dataset: NVD&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Other Approaches
    &lt;ol&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/research/project/dssm/&quot;&gt;Microsoft Deep Semantic Similarity Model&lt;/a&gt; - use siamese architectures and cosine similarity to rank query document pairs using sent2vec&lt;/li&gt;
      &lt;li&gt;A simpler approach to document similarity is to concatenate the semantic representations of the two documents and train a binary classifier to determine whether the documents are similar or dissimilar.&lt;/li&gt;
      &lt;li&gt;Vector based - Word embeddings and similarity
        &lt;ol&gt;
          &lt;li&gt;fastText, word2vec, GloVe, ELMo&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;
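&lt;p&gt;Several of the surveyed papers (the language-model approach, the software-weakness paper) reduce document comparison to tf-idf vectors plus cosine similarity. Here is a minimal stdlib sketch of that building block, with made-up vulnerability descriptions as sample documents; real systems would use something like scikit-learn’s TfidfVectorizer.&lt;/p&gt;

```python
# Minimal tf-idf + cosine similarity over a tiny corpus.
import math
from collections import Counter

def tfidf_vectors(docs):
    # One sparse tf-idf vector (dict of term to weight) per document
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["buffer overflow in parser",
        "sql injection in login form",
        "stack buffer overflow crash"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]), cosine(vecs[0], vecs[1]))
```

&lt;p&gt;The two buffer-overflow descriptions score higher against each other than against the SQL-injection one, which is the mechanism by which a weakness matrix can be matched against free-text vulnerability reports.&lt;/p&gt;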
</description>
        <pubDate>Tue, 01 Sep 2020 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//paper-summary-on-comparison-of-document</link>
        <guid isPermaLink="true">https://kavlata.github.io//paper-summary-on-comparison-of-document</guid>
        
        
      </item>
    
      <item>
        <title>Remote NLP Python Troubleshooting</title>
        <description>&lt;h2 id=&quot;what-i-learnt-when-remote-engineer-ran-my-nlp-code-and-helped-her-troubleshoot&quot;&gt;What I learnt when a remote engineer ran my NLP code and I helped her troubleshoot&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Background
    &lt;ul&gt;
      &lt;li&gt;As part of my work, I was asked to deliver a context-based search over a few contract documents.&lt;/li&gt;
      &lt;li&gt;I developed an Elasticsearch-based solution. Nothing fancy.&lt;/li&gt;
      &lt;li&gt;I coded the application in Python, got the logic working, and with a few business-specific tweaks I was able to reach 85%+ accuracy.
        &lt;ol&gt;
          &lt;li&gt;Observation 1: I wrote the code with very basic sanity test cases, and exception handling only in the main function.
            &lt;ol&gt;
              &lt;li&gt;Modification 1: Basic test cases weren’t sufficient.
                &lt;ol&gt;
                  &lt;li&gt;Write a validation module for your input, covering as many possibilities as you can think of.&lt;/li&gt;
                  &lt;li&gt;In my case, e.g.: check that the Excel headers are in the expected format.&lt;/li&gt;
                  &lt;li&gt;Check the number of rows and columns, and look for extra rows, extra columns, or missing columns.&lt;/li&gt;
                  &lt;li&gt;If using pandas, empty values are read as NaN; have validation rules for these.&lt;/li&gt;
                  &lt;li&gt;If using a server, database, or 3rd-party API, always have a health-check function for it, and add it to your logs.&lt;/li&gt;
                &lt;/ol&gt;
              &lt;/li&gt;
              &lt;li&gt;Modification 2: Basic exception handling wasn’t clear enough to debug errors remotely.
                &lt;ol&gt;
                  &lt;li&gt;Have exception handling at least at every IO operation.&lt;/li&gt;
                  &lt;li&gt;In my case, e.g.: while reading/writing Excel files in pandas,&lt;/li&gt;
                  &lt;li&gt;while connecting to the Elasticsearch engine,&lt;/li&gt;
                  &lt;li&gt;and whenever an empty or overflow value occurs in any necessary data structure throughout the code.&lt;/li&gt;
                  &lt;li&gt;Always have a deliberate strategy for handling such exceptions:
                    &lt;ol&gt;
                      &lt;li&gt;Choice 1: Any exception breaks the run
                        &lt;ol&gt;
                          &lt;li&gt;Break only if this is critical and the remaining code cannot run without it&lt;/li&gt;
                          &lt;li&gt;Have a validation module at the beginning of the code so everything is checked before the main run starts.
                            &lt;ol&gt;
                              &lt;li&gt;You could generate a health log for all the external interfacing libraries. This way it is easier to troubleshoot remotely&lt;/li&gt;
                            &lt;/ol&gt;
                          &lt;/li&gt;
                        &lt;/ol&gt;
                      &lt;/li&gt;
                      &lt;li&gt;Choice 2: On an exception, can the code continue through the remaining input values?
                        &lt;ol&gt;
                          &lt;li&gt;Track the successful and failed input items&lt;/li&gt;
                          &lt;li&gt;Inside the exception handler, record the failed item and continue&lt;/li&gt;
                        &lt;/ol&gt;
                      &lt;/li&gt;
                    &lt;/ol&gt;
                  &lt;/li&gt;
                &lt;/ol&gt;
              &lt;/li&gt;
              &lt;li&gt;Modification 3: Logging should be verbose enough that you can troubleshoot an issue remotely just by reading the logs.
                &lt;ol&gt;
                  &lt;li&gt;TIP 1: It is best to log summary statistics of everything the code does.
                    &lt;ol&gt;
                      &lt;li&gt;In my case, e.g.: log the total number of input files, the number of rows/columns from Excel, and the number of result matches from Elasticsearch&lt;/li&gt;
                      &lt;li&gt;The number of successes/failures is very useful, as mentioned in the previous point.&lt;/li&gt;
                      &lt;li&gt;You could also log the approximate time to finish (if it makes sense)&lt;/li&gt;
                      &lt;li&gt;Provide time-based statistics: in my case, e.g., time to run end-to-end, time per Excel row, time per Elasticsearch query, etc.&lt;/li&gt;
                    &lt;/ol&gt;
                  &lt;/li&gt;
                &lt;/ol&gt;
              &lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Observation 2: Anticipate what may fail
            &lt;ol&gt;
              &lt;li&gt;Input
                &lt;ol&gt;
                  &lt;li&gt;Size: Input files can be larger than you expected. What is the plan to handle that?
                    &lt;ol&gt;
                      &lt;li&gt;In my case, I didn’t anticipate this and had to ask the remote engineer to split the Excel rows into multiple sheets.&lt;/li&gt;
                    &lt;/ol&gt;
                  &lt;/li&gt;
                  &lt;li&gt;Encoding: Always expect the encoding of input files to vary if your code reads them.
                    &lt;ol&gt;
                      &lt;li&gt;Make sure you have a module to identify the encoding format and read with the corresponding one.&lt;/li&gt;
                    &lt;/ol&gt;
                  &lt;/li&gt;
                &lt;/ol&gt;
              &lt;/li&gt;
              &lt;li&gt;Output
                &lt;ol&gt;
                  &lt;li&gt;Size:&lt;/li&gt;
                &lt;/ol&gt;
              &lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
          &lt;li&gt;Observation 3 : Externalize parameters as much as possible
            &lt;ol&gt;
              &lt;li&gt;Externalize wherever you see a pattern; it helps.&lt;/li&gt;
              &lt;li&gt;Trade off how much config needs tweaking: if someone has to change a configuration, they should not need to change it in multiple places.
                &lt;ol&gt;
                  &lt;li&gt;In my case, e.g.: all file paths, such as the Excel or input file paths, are relative to the code’s home folder.&lt;/li&gt;
                  &lt;li&gt;TIP: Create separate folders for input, output, models, config, logs, and src, so the configuration is easy to manage.&lt;/li&gt;
                &lt;/ol&gt;
              &lt;/li&gt;
            &lt;/ol&gt;
          &lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Share the codebase
    &lt;ul&gt;
      &lt;li&gt;.pyc files -&amp;gt; to deploy the code, I shipped the compiled files, so the original source isn’t available for anyone to read and tweak [unless you are contributing to open source]&lt;/li&gt;
      &lt;li&gt;Zip the code or create an exe using PyInstaller -&amp;gt; I chose the former&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Share the instructions for remote engineer to run your code
    &lt;ul&gt;
      &lt;li&gt;TIP 1: Assume the remote person has little technical knowledge, so your instructions should be lucid and easy to follow&lt;/li&gt;
      &lt;li&gt;TIP 2: Before running on the client machine, replicate the same environment as closely as possible and run the code there.
        &lt;ol&gt;
          &lt;li&gt;In my case, e.g.: I replicated a conda environment. Provide an estimated time so the remote person knows how long to wait and doesn’t think the installation is stuck.&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;TIP 3: Create a good instructions file with screenshots for installations.&lt;/li&gt;
      &lt;li&gt;TIP 4: Provide a good README file or an API list with definitions, e.g. a Swagger API spec for Python&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Compliance check for using open-source libraries in the client environment
    &lt;ul&gt;
      &lt;li&gt;TIP 5: Always check that the license (BSD/Apache/MIT, etc.) permits installation in the client environment, for commercial purposes or otherwise.
        &lt;ol&gt;
          &lt;li&gt;It is also worth reading about static and dynamic linking of libraries if you want to go deeper; it will save time in license approvals.&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
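&lt;p&gt;A minimal sketch of the validate-first, track-failures, log-summaries pattern described above. The header names and validation rules are made up for illustration; in the real project the rows would come from an Excel file read via pandas.&lt;/p&gt;

```python
# Validate inputs up front, continue past bad rows (Choice 2), and log
# summary statistics so the run can be debugged remotely from logs alone.
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")

EXPECTED_HEADERS = {"contract_id", "clause_text"}   # hypothetical schema

def validate(rows):
    """Fail fast before the main run: check emptiness and headers."""
    if not rows:
        raise ValueError("input is empty")
    missing = EXPECTED_HEADERS - set(rows[0])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

def process(rows):
    """On a bad row, record the failure and continue with the rest."""
    ok, failed = [], []
    for i, row in enumerate(rows):
        try:
            if not row["clause_text"]:
                raise ValueError("empty clause_text")
            ok.append(row["contract_id"])
        except Exception as exc:
            failed.append((i, str(exc)))
    # Summary statistics are what make remote troubleshooting possible.
    log.info("processed=%d success=%d failed=%d", len(rows), len(ok), len(failed))
    for i, reason in failed:
        log.info("row %d failed: %s", i, reason)
    return ok, failed
```

&lt;p&gt;With this shape, a single bad row produces a logged failure and a final summary line instead of killing the whole run, which is exactly what you want when the person running the code is not the person who wrote it.&lt;/p&gt;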

&lt;p&gt;Disclaimer: All of the above comes from one very specific project. I hope it is helpful to others, hence sharing it here. I would be very happy to receive feedback, suggestions, or better alternatives.&lt;/p&gt;

&lt;p&gt;Happy Remote troubleshooting!&lt;/p&gt;

</description>
        <pubDate>Fri, 21 Aug 2020 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//remote-NLP-python-troubleshooting</link>
        <guid isPermaLink="true">https://kavlata.github.io//remote-NLP-python-troubleshooting</guid>
        
        
      </item>
    
      <item>
        <title>Semantic Search</title>
        <description>&lt;h2 id=&quot;interesting-problems&quot;&gt;Interesting problems&lt;/h2&gt;
&lt;h6 id=&quot;semantic-search-in-unstructured-documents&quot;&gt;Semantic Search in Unstructured documents&lt;/h6&gt;

&lt;ol&gt;
  &lt;li&gt;Important keyword extractors
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/ibatra/BERT-Keyword-Extractor&quot;&gt;BERT Keyword Extractor (GitHub)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/ibatra/BERT-Keyword-Extractor/blob/master/BERT-Keyword%20Extractor.ipynb&quot;&gt;BERT Keyword Extractor (notebook)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/LIAAD/yake&quot;&gt;Unsupervised Approach for Automatic Keyword Extraction using Text Features.&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;http://yake.inesctec.pt/&quot;&gt;YAKE&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/swlh/a-machine-learning-model-to-understand-fancy-abbreviations-trained-on-tolkien-36601b73ecbb&quot;&gt;Abbreviations expansion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Building dictionary / vocabulary
    &lt;ul&gt;
      &lt;li&gt;single words
        &lt;ol&gt;
          &lt;li&gt;&lt;a href=&quot;https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py&quot;&gt;Using our custom-built word2vec with Gensim + spaCy&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;https://spacy.io/usage/vectors-similarity&quot;&gt;spaCy vector similarity&lt;/a&gt;&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;learn phrases
        &lt;ol&gt;
          &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/word2vec-for-phrases-learning-embeddings-for-more-than-one-word-727b6cf723cf&quot;&gt;link-1&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;http://kavita-ganesan.com/how-to-incorporate-phrases-into-word2vec-a-text-mining-approach/relatedposts_hit=1&amp;amp;relatedposts_origin=1229&amp;amp;relatedposts_position=2#.XuDIuvkzY2x&quot;&gt;link-2&lt;/a&gt;&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/tutorials/tutorial-06-sequence-tagging.ipynb&quot;&gt;Policy Number Extraction using BiLSTM-CRF&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://tomassetti.me/creating-a-reverse-dictionary/&quot;&gt;Providing reverse dictionary&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/gabriele-tomassetti/reverse-dictionary&quot;&gt;Github-link&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Semantic Search
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://medium.com/@evergreenllc2020/semantic-search-engine-with-s-abbfb3cd9377&quot;&gt;using BERT&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/quick-semantic-search-using-siamese-bert-networks-1052e7b4df1&quot;&gt;Siamese BERT for Search&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.kaggle.com/narasimha1997/faster-semantic-search-using-faiss/output&quot;&gt;Semantic Search using FAISS- Facebook Index &lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://medium.com/analytics-vidhya/building-a-simple-stack-overflow-search-engine-to-predict-posts-related-to-given-query-post-56b3e508520c&quot;&gt;Semantic Search using embeddings&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Text generation for suggesting keyword queries&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/gsi-technology/integrating-textual-and-visual-information-into-a-powerful-visual-search-engine-c477486a18ff&quot;&gt;Visual search engine &lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;NBoost for elasticsearch
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/koursaros-ai/nboost&quot;&gt;Link-1&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://medium.com/koursaros-ai/boost-search-api-performance-e-g-410868e82b22&quot;&gt;Link-2&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/how-we-built-an-ai-powered-search-engine-without-being-google-5ad93e5a8591&quot;&gt;Link-3&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;links-that-were-useful-while-working-on-semantic-search&quot;&gt;Links that were useful while working on Semantic Search&lt;/h2&gt;

&lt;p&gt;Here are some of the links that might be useful for semantic search engine related work.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Fuzzy query on elasticsearch
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html&quot;&gt;query&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Synonym query elasticsearch
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.elastic.co/blog/boosting-the-power-of-elasticsearch-with-synonyms&quot;&gt;query&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Semantic search using BERT
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://medium.com/@evergreenllc2020/semantic-search-engine-with-s-abbfb3cd9377&quot;&gt;BERT&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/quick-semantic-search-using-siamese-bert-networks-1052e7b4df1&quot;&gt;Siamese-Bert-Networks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Semantic Search using embeddings
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://medium.com/analytics-vidhya/building-a-simple-stack-overflow-search-engine-to-predict-posts-related-to-given-query-post-56b3e508520c&quot;&gt;Simple stackoverflow Search Engine&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Use embeddings with the Elastic index
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch&quot;&gt;Text Similarity with Vectors in Elastic&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Use Doc2Vec to find similar documents
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/42781292/doc2vec-get-most-similar-documents&quot;&gt;Doc2Vec: get most similar documents (Stack Overflow)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://kanoki.org/2019/03/07/sentence-similarity-in-python-using-doc2vec/&quot;&gt;Sentence similarity in Python using Doc2Vec&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;People I follow for this topic
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://madewithml.com/projects/2025/haystack-neural-question-answering-at-scale/&quot;&gt;Goku Mohandas&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.pratik.ai/&quot;&gt;Pratik Bhavsar&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
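&lt;p&gt;For the fuzzy-query idea above, here is a local, stdlib analogue: Elasticsearch’s fuzzy query matches terms within a small edit distance, which Python’s difflib.get_close_matches approximates for a small in-memory vocabulary. The vocabulary here is made up for illustration.&lt;/p&gt;

```python
# Stdlib approximation of fuzzy term matching: find vocabulary entries
# "close enough" to a possibly misspelled query term.
from difflib import get_close_matches

vocabulary = ["elastic", "search", "semantic", "embedding", "synonym"]

def fuzzy_lookup(term, cutoff=0.7):
    # cutoff is a similarity ratio in [0, 1], not an edit distance,
    # but it plays the same role as Elasticsearch's fuzziness setting
    return get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)

print(fuzzy_lookup("elasic"))    # a typo still finds "elastic"
print(fuzzy_lookup("semntic"))
```

&lt;p&gt;This is only a toy for a handful of terms; at index scale, Elasticsearch does the same job with finite-state automata over the term dictionary.&lt;/p&gt;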
</description>
        <pubDate>Fri, 03 Jul 2020 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//semantic-search</link>
        <guid isPermaLink="true">https://kavlata.github.io//semantic-search</guid>
        
        
      </item>
    
      <item>
        <title>WAIR</title>
        <description>&lt;h2 id=&quot;what-am-i-reading--today&quot;&gt;What Am I Reading .. Today&lt;/h2&gt;

&lt;h3 id=&quot;3rd-march&quot;&gt;3rd March&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.academia.edu/35854753/Neural_Network_Methods_for_Natural_Language_Processing&quot;&gt;Neural Network Methods for Natural Language Processing by Yoav Goldberg&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;learning&quot;&gt;Learning&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://nptel.ac.in/content/storage2/nptel_data3/html/mhrd/ict/text/106105152/lec3.pdf&quot;&gt;Beautiful explanation by Sudeshna Sarkar mam IIT KGP&lt;/a&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Hypothesis space&lt;/li&gt;
  &lt;li&gt;Inductive Bias
    &lt;ol&gt;
      &lt;li&gt;Restriction&lt;/li&gt;
      &lt;li&gt;Preference&lt;/li&gt;
      &lt;li&gt;Example - Occam’s Razor - the simplest hypothesis consistent with the data about the target function is the best&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Generalization Error
    &lt;ol&gt;
      &lt;li&gt;Bias - errors due to incorrect assumptions or restrictions on the hypothesis space&lt;/li&gt;
      &lt;li&gt;Variance - models estimated from different training sets will differ from each other&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Highlight
    &lt;ol&gt;
      &lt;li&gt;Learning is refining the hypothesis space&lt;/li&gt;
      &lt;li&gt;In a particular learning problem, you first define the hypothesis space, that is, the class of functions you are going to consider; then, given the data points, you try to come up with the best hypothesis given the data that you have.&lt;/li&gt;
      &lt;li&gt;To describe a function, we have to decide
        &lt;ol&gt;
          &lt;li&gt;the feature vocabulary&lt;/li&gt;
          &lt;li&gt;the function class, i.e. the type or language of the function&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;
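&lt;p&gt;The bias/variance distinction in point 3 can be seen in a small experiment (my own illustration, not from the lecture): fit the same simple hypothesis class on several training sets drawn from the same noisy target, and watch the estimates differ.&lt;/p&gt;

```python
import random

random.seed(0)

def sample_training_set(n=20):
    """Noisy samples from the true target y = 2x."""
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_slope(xs, ys):
    """Least-squares slope for the hypothesis class y = a*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Variance: the estimated model differs from one training set to the next.
slopes = [fit_slope(*sample_training_set()) for _ in range(5)]
print(slopes)  # each estimate hovers around the true slope 2, but they differ
```

&lt;p&gt;Restricting the hypothesis space further (say, to constant functions) would reduce this variance but introduce bias, since no constant can match y = 2x.&lt;/p&gt;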
</description>
        <pubDate>Tue, 03 Mar 2020 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//wair</link>
        <guid isPermaLink="true">https://kavlata.github.io//wair</guid>
        
        
      </item>
    
      <item>
        <title>Leetcode 1011</title>
        <description>&lt;ol&gt;
  &lt;li&gt;Capacity To Ship Packages Within D Days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A conveyor belt has packages that must be shipped from one port to another within D days.&lt;/p&gt;

&lt;p&gt;The i-th package on the conveyor belt has a weight of weights[i].  Each day, we load the ship with packages on the conveyor belt (in the order given by weights). We may not load more weight than the maximum weight capacity of the ship.&lt;/p&gt;

&lt;p&gt;Return the least weight capacity of the ship that will result in all the packages on the conveyor belt being shipped within D days.&lt;/p&gt;

&lt;p&gt;Example 1:&lt;/p&gt;

&lt;p&gt;Input: weights = [1,2,3,4,5,6,7,8,9,10], D = 5
Output: 15
Explanation: 
A ship capacity of 15 is the minimum to ship all the packages in 5 days like this:
1st day: 1, 2, 3, 4, 5
2nd day: 6, 7
3rd day: 8
4th day: 9
5th day: 10&lt;/p&gt;

&lt;p&gt;Note that the cargo must be shipped in the order given, so using a ship of capacity 14 and splitting the packages into parts like (2, 3, 4, 5), (1, 6, 7), (8), (9), (10) is not allowed. 
Example 2:&lt;/p&gt;

&lt;p&gt;Input: weights = [3,2,2,4,1,4], D = 3
Output: 6
Explanation: 
A ship capacity of 6 is the minimum to ship all the packages in 3 days like this:
1st day: 3, 2
2nd day: 2, 4
3rd day: 1, 4
Example 3:&lt;/p&gt;

&lt;p&gt;Input: weights = [1,2,3,1,1], D = 4
Output: 3
Explanation: 
1st day: 1
2nd day: 2
3rd day: 3
4th day: 1, 1&lt;/p&gt;

&lt;p&gt;Note:&lt;/p&gt;

&lt;p&gt;1 &amp;lt;= D &amp;lt;= weights.length &amp;lt;= 50000
1 &amp;lt;= weights[i] &amp;lt;= 500&lt;/p&gt;

&lt;p&gt;My pointers to remember:&lt;/p&gt;

&lt;p&gt;1011:&lt;/p&gt;

&lt;p&gt;We don’t know the maximum capacity in advance.&lt;/p&gt;

&lt;p&gt;If we have to ship everything in one day (D = 1), the minimum capacity needed is sum(weights[i]) = 16.
At the other extreme, if we ship one item per day, it takes len(weights) = D = 6 days,
and the minimum capacity is max(weights[i]) = 4 (with a capacity below max(weights),
the heaviest package could never be shipped).&lt;/p&gt;

&lt;p&gt;So the ship capacity lies in the range 4 to 16, with D ranging from 6 down to 1.
But D is given as input (here D = 3), and the order of weights must be maintained while loading.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;This suggests we can binary search for the capacity of the ship&lt;/li&gt;
  &lt;li&gt;For each candidate capacity, we must check that the packages can be partitioned, in order, into at most D = 3 days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href=&quot;https://leetcode.com/problems/capacity-to-ship-packages-within-d-days/discuss/256737/C++-Binary-Search/267064&quot;&gt;C++ binary search discussion&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So binary search + partition&lt;/p&gt;

&lt;p&gt;Final Solution:
&lt;a href=&quot;https://leetcode.com/problems/capacity-to-ship-packages-within-d-days/discuss/256765/Python-Binary-search-with-detailed-explanation&quot;&gt;Python binary search with detailed explanation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get to exactly D days and minimize the max sum of any partition, we do binary search in the sum space which is bounded by [max(a), sum(a)]&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://leetcode.com/problems/capacity-to-ship-packages-within-d-days/discuss/256729/JavaC++Python-Binary-Search/249951&quot;&gt;Java/C++/Python binary search discussion&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Time complexity:
O(n * log SIZE), where SIZE is the size of the search space (sum of weights - max weight).
Space complexity:
O(1)&lt;/p&gt;
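&lt;p&gt;The binary search plus greedy partition described above can be sketched as follows (my own minimal version of the standard approach, not code from the linked discussions):&lt;/p&gt;

```python
def ship_within_days(weights, D):
    """Minimum ship capacity to ship all packages, in order, within D days."""
    def days_needed(capacity):
        # Greedy partition: start a new day whenever the next package
        # would overflow the current load.
        days, load = 1, 0
        for w in weights:
            if load + w > capacity:
                days += 1
                load = 0
            load += w
        return days

    lo, hi = max(weights), sum(weights)  # bounds of the capacity search space
    while hi > lo:
        mid = (lo + hi) // 2
        if days_needed(mid) > D:   # infeasible: need more capacity
            lo = mid + 1
        else:                      # feasible: try a smaller capacity
            hi = mid
    return lo
```

&lt;p&gt;On the three examples above this returns 15, 6, and 3 respectively.&lt;/p&gt;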

&lt;p&gt;Related questions:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;K-diff Pairs in an Array&lt;/li&gt;
  &lt;li&gt;Valid Perfect Square&lt;/li&gt;
  &lt;li&gt;Kth Smallest Number in Multiplication Table&lt;/li&gt;
  &lt;li&gt;Koko Eating Bananas&lt;/li&gt;
  &lt;li&gt;Ugly Number III&lt;/li&gt;
  &lt;li&gt;Divide Chocolate&lt;/li&gt;
  &lt;li&gt;Find the Smallest Divisor Given a Threshold&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;TopCoder tutorial:
&lt;a href=&quot;https://www.topcoder.com/community/competitive-programming/tutorials/binary-search&quot;&gt;Binary Search&lt;/a&gt;&lt;/p&gt;

</description>
        <pubDate>Sat, 04 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//leetcode-1011</link>
        <guid isPermaLink="true">https://kavlata.github.io//leetcode-1011</guid>
        
        
      </item>
    
      <item>
        <title>My NLP Curriculum</title>
        <description>&lt;p&gt;Data Science, Machine Learning, and Deep Learning are the buzzwords of the moment.&lt;/p&gt;

&lt;p&gt;There are a lot of MOOCs and online courses / certifications available for these topics as well. I have always worked on text and wanted to
enroll in such a course with Natural Language Processing in focus. Having gone through the syllabus and contents of a few courses online, I felt
the need to create a curriculum of my own. The available online material is dispersed, and I kept stepping away from
whichever course I was on to fill gaps in my knowledge. So this is my attempt at educating / updating myself in NLP. Feel free to use this and modify it to
your needs.&lt;/p&gt;

&lt;p&gt;Major Topics that I need to work on:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Programming - Python, PyTorch&lt;/li&gt;
  &lt;li&gt;Math - Linear Algebra, Probability, Statistics&lt;/li&gt;
  &lt;li&gt;NLP - Linguistics, Statistical NLP, Deep NLP&lt;/li&gt;
  &lt;li&gt;Data Science - Pandas&lt;/li&gt;
  &lt;li&gt;MOOC - Andrew Ng, Stanford NLP, Oxford Deep Mind Lectures, EdX, Fast.ai
Reference - Jason Brownlee, Dan Jurafsky Speech &amp;amp; Language Processing book, Cracking the coding interview&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I am giving myself about 8 months to finish my own curriculum, and the test is to come up with my own project implementation of
something interesting in NLP using deep learning (more like a thesis, if possible). I will grade myself, and I must say I am my own worst critic.
So trust me, this is a difficult assignment!&lt;/p&gt;

&lt;p&gt;I will be updating this list as and when I find something new to add to the list.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Python:
    &lt;ol&gt;
      &lt;li&gt;Generators&lt;/li&gt;
      &lt;li&gt;Vectorization&lt;/li&gt;
      &lt;li&gt;Data Structures
        &lt;ol&gt;
          &lt;li&gt;Numpy structures and their implementation&lt;/li&gt;
          &lt;li&gt;Scikit Learn structures and their implementation&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Algorithms
        &lt;ol&gt;
          &lt;li&gt;Indexing and searching in dictionary in python&lt;/li&gt;
          &lt;li&gt;Interview cake implementation for best time and space complexity&lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;Matrix assignment&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Linear Algebra:
    &lt;ol&gt;
      &lt;li&gt;Matrix Vectors&lt;/li&gt;
      &lt;li&gt;Tensors&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Probability:
    &lt;ol&gt;
      &lt;li&gt;Conditional Probability - Heads Tails - Questions for interviews&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Statistics:
    &lt;ol&gt;
      &lt;li&gt;Definitions, Metrics&lt;/li&gt;
      &lt;li&gt;correlation&lt;/li&gt;
      &lt;li&gt;Distributions&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Statistical NLP:
    &lt;ol&gt;
      &lt;li&gt;Vectorizer / Transformer - Scikit Learn&lt;/li&gt;
      &lt;li&gt;HMM&lt;/li&gt;
      &lt;li&gt;CRF&lt;/li&gt;
      &lt;li&gt;Topic Modelling - LDA&lt;/li&gt;
      &lt;li&gt;Sequence labelling&lt;/li&gt;
      &lt;li&gt;Feature selection&lt;/li&gt;
      &lt;li&gt;Dimensionality reduction - PCA, ICA&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Deep NLP:
    &lt;ol&gt;
      &lt;li&gt;Word2vec - CBOW/ Skip gram&lt;/li&gt;
      &lt;li&gt;Representation Learning&lt;/li&gt;
      &lt;li&gt;CNN for text&lt;/li&gt;
      &lt;li&gt;RNN for text&lt;/li&gt;
      &lt;li&gt;Attention model for text&lt;/li&gt;
      &lt;li&gt;Pytorch for text&lt;/li&gt;
      &lt;li&gt;BERT / Transformers - hugging face&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Linguistics:
    &lt;ol&gt;
      &lt;li&gt;Discourse Segmentation&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Pandas:
    &lt;ol&gt;
      &lt;li&gt;Dataframe manipulations&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;MOOC:
    &lt;ol&gt;
      &lt;li&gt;Andrew Ng - CNN&lt;/li&gt;
      &lt;li&gt;Stanford NLP - Richard Socher&lt;/li&gt;
      &lt;li&gt;Oxford&lt;/li&gt;
      &lt;li&gt;Fast.ai - Rachel Thomas&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;
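&lt;p&gt;As a warm-up for the “Vectorizer / Transformer” item under Statistical NLP, here is a toy pure-Python count vectorizer (my own sketch of what scikit-learn’s CountVectorizer does, not its actual implementation):&lt;/p&gt;

```python
from collections import Counter

def fit_vocabulary(docs):
    """Collect a sorted token vocabulary, mapping token to column index."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab):
    """Map each document to a term-count vector over the vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

docs = ["the cat sat", "the dog sat on the mat"]
vocab = fit_vocabulary(docs)
print(transform(docs, vocab))
```

&lt;p&gt;Everything downstream (tf-idf weighting, feature selection, dimensionality reduction) operates on matrices shaped like this.&lt;/p&gt;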

</description>
        <pubDate>Mon, 06 May 2019 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//my-own-curriculum</link>
        <guid isPermaLink="true">https://kavlata.github.io//my-own-curriculum</guid>
        
        
      </item>
    
      <item>
        <title>Foundations Of Statistical Nlp</title>
        <description>&lt;h2 id=&quot;notes-from-the-book-foundations-of-statistical-natural-language-processing-by-manning-and-schutze&quot;&gt;Notes from the book “Foundations of Statistical Natural Language Processing” By Manning and Schutze&lt;/h2&gt;

&lt;p&gt;I thought of sharing my notes from this classic NLP book. I really enjoy the examples, quotes, and narration used in it.
It takes you through the absolute basics of probability and linguistics before moving into complex modelling for language.&lt;/p&gt;

&lt;h3 id=&quot;preliminaries&quot;&gt;Preliminaries&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;Questions relevant to Linguistics
    &lt;ul&gt;
      &lt;li&gt;What kind of things do people say?&lt;/li&gt;
      &lt;li&gt;What do these things say/ask/request about world?&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Lexical resources
    &lt;ul&gt;
      &lt;li&gt;Brown Corpus (American English)&lt;/li&gt;
      &lt;li&gt;Lancaster Oslo Bergen (British English)&lt;/li&gt;
      &lt;li&gt;Susanne Corpus (a 130,000-word subset of Brown)&lt;/li&gt;
      &lt;li&gt;Penn Treebank (Wall Street Journal articles)&lt;/li&gt;
      &lt;li&gt;Canadian Hansards (Canadian Parliament Proceedings - Bilingual Corpus)&lt;/li&gt;
      &lt;li&gt;Wordnet ( Dictionary, Hierarchy of synset of words, meronymy- part:whole relations)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Zipf law ( Principle of least effort )
    &lt;ul&gt;
      &lt;li&gt;f · r = k, where f is frequency, r is rank (position in the frequency-sorted list), and k is a constant&lt;/li&gt;
      &lt;li&gt;Number of meanings of a word: m ∝ √f&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Collocation
    &lt;ul&gt;
      &lt;li&gt;Phrasal verbs, compound nouns, idioms&lt;/li&gt;
      &lt;li&gt;Frequent bigrams filtered by particular POS patterns (frequency alone is noisy, e.g. “next year”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Concordance
    &lt;ul&gt;
      &lt;li&gt;KWIC - Keyword in Context&lt;/li&gt;
      &lt;li&gt;Verb frames&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
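&lt;p&gt;Zipf’s law from the notes above (f · r = k) can be checked directly: rank words by frequency and multiply each frequency by its rank. On the deliberately Zipfian toy stream below the product is exactly constant; on a real corpus such as Brown it is only roughly so.&lt;/p&gt;

```python
from collections import Counter

def zipf_table(tokens, top=5):
    """Rank words by frequency; Zipf predicts f * r is roughly constant."""
    ranked = Counter(tokens).most_common(top)
    return [(rank, word, f, f * rank)
            for rank, (word, f) in enumerate(ranked, start=1)]

# A synthetic stream where word number r appears 120 // r times.
tokens = [w for r, w in enumerate(["the", "of", "and", "to"], start=1)
          for _ in range(120 // r)]
print(zipf_table(tokens, top=4))
```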
</description>
        <pubDate>Fri, 21 Sep 2018 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//foundations-of-statistical-NLP</link>
        <guid isPermaLink="true">https://kavlata.github.io//foundations-of-statistical-NLP</guid>
        
        
      </item>
    
      <item>
        <title>Wair</title>
        <description>&lt;h2 id=&quot;what-am-i-reading--today&quot;&gt;What Am I Reading .. Today&lt;/h2&gt;

&lt;h3 id=&quot;20th-september&quot;&gt;20th September&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://towardsdatascience.com/getting-started-with-reading-deep-learning-research-papers-the-why-and-the-how-dfd1ac15dbc0&quot;&gt;A beginner’s article on how to read research papers and follow the field in general.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A couple of takeaways from this article:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.arxiv-sanity.com&quot;&gt;Arxiv Sanity&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.reddit.com/r/MachineLearning/comments/807ex4/d_machine_learning_wayr_what_are_you_reading_week/&quot;&gt;Thread on subReddit. This was my source of inspiration for today’s blog&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;last-week&quot;&gt;Last week&lt;/h3&gt;
&lt;p&gt;I wanted to try some of the resources mentioned in &lt;a href=&quot;https://www.kdnuggets.com/2018/02/5-fantastic-practical-natural-language-processing-resources.html&quot;&gt;this article from KDnuggets&lt;/a&gt;.
I must say the videos from Dr. Jon Krohn were absolutely wonderful.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;They start with basic details without the need for a lot of pre-requisites.&lt;/li&gt;
  &lt;li&gt;The transition through the three-part video series is very fluid: deep learning to natural language processing, and
then to the more sophisticated GANs and Reinforcement Learning.
If you can’t access the videos, the Jupyter notebook compilations are very useful as well.
I am eagerly awaiting his book, which is coming out in December this year.&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 20 Sep 2018 00:00:00 +0000</pubDate>
        <link>https://kavlata.github.io//wair</link>
        <guid isPermaLink="true">https://kavlata.github.io//wair</guid>
        
        
      </item>
    
  </channel>
</rss>
