Natural Language Processing

NLP - a reality check

    1.
    A powerful benchmark, paper, Medium - normalizing the datasets shows that there was no real advancement in terms of metrics for many NLP algorithms.

TOOLS

SPACY

    1.
    Analytics Vidhya on spaCy vs NER - tutorial + code on how to use spaCy for POS, dependency parsing and NER, compared to NLTK/CoreNLP (SNER etc.). The results reflect a global score, not one specific to a label such as LOC (see the sketch after this list).
    2.
    The spaCy course​
    3.
    spaCy optimization - NLP using Cython and spaCy.
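A minimal sketch of the spaCy calls the tutorials above compare (POS tags, dependency labels, named entities), assuming the small English model en_core_web_sm has been downloaded:
import spacy

nlp = spacy.load("en_core_web_sm")   # python -m spacy download en_core_web_sm
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    # coarse POS, fine-grained tag, dependency label and syntactic head per token
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)

for ent in doc.ents:
    # named entities with labels such as ORG, GPE, MONEY
    print(ent.text, ent.label_)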

NLP embedding repositories

    1.
    ​Nlpl​

NLP DATASETS

    1.
    The Big Bad 600, Medium

NLP Libraries

    3.
    Comparison of spaCy, NLTK and CoreNLP

Multilingual models

    1.
    FB's LASER
    2.
    XLM, XLM-R
    3.
    Google universal embedding space.

Augmenting text in NLP

    1.
    Synonyms, similar embedded words (w2v), back translation, contextualized word embeddings, text generation (see the sketch after this list)
    2.
    Yonatan Hadar also has a Medium post about this
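A hedged sketch of the first idea above (swapping words for near neighbours in embedding space), assuming gensim and its downloadable glove-wiki-gigaword-100 vectors; the function and names are illustrative, not taken from the referenced posts:
import random
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained word vectors (sizeable download)

def augment(tokens, p=0.2, topn=5):
    # randomly replace some tokens with one of their topn nearest embedding neighbours
    out = []
    for tok in tokens:
        if tok in wv and random.random() < p:
            out.append(random.choice(wv.most_similar(tok, topn=topn))[0])
        else:
            out.append(tok)
    return out

print(augment("the movie was really good".split()))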

TF-IDF

TF-IDF - how important a word is to a document in a corpus.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
i.e., frequency of the word in the document divided by all words in the document (normalized because documents differ in size).
IDF(t) = log_e(Total number of documents / Number of documents containing term t)
i.e., a measure of how important (rare) a term is across the corpus.
TF-IDF = TF * IDF
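A short sketch of the definition above using scikit-learn's TfidfVectorizer (note that recent sklearn versions use a smoothed IDF variant and L2-normalize each row, so the numbers differ slightly from the raw formula):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # sparse (n_docs, n_terms) matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray().round(2))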
    3.
    ​Print top features​
Data sets:
    2.
    ​NLP embeddings​

Sparse textual content

    1.
    mean(IDF(i) * w2v word vector(i)), with or without removing PC1 from the whole w2v average (Amir Pupko):
import numpy as np

def mean_weighted_embedding(model, words, idf=1.0):
    # average the word vectors for the given words, weighted by the supplied IDF values
    if words:
        return np.mean(idf * model[words], axis=0)
    print('we have an empty list')
    return []

# assumes: vectorizer is a fitted TfidfVectorizer, splitter a tokenizer, ft a w2v/fastText model
idf_mapping = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
logs_sequences_df['idf_vectors'] = logs_sequences_df.message.apply(
    lambda x: [idf_mapping[token] for token in splitter(x)])
logs_sequences_df['mean_weighted_idf_w2v'] = [
    mean_weighted_embedding(ft, splitter(logs_sequences_df['message'].iloc[i]),
                            1 / np.array(logs_sequences_df['idf_vectors'].iloc[i]).reshape(-1, 1))
    for i in range(logs_sequences_df.shape[0])]
    1.
    ​Multiply by TFIDF​
    2.
    Enriching using LSTM next-word prediction (character- or word-wise)
    3.
    Using external wiktionary/pedia data for certain words, phrases
    4.
    Finding clusters of relevant data and figuring out if you can enrich based on the content of the clusters

Basic nlp

    3.
    Multiclass text classification with SVM / NB / mean-w2v / d2v - a tutorial with code and notebook.
    5.
      1.
      Logistic regression with word ngrams
      2.
      Logistic regression with character ngrams (see the sketch after this list)
      3.
      Logistic regression with word and character ngrams
      4.
      Recurrent neural network (bidirectional GRU) without pre-trained embeddings
      5.
      Recurrent neural network (bidirectional GRU) with GloVe pre-trained embeddings
      6.
      Multi channel Convolutional Neural Network
      7.
      RNN (Bidirectional GRU) + CNN model
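A minimal sketch of approaches 1-3 above (logistic regression over word and/or character n-grams) with scikit-learn; the tiny dataset is a placeholder, not the tutorial's data:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie", "terrible plot", "loved every minute", "boring and slow"]
labels = [1, 0, 1, 0]

# character n-grams inside word boundaries; switch to analyzer="word" for word n-grams
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["really great", "so boring"]))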

Chunking

NLP for hackers tutorials

    2.
    Complete guide for training your own Part-Of-Speech Tagger - using the Penn Treebank tagset. Using NLTK or Stanford POS taggers, creating features from the actual words (manual stemming, etc.), using the tags as labels for a random forest, thus creating our own POS classifier. Not entirely sure why we need to create a classifier from a "classifier".
    3.
    WordNet introduction - POS, lemmatization, synonyms, antonyms, hypernyms, hyponyms
    4.
    Sentence similarity using WordNet - uses a cumulative sum over synonym similarities for comparison. Nowadays typically replaced by mean-w2v sentence similarity.
    5.
    Stemmers vs lemmatizers - stemmers are faster; lemmatizers are POS/dictionary based, slower, and convert words to their base form.
    6.
    ​Chunking - shallow parsing, compared to deep, similar to NER
    7.
    NER - using NLTK chunking as a labeller for a classifier, training one of our own. Uses IOB features as well as others to create a new NER classifier which should be better than the original thanks to the additional features. Also uses a new English dataset, GMB (see the sketch after this list).
    11.
    ​Tf-idf​
    12.
    ​Nltk for beginners​
    13.
    NLP corpora
    14.
    ​bow/bigrams​
    15.
    ​Textrank​
    16.
    ​Word cloud​
    19.
    ​POS using CRF​
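A minimal sketch of NLTK's built-in POS tagging and NE chunking referenced in the chunking/NER items above (not the custom classifier trained in the tutorial); the required corpora/models need a one-time nltk.download:
import nltk   # one-time: nltk.download('punkt'), 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'

tokens = nltk.word_tokenize("Mark works at Google in New York.")
tagged = nltk.pos_tag(tokens)     # Penn Treebank POS tags
tree = nltk.ne_chunk(tagged)      # shallow parse into named-entity chunks (PERSON, ORGANIZATION, GPE, ...)
print(tree)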

Synonyms

    1.
    Python module to get meanings, synonyms and more for a given word using Vocabulary (also a comparison against WordNet; see the WordNet sketch after this list) https://vocabulary.readthedocs.io/en/…
For a given word, using Vocabulary, you can get its
    Meaning
    Synonyms
    Antonyms
    Part of speech: whether the word is a noun, interjection or an adverb, etc.
    Translate : Translate a phrase from a source language to the desired language.
    Usage example : a quick example on how to use the word in a sentence
    Pronunciation
    Hyphenation: shows the particular stress points (if any)
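Since the package above is compared against WordNet, here is a minimal WordNet sketch via NLTK for synonyms and antonyms (this uses NLTK's API, not the Vocabulary package's):
from nltk.corpus import wordnet   # one-time: nltk.download('wordnet')

synonyms, antonyms = set(), set()
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
        antonyms.update(ant.name() for ant in lemma.antonyms())
print(sorted(synonyms)[:10])
print(sorted(antonyms))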
Swiss army knife libraries
    1.
    textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals (tokenization, part-of-speech tagging, dependency parsing, etc.) delegated to another library, textacy focuses on the tasks that come before and follow after.
Collocation
    1.
    What is collocation? "The habitual juxtaposition of a particular word with another word or words with a frequency greater than chance." A Medium tutorial, quite good, comparing freq / t-test / PMI / chi2, with GitHub code.
    2.
    A website dedicated to collocations, methods, references, metrics.
    4.
    Text2vec in R - has ideas on how to use collocations for downstream tasks, LDA, W2V, etc. Also explains PMI and other metrics; note that the gensim metric is unsupervised and probabilistic.
    5.
    NLTK on collocations (see the sketch after this list)
    6.
    A blog post about keeping or removing stopwords for collocation; useful but no firm conclusion. IMO we should remove them beforehand.
    7.
    A blog post with code of using nltk-based collocation
    8.
    Small code for using nltk collocation​
    9.
    Another code / score example for nltk collocation​
    10.
    Jupyter notebook on manually finding collocation - not useful
    11.
    Paper: Ngram2Vec (GitHub) - "We introduce ngrams into four representation methods. The experimental results demonstrate ngrams' effectiveness for learning improved word representations. In addition, we find that the trained ngram embeddings are able to reflect their semantic meanings and syntactic patterns. To alleviate the costs brought by ngrams, we propose a novel way of building the co-occurrence matrix, enabling the ngram-based models to run on cheap hardware."
    12.
    YouTube video on bigrams, collocations and mutual information
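A minimal sketch of NLTK's collocation finder (PMI-ranked bigrams) mentioned above; the toy sentence is illustrative only:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize("machine learning and deep learning models need lots of training data")
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)              # raise this on a real corpus
print(finder.nbest(measures.pmi, 5))     # top 5 bigrams by pointwise mutual information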
Language detection
    1.
    Using Google's langdetect - 55 languages: af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
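A minimal sketch, assuming the item above refers to the langdetect package (a port of Google's language-detection library):
from langdetect import detect, detect_langs

print(detect("Ceci est un petit texte en francais."))         # e.g. 'fr'
print(detect_langs("This is clearly an English sentence."))   # ranked candidates with probabilities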
Stemming
How to measure a stemmer?
    1.
    References [1, 2 (Apr 11), 3 (Index Compression Factor, ICF), 4, 5]
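A hedged sketch of the Index Compression Factor (ICF) from reference 3, assuming the common definition (n - s) / n, where n is the number of distinct words before stemming and s the number of distinct stems after; verify the exact formula against the cited paper:
from nltk.stem import PorterStemmer

words = {"connect", "connected", "connection", "connections", "connecting"}
stems = {PorterStemmer().stem(w) for w in words}
icf = (len(words) - len(stems)) / len(words)
print(len(words), "words ->", len(stems), "stems, ICF =", round(icf, 2))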
Phrase modelling
    1.
    ​Phrase Modeling - using gensim and spacy
Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than we would expect by random chance. The formula our phrase models use to determine whether two tokens A and B constitute a phrase is:
(count(A B) - count_min) * N / (count(A) * count(B)) > threshold
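A minimal gensim sketch of the phrase model above (gensim's default scorer follows this count-based formula); the tiny corpus and low threshold are illustrative only:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "city", "never", "sleeps"],
]
phrases = Phrases(sentences, min_count=1, threshold=0.1)   # learn co-occurrence statistics
bigram = Phraser(phrases)                                  # frozen, faster phrase detector
print(bigram[["i", "moved", "to", "new", "york"]])         # ['i', 'moved', 'to', 'new_york']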
    1.
    ​ SO on PE.​
Document classification
Hebrew NLP tools
    1.
    HebMorph - last updated 7 years ago
    4.
    Hebrew-NLP service - docs describe the features (morphological analysis, normalization, etc.), git
Semantic roles:

ANNOTATION

    1.
    ​Snorkel - using weak supervision to create less noisy labelled datasets
      1.
      ​Git​
      2.
      ​Medium​
    2.
    ​Snorkel metal weak supervision for multi-task learning. Conversation, git​
      1.
      Yes, the Snorkel project has included work before on hierarchical labeling scenarios. The main papers detailing our results include the DEEM workshop paper you referenced (https://dl.acm.org/doi/abs/10.1145/3209889.3209898) and the more complete paper presented at AAAI (https://arxiv.org/abs/1810.02840). Before the Snorkel and Snorkel MeTaL projects were merged in Snorkel v0.9, the Snorkel MeTaL project included an interface for explicitly specifying hierarchies between tasks which was utilized by the label model and could be used to automatically compile a multi-task end model as well (demo here: https://github.com/HazyResearch/metal/blob/master/tutorials/Multitask.ipynb). That interface is not currently available in Snorkel v0.9 (no fundamental blockers; just hasn't been ported over yet).
      2.
      There are, however, still a number of ways to model such situations. One way is to treat each node in the hierarchy as a separate task and combine their probabilities post-hoc (e.g., P(credit-request) = P(billing) * P(credit-request | billing)). Another is to treat them as separate tasks and use a multi-task end model to implicitly learn how the predictions of some tasks should affect the predictions of others (e.g., the end model we use in the AAAI paper). A third option is to create a single task with all the leaf categories and modify the output space of the LFs you were considering for the higher nodes (the deeper your hierarchy is or the larger the number of classes, the less appealing this is with respect to approaches 1 and 2).
    4.
    ​Mturk alternatives​
      1.
      2.
      ​Jobby​
      3.
      ​Shorttask​
      4.
      ​Samasource​
      5.
      Figure 8 - pricing - definitive guide
    7.
    Doccano - open source alternative to Prodigy, but with user management & statistics out of the box
    9.
    Loopr.ai - an AI-powered semi-automated and automated annotation process for high quality data. Object detection, analytics, NLP, active learning.
    16.
    ​Vader annotation​
      1.
      They must pass an English exam
      2.
      They get control questions to establish their reliability
      3.
      They get a few sentences over and over again to establish self-consistency (intra-annotator agreement)
      4.
      Two or more people get overlapping sentences to establish inter-annotator (dis)agreement
      5.
      5 judges for each sentence (which makes step 4 redundant)
      6.
      They don't know each other
      7.
      Simple rules to follow
      8.
      Random selection of sentences
      9.
      Even classes
      10.
      No experts
      11.
      Measuring reliability with kappa (Cohen's or Fleiss').
    17.
    ​Label studio ​
Ideas:
    1.
    Active learning for a group (or a single) of annotators; we have to wait for all annotations in each big batch to finish in order to retrain the model.
    2.
    Annotate a small group, then label automatically using KNN
    3.
    Find a nearest neighbor for our optimal set of keywords per category.
    4.
    For a group of keywords, find their KNN neighbors in w2v space; alternatively, find k clusters in w2v space that contain those keywords. For a new word / mean sentence vector, the 'category' annotation is the cluster with the minimal distance (under either approach).
    1.
    Myth One: One Truth - most data collection efforts assume that there is one correct interpretation for every input example.
    2.
    Myth Two: Disagreement Is Bad - to increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.
    3.
    Myth Three: Detailed Guidelines Help - when specific cases continuously cause disagreement, more instructions are added to limit interpretations.
    4.
    Myth Four: One Is Enough - most annotated examples are evaluated by one person.
    5.
    Myth Five: Experts Are Better - human annotators with domain knowledge provide better annotated data.
    6.
    Myth Six: All Examples Are Created Equal - the mathematics of using ground truth treats every example the same; either you match the correct result or not.
    7.
    Myth Seven: Once Done, Forever Valid - once human-annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.

Crowdsourcing

    Conclusions:
      Experts perform about the same as a crowd,
      which costs a lot less.

Inter agreement

    1.
    Cohen's kappa (two raters),
but you can use it for a group by calculating agreement for each pair.
    5.
    Kappa and its relation to accuracy (redundant, % above chance; should not be used due to other reasons researched here)
The Kappa statistic varies from 0 to 1, where:
    0 = agreement equivalent to chance.
    0.1 - 0.20 = slight agreement.
    0.21 - 0.40 = fair agreement.
    0.41 - 0.60 = moderate agreement.
    0.61 - 0.80 = substantial agreement.
    0.81 - 0.99 = near perfect agreement.
    1 = perfect agreement.
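A minimal sketch computing Cohen's kappa for two annotators with scikit-learn (labels are illustrative):
from sklearn.metrics import cohen_kappa_score

rater_a = ["pos", "neg", "pos", "pos", "neg", "neu"]
rater_b = ["pos", "neg", "neg", "pos", "neg", "neu"]
print(cohen_kappa_score(rater_a, rater_b))   # 1.0 = perfect agreement, ~0 = chance level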
    1.
    Fleiss' kappa, for 3 raters and above.
Kappa ranges from 0 to 1, where:
    0 is no agreement (or agreement that you would expect to find by chance),
    1 is perfect agreement.
    Fleiss' kappa is an extension of Cohen's kappa for three raters or more. In addition, the assumption with Cohen's kappa is that your raters are deliberately chosen and fixed. With Fleiss' kappa, the assumption is that your raters were chosen at random from a larger population.
    Kendall's Tau is used when you have ranked data, like two people ordering 10 candidates from most preferred to least preferred.
    Krippendorff's alpha is useful when you have multiple raters and multiple possible ratings.
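A minimal sketch of Fleiss' kappa with statsmodels; the ratings matrix (rows = items, columns = raters, values = category ids) is illustrative:
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [2, 2, 2],
    [0, 0, 0],
])
table, _ = aggregate_raters(ratings)   # item x category counts
print(fleiss_kappa(table))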
    1.
    Krippendorff's alpha
    1.
    MACE - the new kid on the block. Learns in an unsupervised fashion to (a) identify which annotators are trustworthy and (b) predict the correct underlying labels. We match performance of more complex state-of-the-art systems and perform well even under adversarial conditions.
    3.
    MACE does exactly that. It tries to find out which annotators are more trustworthy and upweighs their answers.
    4.
    Git - when evaluating redundant annotations (like those from Amazon's Mechanical Turk), we usually want to
    1.
    aggregate annotations to recover the most likely answer
    2.
    find out which annotators are trustworthy
    3.
    evaluate item and task difficulty
MACE solves all of these problems by learning competence estimates for each annotator and computing the most likely answer based on those competences.
Calculating agreement
    1.
    Compare against researcher-ground-truth
    2.
    Self-agreement
    3.
    Inter-agreement
      1.
      ​Medium​
      2.
      ​Kappa cohen
      4.
      GitHub code to compute Fleiss' Kappa 1
      6.
      ​GWET AC1, paper: as an alternative to kappa, and why
Machine Vision annotation
    1.
    ​CVAT​
Troubleshooting agreement metrics
    3.
    Interpreting agreement - accuracy, precision, kappa

CONVOLUTION NEURAL NETS (CNN)

    1.
    CNN for text - Tal Perry
    2.
    1D CNN using Keras
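A minimal sketch of a 1D CNN text classifier with TensorFlow's Keras API, in the spirit of the links above; vocabulary size, sequence length and layer sizes are placeholders:
from tensorflow.keras import layers, models

vocab_size, max_len, embed_dim = 20000, 100, 128

model = models.Sequential([
    layers.Input(shape=(max_len,)),               # integer-encoded token ids
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(128, 5, activation="relu"),     # 1D convolution over the token axis
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()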

KNOWLEDGE GRAPHS

    1.
    Automatic creation of a KG using spaCy and networkx. Knowledge graphs can be constructed automatically from text using part-of-speech and dependency parsing. The extraction of entity pairs from grammatical patterns is fast and scalable to large amounts of text using the NLP library spaCy (see the sketch after this list).
    3.
    Medium Series:
      1.
      ​Creating kg​
      3.
      ​Semantic models​
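A rough sketch of the idea above: pull naive (subject, object) pairs out of spaCy's dependency parse and store them in a networkx graph with the sentence root as the relation. The extraction rule here is illustrative, not the exact rules from the articles:
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
G = nx.DiGraph()

text = "Paris is the capital of France. Victor Hugo wrote Les Miserables."
for sent in nlp(text).sents:
    subj = next((t for t in sent if "subj" in t.dep_), None)                    # subject-like token
    obj = next((t for t in sent if t.dep_ in ("dobj", "pobj", "attr")), None)   # object-like token
    if subj is not None and obj is not None:
        G.add_edge(subj.text, obj.text, relation=sent.root.lemma_)

print(list(G.edges(data=True)))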

SUMMARIZATION

    2.
    With NLTK - words are assigned weighted frequencies, sentences are scored by summing them, and the top K scored sentences are selected (see the sketch after this list).
    10.
    ​Very short intro​
    12.
    Unsupervised methods using sentence embeddings (long and good) - using sent2vec, clustering, picking by rank
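A minimal sketch of the frequency-based extractive summarizer described in item 2 above: score each sentence by the sum of its normalized word frequencies and keep the top K (requires nltk's punkt and stopwords data):
import heapq
import nltk
from nltk.corpus import stopwords

def summarize(text, k=2):
    sents = nltk.sent_tokenize(text)
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha() and w.lower() not in stop]
    freq = nltk.FreqDist(words)
    max_f = max(freq.values())
    # sentence score = sum of normalized frequencies of its non-stopword words
    scores = {s: sum(freq[w.lower()] / max_f for w in nltk.word_tokenize(s) if w.lower() in freq)
              for s in sents}
    return " ".join(heapq.nlargest(k, scores, key=scores.get))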

​

SENTIMENT ANALYSIS

Databases

    2.
    Movie reviews: IMDB reviews dataset on Kaggle​
    3.
    SentiWordNet - mapping WordNet senses to a polarity model: SentiWordNet site

Tools

    1.
    Many sentiment tools
    4.
    TextBlob:
      2.
      ​Python code​
      3.
      ​More code​
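A minimal TextBlob sentiment sketch (polarity in [-1, 1], subjectivity in [0, 1]):
from textblob import TextBlob

blob = TextBlob("The movie was surprisingly good, though a bit long.")
print(blob.sentiment.polarity, blob.sentiment.subjectivity)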