This blog compiles some points that might be encountered in ML / NLP interviews. Good luck (for myself and to all readers finding jobs).

- Statistics
- Probability
- Causality, Fair ML
- Models
- Neural Network
- Training
- Natural Languages Processing specific
- NLP Basics
- Information theory
- Embedding
- Parsing

## Statistics

- Evaluating a model. TP / FP / TN / FN
- Type 1 (false positive) and type 2 errors (false negative).
- Precision (TP / (TP + FP)) vs. Recall (TP / (TP + FN)).
- Sensitivity / specificity.
- F1 score definitions. Micro / macro / weighted.
- ROC: receiver operating characteristic) curve. True positive rate vs. false positive rate at various threshold settings.
- AUC: area under (the ROC) curve.
- Bias-variance tradeoff. How to decide whether the model is overfitting, looking from learning plot?
- High bias: overfit. Solve with more training.
- High variance: too flexible; low generalizability. Solve with (1) regularization (2) bagging algorithm (3) using less and more simple features.

- LOOCV, k-fold. They are designed to counteract the bias!

## Probability

- Bayes Rule: posterior = likelihood * prior / evidence.
- How to compute evidence? Sum them up / integral / etc.
- How to compute posterior when it is almost impossible to do the above steps? Variational inference (to maximize the variational lower bound) or Monte Carlo sampling (e.g., Metropolis-Hastings algorithm, Gibbs sampling).
- Probability distribution functions (pdf).
- Exponential family distribution functions. Conjugate priors (Beta prior + binomial distribution; Dirichlet prior + multinomial distribution)
- Mutual information
- Conditional independence $P(X|Y) = P(Y)$
- Bayesian Network, d-separation.
- Variable elimination.

## Causality and Fair Learning

- Randomized control trials.
- Rubin-Neyman causal model.
- Traditional deconfounding approaches: residualization, inverse probability weighting.
- Fair Learning: unfairness metrics (demographics parity, equalized opportunity)
- Direct optimize unfairness: (Zemel et al., 2013)
- Adversarial components: (Madras et al., 2018)

## Models

- Supervised learning models: SVM, Random Forest, Gaussian Process, LogReg, Adaboost, Naive Bayes.
- Unsupervised: kNN, clustering, domain adaptation, self-supervised embedding.
- Visualization: PCA, T-SNE.
- Regression vs classification.
- Kernel method. Wikipedia
- Decision trees are suitable for nonlinear data Q7.
- Explain boosting, gradient boosting, GBDT.
- CRF is discriminative (measures conditional probability) whereas HMM is generative (measures joint probability).

## Neural Network

- Long Short Term Memory (LSTM)

First there are three gates: forget, information, and output.Then there’s a G gate updating the cell states. For output we have: - Transformer (Vaswani et al., 2017) Key, Query, Value “gates” per head, multi-head attention, positional encoding.
- Activations. ReLU vs. Sigmoid vs. Tanh.
- Gradient explosion / gradient vanishing. How to address them? (Use ReLU; LSTM gating + gradient clipping etc.)
- Batch normalization (Ioffe and Szegedy, 2015). Purpose is to address internal covariance shift. Need an additional “scale and shift back” step.

## Training

- Cross-entropy loss or L2 loss?
- Dropout. Explain them.
- Gradient optimizer: GD, SGD.
- Adagrad: adjust the gradients by . A previous large gradient corresponds to a smaller step.
- Momentum method: use gradients to change the velocity of the weights changing.
- Nesterov momentum: First jump, then measure gradients, and then make a correction.
- RMSprop: divide the gradient by a running average of its recent magnitude.
- Adam optimizer: a combination of Adagrad and RMSprop.
- See Hinton’s slides for more explainations.

- Regularization. L1 / L2. L1 (“Least Absolute Shrinkage and Selection Operator”, LASSO) corresponds to a Laplacian prior, with lots of weights around 0, hence encourage sparcity, hence are regarded as having built-in feature selection mechanism. L2 corresponds to a Gaussian prior.
- Handling imbalance classes.
- Handling missing data.

## Open questions

- What’s a favourite paper you recently read?
- What do you plan to do? (short-term / long-term research goals)

## NLP-basics

- Techniques for keyword normalization? Lemmatization and stemming.
- Techniques for string matching? Levenshtein (i.e., edit distance), soundex, and metaphone.
- Regular Expression (examples in Python).
`.`

(single dot) matches any character.`.*`

matches any character sequences of any length.`[0-9]`

matches any one digit.`[a-z]`

matches any one lower-case alphabet.

Brackets need to be escaped. e.g.,`\(`

,`\]`

, if you want to match them.`\`

need to be specially taken care of, in Python. - N-gram: Continuous N words as a bag.
- Zipf’s law. In a corpus, the frequency of a word is inversely proportional to its rank (by frequency).

## Information theory

- Entropy
- KL divergence
- Jensen-Shannon divergence

## NLP-Feature extractions and embeddings

- TF-IDF of the term = where term frequency is the term’s frequency, and document frequency is the freq of documents containing the term.
- Popular word embedding methods? GloVe, Word2Vec, FastText, ELMo.
- GloVe (Pennington et al., 2014): a global log-bilinear regression model trained on word-word co-occurrence counts.
- Word2Vec (Mikolov et al., 2013): Basically a two-layer networks predicting the context of a word with (skip-gram), or predict the projected word given its contexts with (CBOW). The output is written as in the above equations. Use the first layer output as word embedding.
- FastText (Facebook AI Research): An optimized implementation of Word2Vec, but incorporated sub-word information (e.g., including word components of lengths from 3 to 5 as)
- ELMo (Peters et al. 2018): Use intermediate layers outputs of Bi-LSTMs in additional to word embeddings as new embeddings.
- BERT (Devlin et al. 2018): Mask 15% of tokens. Predict them based on their contexts using Transformer.

- Topic modeling: LDA.

## NLP-Parsing

- Shift-reduce parser.
- Parsing: dependency vs. constituency parsing.

## NLP-Misc

- Word sense disambiguation example: Lesk algorithm. Compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood.
- Recommendation system: HITS, PageRank, collaborative filtering.