One Hundred Interview Questions for Deep Learning
I have been interviewing for deep learning jobs recently, and I have been disappointed by the quality of the questions asked in those interviews.
So here is my list of 100 questions that an interviewer can ask in a deep learning interview.
Usage Guidelines
- The list aims to test breadth of knowledge rather than depth.
- If an interviewee cannot answer a question, I recommend moving on to the next one immediately rather than wasting precious interview time coaxing them toward the answer you had in mind.
- These questions are difficult, and one should not expect even excellent candidates to answer more than half of them in an interview setting (I certainly can’t!).
- The list covers a mix of popular techniques/architectures from the deep learning literature, as well as basic math/ML. I did my best to exclude anything that is too esoteric, too recent, or obsolete, but some such questions will no doubt slip through the cracks.
A major caveat to this list is that it is based on the breadth of my personal knowledge, and thus will reflect my own blind spots (I know very little about graph neural networks or reinforcement learning, for example). I welcome any suggestions that would make it better, including questions on those two topics.
If there are any questions you would like added to the list, or corrections you wish made, please email me at [firstname].[lastname]@columbia.edu. I will incorporate feedback into the list at my own time/discretion, thanks! (Also, please do not email me asking for answers. I might add them when I find some time.)
I hope this helps deep learning interviewers ask better questions, and helps interviewees cover their bases.
Questions
In no particular order, here are the questions:
1) What does it mean to regularize a model?
2) What is dropout? How does it differ between training and test time? (see the sketches after the list)
3) What are some extensions/variants of dropout?
4) What is batch normalization? How does it differ between training and test time?
5) Describe a few other normalization layers that are not batchnorm.
6) What are the pros and cons of these normalization layers? Why do we use layernorm and not batchnorm for Transformers?
7) What is the bias-variance trade-off? Does it only apply to the MSE loss?
8) Why is L2 regularization said to “convexify” a loss function?
9) What is double descent, and how is it different from the classical understanding of overparametrization?
10) What is the ReLU activation function?
11) What are some extensions/variants of ReLU? What’s your favorite activation function?
12) What is the logistic/sigmoid function? What is softmax?
13) What does the temperature of softmax refer to? If the ground truth is noisy, is it better to have a high temperature or a low one? (see the sketches after the list)
14) What is label smoothing?
15) What is mixup?
16) What are some common data augmentations used in computer vision? NLP? Speech?
17) What is transfer learning?
18) What is distillation?
19) What are FGSM and PGD in the context of adversarial attacks? What is a random restart?
20) What are some naive adversarial defenses and why do they fail?
21) What is a Generative Adversarial Network? Can you write down its loss function? (see the sketches after the list)
22) Are the generator and the discriminator in a GAN trained simultaneously or in alternating steps?
23) What are some extensions/variants of GANs?
24) What is an autoencoder?
25) What are some extensions/variants of autoencoders?
26) What is a normalizing flow? Can you give examples?
27) How do we evaluate generative models for images? Language? Speech?
28) What is the difference between a discriminative model and a generative model?
29) What are some main classes of deep generative models? Describe their pros and cons.
30) Describe the REINFORCE trick.
31) Describe the reparametrization trick.
32) What is the Gumbel-Softmax / Concrete distribution?
33) What is a recurrent neural network?
34) What is an LSTM/GRU and why do we prefer them over vanilla RNNs?
35) What is a Transformer and why do we prefer it over an LSTM in NLP?
36) What is the O(n^2) bottleneck in a Transformer and how can we do better?
37) What is a word embedding? How is BERT different from word2vec?
38) Describe different tokenization schemes used in NLP.
39) What is a convolutional neural network?
40) What does padding/stride/dilation refer to?
41) What are some extensions/variants of a convolutional layer/NN?
42) What is deconvolution? How do we minimize checkerboard artifacts?
43) What is a UNet?
44) What is DeepLab?
45) What is the difference between a ResNet and a DenseNet?
46) What is NeRF?
47) What is a Capsule network?
48) What is a Fourier Transform? What is the convolution theorem?
49) What does the convolution of two pdfs correspond to?
50) What is a WaveNet?
51) What is a Neural ODE?
52) What is a hypernetwork?
53) What is MAML?
54) What is neural architecture search?
55) Describe popular classes of methods for zero-shot/few-shot/meta learning.
56) What is an autoregressive model?
57) What is a mixture density network? How do we train it?
58) What is a mixture of experts?
59) What is a self-normalizing neural network?
60) What are some common ways of initializing the weights of a neural network?
61) Can you write down the formula for Xavier/Kaiming init? (see the sketches after the list)
62) If X and Y are independent random variables, can you write down the formula for Var(XY)? (see the sketches after the list)
63) What is the lottery ticket hypothesis?
64) Can you write down the pdf of a Gaussian distribution in the single variable case? (see the sketches after the list)
65) What are some nice properties of a Gaussian?
66) What is the Central Limit Theorem?
67) It is said that neural networks can be viewed as approximate Gaussian Processes. Why?
68) What is the Neural Tangent Kernel?
69) How does neural style transfer work?
70) What is a CycleGAN?
71) It is said that low layers of a CNN act like Gabor filters and edge detectors, while high layers pick up semantic content. How do we know this?
72) What is the binary cross entropy loss? Why do we prefer it over a 0-1 loss?
73) What are some common loss functions and metrics used in segmentation?
74) What does it mean to have a well-calibrated neural network?
75) What are some methods for calibrating deep classifiers?
76) What is stochastic gradient descent? Why do we subtract gradients rather than add them when training a neural network?
77) What is the vanishing / exploding gradients problem, and how do we deal with it?
78) What is backpropagation? Can you name some alternatives to backpropagation for training a neural network?
79) How does the Adam optimizer work?
80) What are some caveats to using Adam?
81) How do we calculate Hessian vector products efficiently?
82) What is Newton’s method for optimization?
83) What is the conjugate gradient method?
84) What is an eigenvector? Principal components are eigenvectors of what matrix?
85) What is the spectral theorem?
86) What is the rank of a matrix? What does low-rank matrix factorization refer to?
87) What is the condition number of a matrix? What is a well-conditioned matrix?
88) What is Jensen’s inequality? What is nice about a convex loss function?
89) What is power iteration? Can you give examples of power iteration used in deep learning?
90) What is a Relation Network?
91) What is a Deep Set?
92) What are some deep learning methods for learning to rank?
93) What is a Lipschitz function? Are deep networks Lipschitz?
94) What is domain adversarial training / confusion loss?
95) What is rejection sampling and inverse transform sampling? Why might they be inefficient?
96) How do we uniformly sample from the surface of an L2 sphere? What about from its interior?
97) What is importance sampling?
98) What are methods to train quantized neural networks?
99) Other than pre-training on ImageNet, what are some unsupervised/self-supervised methods for preparing a model for a downstream computer vision task?
100) What is co-training? What is self-training?
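Sketches for Selected Questions
For question 2, here is a minimal NumPy sketch of inverted dropout (the variant most frameworks implement). The function name, the p=0.5 default, and the assumption that x is a NumPy array are mine, not part of the question:

    import numpy as np

    def dropout(x, p=0.5, training=True, rng=None):
        # Inverted dropout: during training, zero each unit with probability p
        # and rescale the survivors by 1/(1 - p) so the expected activation is
        # unchanged; at test time the layer is simply the identity.
        if not training:
            return x
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random(x.shape) >= p
        return x * mask / (1.0 - p)

An equivalent formulation drops units without rescaling during training and instead multiplies activations by (1 - p) at test time.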
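For question 13, a sketch of softmax with a temperature parameter, again in NumPy with illustrative names:

    import numpy as np

    def softmax(logits, temperature=1.0):
        # A higher temperature flattens the distribution (closer to uniform);
        # a lower temperature sharpens it (closer to a one-hot argmax).
        # Subtracting the max is purely for numerical stability.
        z = np.asarray(logits, dtype=float) / temperature
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()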
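For question 21, one common way to write the original minimax GAN objective is

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]

In practice the generator is often trained with the non-saturating variant, maximizing \log D(G(z)) instead of minimizing \log(1 - D(G(z))).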
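For question 61, the forms usually quoted are, writing n_in and n_out for the fan-in and fan-out of a layer:

    W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\mathrm{in}} + n_{\mathrm{out}}}},\ \sqrt{\frac{6}{n_{\mathrm{in}} + n_{\mathrm{out}}}}\right) \quad \text{(Xavier/Glorot, uniform)}

    W \sim \mathcal{N}\left(0,\ \frac{2}{n_{\mathrm{in}}}\right) \quad \text{(Kaiming/He, for ReLU)}

Both are derived by asking that the variance of activations (and gradients) stay roughly constant across layers; the factor of 2 in Kaiming init accounts for ReLU zeroing half of its inputs in expectation.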
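For question 62, assuming X and Y are independent:

    \mathrm{Var}(XY) = \mathbb{E}[X^2]\,\mathbb{E}[Y^2] - \left(\mathbb{E}[X]\,\mathbb{E}[Y]\right)^2 = \mathrm{Var}(X)\,\mathrm{Var}(Y) + \mathrm{Var}(X)\,\mathbb{E}[Y]^2 + \mathrm{Var}(Y)\,\mathbb{E}[X]^2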
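For question 64, the density of a univariate Gaussian with mean \mu and variance \sigma^2 is

    f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)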