CS224N 2019 Assignment 2


Written: Understanding word2vec

Let’s have a quick refresher on the word2vec algorithm. The key insight behind word2vec is that ‘a word is known by the company it keeps’. Concretely, suppose we have a ‘center’ word $c$ and a contextual window surrounding $c$. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word $c$ is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.

Figure 1: The word2vec skip-gram prediction model with window size 2

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
$$P(O=o \mid C=c)=\frac{\exp(u_o^\top v_c)}{\sum_{w\in \mathrm{Vocab}}\exp(u_w^\top v_c)}\tag{1}$$

Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in$ Vocabulary.¹
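To make the notation concrete, here is a minimal NumPy sketch of Equation (1); the vocabulary size, embedding dimension, and the randomly initialized $U$ and $V$ are illustrative assumptions, not values from the assignment.

```python
# Toy softmax probability P(O = o | C = c) from Equation (1).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                   # hypothetical sizes for illustration
U = rng.normal(size=(dim, vocab_size))    # columns are 'outside' vectors u_w
V = rng.normal(size=(dim, vocab_size))    # columns are 'center' vectors v_w

def naive_softmax_prob(o, c):
    """P(O = o | C = c) for word indices o and c."""
    scores = U.T @ V[:, c]                # u_w^T v_c for every w in the vocabulary
    scores -= scores.max()                # stabilize the softmax numerically
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(naive_softmax_prob(o=3, c=7))
```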

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:
$$J_{\text{naive-softmax}}(v_c,o,U)=-\log P(O=o \mid C=c)\tag{2}$$
Another way to view this loss is as the cross-entropy² between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by our model in Equation (1).

Question a

Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that

$$-\sum_{w\in \mathrm{Vocab}} y_w\log(\hat{y}_w) = -\log(\hat{y}_o)\tag{3}$$

Ans for a

$$y_w=\begin{cases} 1, & w=o\\ 0, & w\neq o \end{cases}$$

$$-\sum_{w\in \mathrm{Vocab}} y_w\log(\hat{y}_w)=-y_o\log(\hat{y}_o)=-\log(\hat{y}_o)$$
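A quick numerical sanity check of this identity; the 5-word vocabulary and the particular $\hat{y}$ below are made up for illustration.

```python
# Cross-entropy against a one-hot y collapses to -log(y_hat[o]).
import numpy as np

y_hat = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # any predicted distribution
y = np.zeros(5)
y[2] = 1.0                                     # one-hot true distribution, o = 2
cross_entropy = -np.sum(y * np.log(y_hat))
print(np.isclose(cross_entropy, -np.log(y_hat[2])))   # True
```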

Question b

Compute the partial derivative of $J_{\text{naive-softmax}}(v_c,o,U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$.

Ans for b

$$
\begin{aligned}
\frac{\partial}{\partial v_c} J_{\text{naive-softmax}}
=& -\frac{\partial}{\partial v_c} \log P(O=o \mid C=c)\\
=& -\frac{\partial}{\partial v_c} \log \frac{\exp (u_o^\top v_c)}{\sum_{w=1}^V \exp (u_w^\top v_c)}\\
=& -\frac{\partial}{\partial v_c} \log \exp (u_o^\top v_c) + \frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp (u_w^\top v_c)\\
=& -u_o + \frac{1}{\sum_{w=1}^V \exp (u_w^\top v_c)} \frac{\partial}{\partial v_c}\sum_{x=1}^V \exp (u_x^\top v_c)\\
=& -u_o + \frac{1}{\sum_{w=1}^V \exp (u_w^\top v_c)} \sum_{x=1}^V \exp (u_x^\top v_c) \frac{\partial}{\partial v_c} u_x^\top v_c\\
=& -u_o + \frac{1}{\sum_{w=1}^V \exp (u_w^\top v_c)} \sum_{x=1}^V \exp (u_x^\top v_c)\, u_x\\
=& -u_o + \sum_{x=1}^V \frac{\exp (u_x^\top v_c)}{\sum_{w=1}^V \exp (u_w^\top v_c)}\, u_x\\
=& -u_o + \sum_{x=1}^V P(O=x \mid C=c)\, u_x\\
=& -U y + U \hat{y}\\
=& U(\hat{y} - y)
\end{aligned}
$$
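The result can be checked numerically. The sketch below compares $U(\hat{y}-y)$ against a centered finite-difference estimate of the loss gradient; the toy $U$, $v_c$, and outside-word index are assumptions for illustration only.

```python
# Gradient check for dJ/dv_c = U(y_hat - y).
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 10, 4
U = rng.normal(size=(dim, vocab_size))    # columns are outside vectors u_w
v_c = rng.normal(size=dim)
o = 3                                     # index of the true outside word

def loss(v):
    scores = U.T @ v
    scores -= scores.max()
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return -np.log(y_hat[o])

scores = U.T @ v_c
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()
y = np.zeros(vocab_size)
y[o] = 1.0
analytic = U @ (y_hat - y)

eps = 1e-6
numeric = np.array([(loss(v_c + eps * np.eye(dim)[i]) - loss(v_c - eps * np.eye(dim)[i])) / (2 * eps)
                    for i in range(dim)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```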

Question c

Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c,o,U)$ with respect to each of the ‘outside’ word vectors, $u_w$. There will be two cases: $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.

Ans for c

$$
\begin{aligned}
\frac{\partial}{\partial u_w}J_{\text{naive-softmax}}
=& -\frac{\partial}{\partial u_w}\log\frac{\exp (u_o^\top v_c)}{\sum_{m=1}^V \exp(u_m^\top v_c)}\\
=& -\frac{\partial}{\partial u_w} \log\exp (u_o^\top v_c)+ \frac{\partial}{\partial u_w}\log\sum_{m=1}^V \exp(u_m^\top v_c)
\end{aligned}
$$

When $w = o$:

$$
\begin{aligned}
\frac{\partial}{\partial u_o}J_{\text{naive-softmax}}
=& -v_c + \frac{1}{\sum_{m=1}^V \exp(u_m^\top v_c)}\sum_{n=1}^V \frac{\partial}{\partial u_o}\exp(u_n^\top v_c)\\
=& -v_c + \frac{1}{\sum_{m=1}^V \exp(u_m^\top v_c)} \frac{\partial}{\partial u_o}\exp(u_o^\top v_c)\\
=& -v_c + \frac{\exp(u_o^\top v_c)}{\sum_{m=1}^V \exp(u_m^\top v_c)}\,v_c\\
=& -v_c + P(O=o \mid C=c)\,v_c\\
=& (P(O=o \mid C=c) - 1)\,v_c
\end{aligned}
$$

When $w \neq o$:

$$
\begin{aligned}
\frac{\partial}{\partial u_w}J_{\text{naive-softmax}}
=& \frac{\partial}{\partial u_w}\log\sum_{m=1}^V \exp(u_m^\top v_c)\\
=& \frac{\exp(u_w^\top v_c)}{\sum_{m=1}^V \exp(u_m^\top v_c)}\,v_c\\
=& P(O=w \mid C=c)\,v_c\\
=& (P(O=w \mid C=c) - 0)\,v_c
\end{aligned}
$$

In summary:

$$\frac{\partial}{\partial u_w}J_{\text{naive-softmax}} = (\hat{y}_w-y_w)\,v_c$$
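Since each column gradient is $(\hat{y}_w - y_w)v_c$, the whole matrix gradient can be written as an outer product. A small sketch, assuming $U$ stores the outside vectors as columns ($d \times |\mathrm{Vocab}|$); the toy vectors below are illustrative only.

```python
# dJ/dU as the outer product v_c (y_hat - y)^T; column w equals (y_hat_w - y_w) v_c.
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim = 10, 4
v_c = rng.normal(size=dim)
y_hat = rng.random(vocab_size)
y_hat /= y_hat.sum()                       # any predicted distribution
y = np.zeros(vocab_size)
y[3] = 1.0                                 # one-hot true distribution

grad_U = np.outer(v_c, y_hat - y)          # shape (dim, vocab_size)
print(np.allclose(grad_U[:, 5], (y_hat[5] - y[5]) * v_c))   # True
```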

Question d

The sigmoid function is given by Equation 4:
$$\sigma(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}\tag{4}$$
Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a vector.

Ans for d

Let $y = e^x$. Then:
$$
\begin{aligned}
\frac{\partial}{\partial x}\sigma(x)
=& \frac{\partial}{\partial x} \frac{e^x}{e^x + 1}\\
=& \frac{\partial}{\partial y}\left(\frac{y}{y+1}\right)\frac{\partial}{\partial x}e^x\\
=& \frac{\partial}{\partial y}\left(1-\frac{1}{y+1}\right)\frac{\partial}{\partial x}e^x\\
=& \frac{1}{(y+1)^2}\,e^x\\
=& \frac{e^x}{(e^x + 1)^2}\\
=& \frac{e^x}{e^x+1}\cdot \frac{1}{e^x+1}\\
=& \frac{e^x}{e^x+1}\cdot \frac{e^x+1-e^x}{e^x+1}\\
=& \frac{e^x}{e^x+1}\left(1-\frac{e^x}{e^x+1}\right)\\
=& \sigma(x)\,(1-\sigma(x))
\end{aligned}
$$
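Because $x$ is a vector, $\sigma$ and its derivative apply elementwise. Here is a small numerical check of $\sigma'(x) = \sigma(x)(1-\sigma(x))$, with illustrative values.

```python
# Elementwise finite-difference check of the sigmoid derivative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(np.allclose(numeric, analytic))   # True
```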

Question e

Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \ldots, w_K$ and their outside vectors as $u_1, \ldots, u_K$. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:
$$J_{\text{neg-sample}}(v_c,o,U) =-\log(\sigma (u_o^\top v_c)) -\sum_{k=1}^{K}\log(\sigma(-u_k^\top v_c))\tag{5}$$
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.³

Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.

Ans for e

$$
\begin{aligned}
\frac{\partial}{\partial v_c}J_{\text{neg-sample}}
=& -\frac{\partial}{\partial v_c}\log(\sigma(u_o^\top v_c)) -\frac{\partial}{\partial v_c}\sum_{k=1}^K \log(\sigma(-u_k^\top v_c))\\
=& -\frac{1}{\sigma(u_o^\top v_c)}\frac{\partial}{\partial v_c}\sigma(u_o^\top v_c) -\sum_{k=1}^K \frac{1}{\sigma(-u_k^\top v_c)}\frac{\partial}{\partial v_c}\sigma(-u_k^\top v_c)\\
=& -\frac{1}{\sigma(u_o^\top v_c)}\sigma(u_o^\top v_c)(1-\sigma(u_o^\top v_c)) \frac{\partial}{\partial v_c}u_o^\top v_c -\sum_{k=1}^K\frac{1}{\sigma(-u_k^\top v_c)}\sigma(-u_k^\top v_c)(1-\sigma(-u_k^\top v_c))\frac{\partial}{\partial v_c}(-u_k^\top v_c)\\
=& (\sigma(u_o^\top v_c)-1)\,u_o +\sum_{k=1}^K(1-\sigma(-u_k^\top v_c))\,u_k
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial}{\partial u_o}J_{\text{neg-sample}}
=& -\frac{\partial}{\partial u_o}\log(\sigma(u_o^\top v_c)) -\frac{\partial}{\partial u_o}\sum_{k=1}^K\log(\sigma(-u_k^\top v_c))\\
=& -\frac{\partial}{\partial u_o}\log(\sigma(u_o^\top v_c))\\
=& -\frac{1}{\sigma(u_o^\top v_c)}\frac{\partial}{\partial u_o}\sigma(u_o^\top v_c)\\
=& -\frac{1}{\sigma(u_o^\top v_c)}\sigma(u_o^\top v_c)(1-\sigma(u_o^\top v_c))\frac{\partial}{\partial u_o}u_o^\top v_c\\
=& (\sigma(u_o^\top v_c) - 1)\,v_c
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial}{\partial u_k}J_{\text{neg-sample}}
=& -\frac{\partial}{\partial u_k}\log(\sigma(u_o^\top v_c)) -\frac{\partial}{\partial u_k}\sum_{x=1}^K\log(\sigma(-u_x^\top v_c))\\
=& -\frac{\partial}{\partial u_k}\log(\sigma(-u_k^\top v_c))\\
=& -\frac{1}{\sigma(-u_k^\top v_c)}\frac{\partial}{\partial u_k}\sigma(-u_k^\top v_c)\\
=& -\frac{1}{\sigma(-u_k^\top v_c)}\sigma(-u_k^\top v_c)(1-\sigma(-u_k^\top v_c))\frac{\partial}{\partial u_k}(-u_k^\top v_c)\\
=& (1-\sigma(-u_k^\top v_c))\,v_c
\end{aligned}
$$

This loss function is much cheaper to compute because it only requires evaluating the sigmoid of $K+1$ dot products (one for $u_o$ and one for each negative sample), instead of summing over every word in the vocabulary as the naive-softmax denominator requires.
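A minimal NumPy sketch of Equation (5) and the three gradients derived above; the dimensions, sample count $K$, and random vectors are illustrative assumptions rather than the assignment’s actual setup.

```python
# Negative-sampling loss and its gradients for one (center, outside) pair.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
dim, K = 4, 5
v_c = rng.normal(size=dim)                 # center vector
u_o = rng.normal(size=dim)                 # true outside vector
U_neg = rng.normal(size=(K, dim))          # rows are the K negative-sample vectors u_k

loss = -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-U_neg @ v_c)).sum()

grad_v_c = (sigmoid(u_o @ v_c) - 1) * u_o + U_neg.T @ (1 - sigmoid(-U_neg @ v_c))
grad_u_o = (sigmoid(u_o @ v_c) - 1) * v_c
grad_u_k = np.outer(1 - sigmoid(-U_neg @ v_c), v_c)   # row k is (1 - sigma(-u_k^T v_c)) v_c

print(loss, grad_v_c.shape, grad_u_o.shape, grad_u_k.shape)
```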

Question f

Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
$$J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U) =\sum_{-m\leq j \leq m,\, j\neq 0}J(v_c,w_{t+j},U)\tag{6}$$
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c,w_{t+j},U)$ could be $J_{\text{naive-softmax}}(v_c,w_{t+j},U)$ or $J_{\text{neg-sample}}(v_c,w_{t+j},U)$, depending on your implementation.

Write down three partial derivatives:

  1. $\frac{\partial J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U)}{\partial U}$
  2. $\frac{\partial J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U)}{\partial v_c}$
  3. $\frac{\partial J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U)}{\partial v_w}$ when $w \neq c$

Write your answers in terms of $\frac{\partial J(v_c,w_{t+j},U)}{\partial U}$ and $\frac{\partial J(v_c,w_{t+j},U)}{\partial v_c}$. This is very simple - each solution should be one line.

Ans for f

$$
\begin{aligned}
\frac{\partial}{\partial U}J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U) =& \sum_{-m\leq j\leq m,\, j\neq 0}\frac{\partial J(v_c, w_{t+j}, U)}{\partial U}\\
\frac{\partial}{\partial v_c}J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U) =& \sum_{-m\leq j\leq m,\, j\neq 0}\frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}\\
\frac{\partial}{\partial v_w}J_{\text{skip-gram}}(v_c,w_{t-m},\ldots,w_{t+m},U) =& \ 0 \quad (w \neq c)
\end{aligned}
$$
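A short sketch of how these window-level gradients could be accumulated in practice, using the naive-softmax pair loss from parts (b) and (c) as the inner $J$; the function and variable names are illustrative, not the assignment’s starter-code API.

```python
# Accumulate per-pair losses and gradients over one context window (Equation 6).
import numpy as np

def naive_softmax_pair_grads(v_c, o, U):
    """Loss and gradients for one (center, outside) pair under Equation (2).
    U holds the outside vectors as columns (dim x |Vocab|)."""
    scores = U.T @ v_c
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    loss = -np.log(y_hat[o])
    return loss, U @ (y_hat - y), np.outer(v_c, y_hat - y)

def skipgram_gradients(center_idx, outside_indices, V, U, pair_grads=naive_softmax_pair_grads):
    """Sum the per-pair loss and gradients over the window; V, U are (dim, vocab)."""
    loss, grad_V, grad_U = 0.0, np.zeros_like(V), np.zeros_like(U)
    v_c = V[:, center_idx]
    for o in outside_indices:                  # the outside words w_{t+j}, j != 0
        l, d_vc, d_U = pair_grads(v_c, o, U)
        loss += l
        grad_V[:, center_idx] += d_vc          # v_w for w != c gets zero gradient
        grad_U += d_U
    return loss, grad_V, grad_U

# toy usage with random embeddings (illustrative only)
rng = np.random.default_rng(4)
V = rng.normal(size=(4, 10))
U = rng.normal(size=(4, 10))
print(skipgram_gradients(center_idx=7, outside_indices=[2, 5, 5, 9], V=V, U=U)[0])
```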


  1. Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the ‘outside’ word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the ‘center’ word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word. ↩︎

  2. The Cross Entropy Loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_{i}p_i \log(q_i)$. ↩︎

  3. Note: the loss function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our assignment code. Ultimately, this is the same objective function. ↩︎

