Must Know Tips/tricks in DNN


Deep Neural Networks, especially Convolutional Neural Networks (CNNs), allow computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in visual object recognition, object detection, text recognition and many other domains, such as drug discovery and genomics.

In addition, many solid papers have been published on this topic, and some high-quality open-source CNN software packages have been made available. There are also well-written CNN tutorials and CNN software manuals. However, there is still a lack of a recent and comprehensive summary of the details needed to implement an excellent deep convolutional neural network from scratch. Thus, we collected and summarized many implementation details for DCNNs. Here we will introduce these implementation details, i.e., tricks or tips, for building and training your own deep networks.

Introduction

We assume you already have basic knowledge of deep learning, and here we will present the implementation details (tricks or tips) of Deep Neural Networks, especially CNNs for image-related tasks, mainly in eight aspects: 1) data augmentation; 2) pre-processing of images; 3) initialization of networks; 4) some tips during training; 5) selection of activation functions; 6) diverse regularizations; 7) some insights found from figures; and finally 8) methods for ensembling multiple deep networks.

Additionally, the corresponding slides are available at [slide]. If there are any problems/mistakes in these materials and slides, or there is something important/interesting you think should be added, please feel free to contact me.

Sec. 1: Data Augmentation

Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, if the original image data set contains a limited number of training images, it is better to do data augmentation to boost performance. In fact, data augmentation is a must when training a deep network.

  • There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. Moreover, you could try combinations of multiple different processings, e.g., rotation and random scaling at the same time. In addition, you can raise the saturation and value (the S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also, you could add a value between [-0.1, 0.1] to the hue (the H component of HSV) of all pixels in the image/patch.

  • Krizhevsky et al. [1] proposed fancy PCA when training the famous AlexNet in 2012. Fancy PCA alters the intensities of the RGB channels in training images. In practice, you first perform PCA on the set of RGB pixel values throughout your training images. Then, for each training image, you add the following quantity to each RGB pixel $I_{xy}=[I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$: $[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3]^T$, where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3\times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Please note that each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again; that is, when the model meets the same training image again, it randomly produces another $\alpha_i$ for data augmentation. In [1], the authors claimed that “fancy PCA could approximately capture an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination”. Regarding classification performance, this scheme reduced the top-1 error rate by over 1% in the ImageNet 2012 competition. (A minimal sketch follows this list.)
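
A minimal NumPy sketch of fancy PCA, under the assumptions that pixels is an array of RGB pixel values of shape (NumPixels, 3) collected from the training images and img is a single H×W×3 training image (both names are just for illustration):

>>> pixels = pixels - np.mean(pixels, axis=0)            # zero-center the RGB values
>>> cov = np.dot(pixels.T, pixels) / pixels.shape[0]     # 3x3 covariance matrix of RGB values
>>> eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues and eigenvectors (columns)
>>> alpha = np.random.normal(0, 0.1, 3)                  # drawn once per image per training pass
>>> delta = np.dot(eigvecs, alpha * eigvals)             # [p1, p2, p3][a1*l1, a2*l2, a3*l3]^T
>>> img_aug = img + delta                                # add the same offset to every pixel's RGB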

Sec. 2: Pre-Processing

Now we have obtained a large number of training samples (images/crops), but please do not hurry! Actually, it is necessary to do pre-processing on these images/crops. In this section, we will introduce several approaches for pre-processing.

The first and simplest pre-processing approach is to zero-center the data and then normalize it, which can be done with a few lines of Python code as follows:

>>> import numpy as np
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize

where X is the input data (NumIns × NumDim). Another form of this pre-processing normalizes each dimension so that the min and max along each dimension are -1 and 1, respectively. It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units), but they should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in the range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.
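
For completeness, a minimal sketch of that min-max variant (each dimension scaled to [-1, 1]), under the same assumption that X has shape (NumIns, NumDim):

>>> X_min = X.min(axis=0)                                # per-dimension minimum
>>> X_max = X.max(axis=0)                                # per-dimension maximum
>>> X_scaled = 2.0 * (X - X_min) / (X_max - X_min) - 1.0 # each dimension now lies in [-1, 1]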

Another pre-processing approach similar to the first one is PCA Whitening. In this process, the data is first centered as described above. Then, you can compute the covariance matrix that tells us about the correlation structure in the data:

>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix

After that, you decorrelate the data by projecting the original (but zero-centered) data into the eigenbasis:

>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data

The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the eigenvalue to normalize the scale:

>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the eigenvalues (which are square roots of the singular values)

Note that we add 1e-5 (or a small constant) here to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. In practice, this can be mitigated by stronger smoothing (i.e., increasing 1e-5 to a larger number).

Please note that we describe these pre-processing steps here just for completeness. In practice, these transformations are not used with Convolutional Neural Networks. However, it is still very important to zero-center the data, and it is common to see normalization of every pixel as well.

Sec. 3: Initializations

Now the data is ready. However, before you begin training the network, you have to initialize its parameters.

All Zero Initialization

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the “best guess” in expectation. But this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo exactly the same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

Initialization with Small Random Numbers

Thus, you still want the weights to be very close to zero, but not identically zero. In this case, you can initialize these neurons with small random numbers that are very close to zero, which is referred to as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for the weights might simply look like $weights \sim 0.001\times N(0,1)$, where $N(0,1)$ is a zero-mean, unit-standard-deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.
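
For example, a minimal sketch of this initialization for a single fully-connected layer, where n_in and n_out are hypothetical numbers of input and output units:

>>> W = 0.001 * np.random.randn(n_in, n_out)  # small Gaussian numbers close to zero
>>> b = np.zeros(n_out)                       # biases are commonly just set to zero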

Calibrating the Variances

One problem with the above suggestion is that the distribution of the outputs of a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:

>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)

where “randn” is the aforementioned Gaussian and “n” is the number of its inputs. This ensures that all neurons in the network initially have approximately the same output distribution, and empirically improves the rate of convergence. The detailed derivations can be found on pages 18 to 23 of the slides. Please note that the derivations do not consider the influence of ReLU neurons.

Current Recommendation

As mentioned above, the previous initialization by calibrating the variances of neurons does not consider ReLUs. A more recent paper on this topic by He et al. [4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be 2.0/n:

>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation

which is the current recommendation for use in practice, as discussed in [4].

Sec. 4: During Training

Now, everything is ready. Let’s start to train deep networks!

  • Filters and pooling size. During training, input image sizes are preferred to be powers of 2, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., the commonly used ImageNet), 384 or 512, etc. Moreover, it is important to employ small filters (e.g., 3×3) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters, but also improves the accuracy of the whole deep network. Meanwhile, a special case of the above, i.e., a 3×3 filter with stride 1 and 1-pixel zero-padding, preserves the spatial size of images/feature maps (the output-size arithmetic is sketched right after this list). For the pooling layers, the commonly used pooling size is 2×2.

  • Learning rate. In addition, as described in a blog post by Ilya Sutskever [2], he recommends dividing the gradients by the mini-batch size. Thus, you should not always change the learning rate (LR) when you change the mini-batch size. For obtaining an appropriate LR, utilizing the validation set is an effective way. Usually, a typical value of the LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5) and keep going, which might give you a pleasant surprise.

  • Fine-tune on pre-trained models. Nowadays, many state-of-the-art deep networks are released by famous research groups, e.g., the Caffe Model Zoo and the VGG Group. Thanks to the wonderful generalization abilities of pre-trained deep models, you can employ these pre-trained models for your own applications directly. To further improve the classification performance on your data set, a very simple yet effective approach is to fine-tune the pre-trained models on your own data. As shown in the following table, the two most important factors are the size of the new data set (small or big) and its similarity to the original data set. Different strategies of fine-tuning can be utilized in different situations. For instance, a good case is when your new data set is very similar to the data used for training the pre-trained models. In that case, if you have very little data, you can just train a linear classifier on the features extracted from the top layers of the pre-trained models (a minimal sketch of this case follows the table below). If you have quite a lot of data at hand, please fine-tune a few top layers of the pre-trained models with a small learning rate. However, if your own data set is quite different from the data used in the pre-trained models but you have enough training images, a larger number of layers should be fine-tuned on your data, also with a small learning rate, to improve performance. However, if your data set not only contains little data but is also very different from the data used in the pre-trained models, you will be in trouble. Since the data is limited, it seems better to only train a linear classifier. Since the data set is very different, it might not be best to train the classifier on features from the top of the network, which contain more dataset-specific features. Instead, it might work better to train an SVM classifier on activations/features from somewhere earlier in the network.
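
As a quick check of the filter-size remark in the first bullet above, a minimal sketch of the standard output-size arithmetic for a convolutional layer:

>>> in_size, k, stride, pad = 224, 3, 1, 1            # 3x3 filter, stride 1, 1-pixel zero-padding
>>> out_size = (in_size + 2 * pad - k) // stride + 1  # = 224, the spatial size is preserved
>>> pooled_size = out_size // 2                       # a 2x2 pooling with stride 2 halves it to 112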

[Table: fine-tuning strategies on pre-trained models]

Fine-tune your data on pre-trained models. Different strategies of fine-tuning are utilized in different situations. As for data sets, Caltech-101 is similar to ImageNet, since both are object-centric image data sets, while the Place Database is different from ImageNet, since one is scene-centric and the other is object-centric.
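
As referenced above, a minimal PyTorch-style sketch of the “similar data set, little data” case (train only a linear classifier on top of frozen pre-trained features); the use of torchvision's ResNet-18 and the variable num_classes are assumptions for illustration only:

>>> import torch, torchvision
>>> model = torchvision.models.resnet18(pretrained=True)           # an assumed pre-trained model
>>> for p in model.parameters(): p.requires_grad = False           # freeze all pre-trained layers
>>> model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # new linear classifier on top
>>> optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001)   # only the new layer is updated, small LR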

Sec. 5: Activation Functions

One of the crucial factors in deep networks is the activation function, which brings non-linearity into the networks. Here we will introduce the details and characteristics of some popular activation functions, and give advice later in this section.

[Figure: neuron model]

Figures courtesy of Stanford CS231n.

Sigmoid

[Figure: the sigmoid function]

The sigmoid non-linearity has the mathematical form $\sigma(x)=1/(1+e^{-x})$. It takes a real-valued number and “squashes” it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks:

  1. Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero. Recall that during back-propagation, this (local) gradient will be multiplied by the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively “kill” the gradient and almost no signal will flow through the neuron to its weights and, recursively, to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large, then most neurons will become saturated and the network will barely learn.

  2. Sigmoid outputs are not zero-centered. This is undesirable, since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g., $x>0$ element-wise in $f=w^Tx+b$), then the gradients on the weights $w$ will, during back-propagation, become either all positive or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.

tanh(x)

[Figure: the tanh function]

The tanh non-linearity squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity.

Rectified Linear Unit

[Figure: the ReLU function]

The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function $f(x)=\max(0,x)$, which is simply thresholded at zero.

There are several pros and cons to using the ReLUs:

  1. (Pros) Compared to sigmoid/tanh neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero. Meanwhile, ReLUs do not suffer from saturation.

  2. (Pros) It was found to greatly accelerate (e.g., a factor of 6 in [1]) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

  3. (Cons) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e., neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

Leaky ReLU

[Figure: the Leaky ReLU function]

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when $x<0$, a leaky ReLU instead has a small negative slope (of 0.01, or so). That is, the function computes $f(x)=\alpha x$ if $x<0$ and $f(x)=x$ if $x\geq 0$, where $\alpha$ is a small constant. Some people report success with this form of activation function, but the results are not always consistent.
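
A minimal NumPy sketch of ReLU and Leaky ReLU applied to an array of pre-activations x (the name is just for illustration):

>>> relu_out = np.maximum(0, x)                 # ReLU: threshold at zero
>>> alpha = 0.01                                # small fixed slope for the negative part
>>> leaky_out = np.where(x >= 0, x, alpha * x)  # Leaky ReLU: x if x >= 0, alpha * x otherwise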

Parametric ReLU

Nowadays, a broader class of activation functions, namely the rectified unit family, has been proposed. In the following, we will talk about the variants of ReLU.

[Figure: the ReLU family]

ReLU, Leaky ReLU, PReLU and RReLU. In these figures, $\alpha_i$ is learned for PReLU and fixed for Leaky ReLU. For RReLU, $\alpha_{ji}$ is a random variable that is sampled from a given range during training and remains fixed during testing.

The first variant is called the parametric rectified linear unit (PReLU) [4]. In PReLU, the slopes of the negative part are learned from data rather than pre-defined. He et al. [4] claimed that PReLU is the key factor in surpassing human-level performance on the ImageNet classification task. The back-propagation and updating process of PReLU is very straightforward and similar to that of the traditional ReLU, as shown on page 43 of the slides.

Randomized ReLU

The second variant is called the randomized rectified linear unit (RReLU). In RReLU, the slopes of the negative parts are randomized within a given range during training, and then fixed during testing. As mentioned in [5], in a recent Kaggle National Data Science Bowl (NDSB) competition, it was reported that RReLU could reduce overfitting due to its randomized nature. Moreover, as suggested by the NDSB competition winner, the random $\alpha_i$ in training is sampled from $1/U(3,8)$ and at test time it is fixed as its expectation, i.e., $2/(l+u)=2/11$.

In [5], the authors evaluated the classification performance of two state-of-the-art CNN architectures with different activation functions on the CIFAR-10, CIFAR-100 and NDSB data sets, which is shown in the following tables. Please note that, for these two networks, an activation function follows each convolutional layer. The $a$ in these tables actually indicates $1/\alpha$, where $\alpha$ is the aforementioned slope.

[Tables: results of different activation functions on CIFAR-10, CIFAR-100 and NDSB]

From these tables, we can see that the performance of ReLU is not the best on any of the three data sets. For Leaky ReLU, a larger slope $\alpha$ achieves better accuracy. PReLU easily overfits on small data sets (its training error is the smallest, while its testing error is not satisfactory), but it still outperforms ReLU. In addition, RReLU is significantly better than the other activation functions on NDSB, which shows that RReLU can overcome overfitting, because this data set has less training data than CIFAR-10/CIFAR-100. In conclusion, all three ReLU variants consistently outperform the original ReLU on these three data sets, and PReLU and RReLU seem to be better choices. Moreover, He et al. reported similar conclusions in [4].

Sec. 6: Regularizations

There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

  • L2 regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2\lambda w$. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.

  • L1 regularization is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda |w|$ to the objective. It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1 |w|+\lambda_2 w^2$ (this is called Elastic net regularization). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e., very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

  • Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\|\vec{w}\|_2 < c$. Typical values of $c$ are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rates are set too high, because the updates are always bounded.

  • Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. [6] that complements the other methods (L1, L2, max norm). During training, dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent, because they share parameters.) During testing, there is no dropout applied, which can be interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section). In practice, a dropout ratio of $p=0.5$ is a reasonable default, but this can be tuned on validation data. (A minimal implementation sketch follows the figure below.)

[Figure: dropout]

Dropout [6] is one of the most popular regularization techniques. During training, dropout is implemented by keeping a neuron active only with some probability $p$ (a hyper-parameter), and setting it to zero otherwise. In addition, Google applied for a US patent for dropout in 2014.
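
As referenced above, a minimal NumPy sketch of the commonly used “inverted dropout” implementation for training, assuming h is an array of hidden activations (the name is an illustration); scaling the mask by 1/p at training time means no extra scaling is needed at test time:

>>> p = 0.5                                   # probability of keeping a unit active
>>> mask = (np.random.rand(*h.shape) < p) / p # binary dropout mask, scaled by 1/p
>>> h_train = h * mask                        # randomly drop units during training only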

Sec. 7: Insights from Figures

Finally, using the tips above, you can obtain satisfactory settings (e.g., data processing, architecture choices and details, etc.) for your own deep networks. During training, you can draw some figures to monitor your networks' training effectiveness.

  • As we know, the learning rate is very sensitive. As shown in Fig. 1 below, a very high learning rate will cause a quite strange loss curve. A low learning rate will make your training loss decrease very slowly, even after a large number of epochs. In contrast, a high learning rate will make the training loss decrease fast at the beginning, but it will also drop into a local minimum, so your network might not achieve satisfactory results in that case. For a good learning rate, as shown by the red line in Fig. 1, the loss curve decreases smoothly and finally achieves the best performance.

  • Now let’s zoom in on the loss curve. An epoch means a single pass over the training data, so there are multiple mini-batches in each epoch. If we plot the classification loss for every training batch, the curve looks like Fig. 2. Similar to Fig. 1, if the trend of the loss curve looks too linear, it indicates that your learning rate is low; if it does not decrease much, the learning rate might be too high. Moreover, the “width” of the curve is related to the batch size. If the “width” looks too wide, that is to say the variance between batches is too large, which suggests you should increase the batch size.

  • Another tip comes from the accuracy curve. As shown in Fig. 3, the red line is the training accuracy and the green line is the validation accuracy. When the validation accuracy converges, the gap between the red line and the green one shows the effectiveness of your deep network. If the gap is big, it indicates that your network gets good accuracy on the training data while achieving only low accuracy on the validation set; it is obvious that your deep model overfits the training set. Thus, you should increase the regularization strength of the deep network. However, no gap together with a low accuracy level is not a good thing either: it shows that your deep model has low learning capability. In that case, it is better to increase the model capacity for better results.

[Figures 1-3: loss curves with different learning rates, the per-batch loss curve, and the training/validation accuracy curves]

Sec. 8: Ensemble

In machine learning, ensemble methods [8] that train multiple learners and then combine them for use are a kind of state-of-the-art learning approach. It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have already achieved great success in many real-world tasks. In practical applications, especially challenges or competitions, almost all the first-place and second-place winners used ensemble methods.

Here we introduce several skills for ensemble in the deep learning scenario.

  • Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.

  • Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g., 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it does not require additional retraining of models after cross-validation. Actually, you could directly select several state-of-the-art deep models from the Caffe Model Zoo to perform the ensemble.

  • Different checkpoints of a single model. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but it can still work reasonably well in practice. The advantage of this approach is that it is very cheap. (A minimal averaging sketch follows this list.)

  • Some practical examples. If your vision tasks involve high-level image semantics, e.g., event recognition from still images, a better ensemble method is to employ multiple deep models trained on different data sources to extract different and complementary deep representations. For example, in the Cultural Event Recognition challenge associated with ICCV’15, we utilized five different deep models trained on images from ImageNet, the Place Database and the cultural images supplied by the competition organizers. After that, we extracted five complementary deep features and treated them as multi-view data. Combining the “early fusion” and “late fusion” strategies described in [7], we achieved one of the best performances and ranked 2nd place in that challenge. Similar to our work, [9] presented the Stacked NN framework to fuse multiple deep networks at the same time.
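
As referenced in the third bullet, a minimal NumPy sketch of the simplest fusion scheme: averaging the predicted class probabilities of all ensemble members, assuming probs_list is a list of arrays of shape (NumIns, NumClasses):

>>> probs_avg = np.mean(np.stack(probs_list, axis=0), axis=0) # average the members' softmax outputs
>>> predictions = np.argmax(probs_avg, axis=1)                # final ensemble prediction per instance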

Miscellaneous

In real-world applications, the data is usually class-imbalanced: some classes have a large number of images/training instances, while others have a very limited number of images. As discussed in a recent technical report [10], when deep CNNs are trained on such imbalanced training sets, the results show that imbalanced training data can potentially have a severely negative impact on the overall performance of deep networks. For this issue, the simplest method is to balance the training data by directly up-sampling and down-sampling the imbalanced data, as shown in [10]. Another interesting solution is a special kind of crop processing used in our challenge solution [7]. Because the original cultural event images were imbalanced, we extracted crops only from the classes with a small number of training images, which on one hand supplies diverse data sources, and on the other hand solves the class-imbalance problem. In addition, you can adjust the fine-tuning strategy to overcome class imbalance. For example, you can divide your own data set into two parts: one contains the classes with a large number of training samples (images/crops), and the other contains the classes with a limited number of samples. In each part, the class-imbalance problem will not be very serious. When fine-tuning on your data set, you first fine-tune on the classes with a large number of training samples, and then continue fine-tuning on the classes with a limited number of samples.
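
A minimal NumPy sketch of the up-sampling idea, where labels, minority_class and num_extra are hypothetical names used only for illustration:

>>> minority_idx = np.where(labels == minority_class)[0]                 # indices of the rare class
>>> extra = np.random.choice(minority_idx, size=num_extra, replace=True) # re-sample them with replacement
>>> balanced_idx = np.concatenate([np.arange(len(labels)), extra])       # original indices plus duplicates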

References & Source Links

  1. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012

  2. A Brief Overview of Deep Learning, a guest post by Ilya Sutskever.

  3. CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University, held by Prof. Fei-Fei Li and Andrej Karpathy.

  4. K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.

  5. B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolution Network. In ICML Deep Learning Workshop, 2015.

  6. N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929-1958, 2014.

  7. X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.

  8. Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC, 2012. (ISBN 978-1-439-83003-1)

  9. M. Mohammadi, and S. Das. S-NN: Stacked Neural Networks. Project in Stanford CS231n Winter Quarter, 2015.

  10. P. Hensman, and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015. 

