I created this page to write down notes and reflections on the research papers and blog posts I have read. Here are some of the most useful ones, gathered from the recent projects I have been working on. They are grouped by topic and project. Some titles are followed by a brief comment describing the main reason I chose to record them here.

Emotion Recognition

  1. Zheng W L, Lu B L. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks[J]. IEEE Transactions on Autonomous Mental Development, 2015, 7(3): 162-175.

    • Discusses the critical frequency bands for emotion recognition: Delta, Theta, Alpha, Beta and Gamma (a band-power sketch follows this item).
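     A minimal sketch of extracting per-band power from an EEG segment with SciPy's Welch PSD. The sampling rate, signal and exact band limits below are my own placeholder assumptions, not values taken from the paper:

     ```python
     import numpy as np
     from scipy.signal import welch

     fs = 200                                   # assumed sampling rate (Hz)
     eeg = np.random.randn(30 * fs)             # stand-in for a 30 s single-channel EEG segment

     bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
              "beta": (13, 30), "gamma": (30, 50)}

     freqs, psd = welch(eeg, fs=fs, nperseg=4 * fs)
     band_power = {name: np.trapz(psd[(freqs >= lo) & (freqs < hi)],
                                  freqs[(freqs >= lo) & (freqs < hi)])
                   for name, (lo, hi) in bands.items()}
     print(band_power)
     ```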
  2. Zong C, Chetouani M. Hilbert-Huang transform based physiological signals analysis for emotion recognition[C]//Signal Processing and Information Technology (ISSPIT), 2009 IEEE International Symposium on. IEEE, 2009: 334-339.

    • Applied the Hilbert-Huang Transform to feature extraction for emotion recognition.
    • Extracted features from both the fission and fusion procedures, treating statistics such as the root mean square of each IMF as features (see the sketch after this item).
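     A minimal sketch of the IMF-statistics idea, using the third-party PyEMD package (pip name EMD-signal) for the empirical mode decomposition; the signal and the chosen statistics are placeholders rather than the paper's exact setup:

     ```python
     import numpy as np
     from PyEMD import EMD                     # pip install EMD-signal

     signal = np.random.randn(2000)            # stand-in for a physiological signal
     imfs = EMD()(signal)                      # empirical mode decomposition -> IMFs

     # simple per-IMF statistics used as features (RMS, mean, standard deviation)
     features = []
     for imf in imfs:
         features.extend([np.sqrt(np.mean(imf ** 2)),   # root mean square
                          np.mean(imf),
                          np.std(imf)])
     features = np.asarray(features)
     ```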

Sleep Stage Scoring/Detection Based on Biological Signals

  1. Ronzhina M, Janoušek O, Kolářová J, et al. Sleep scoring using artificial neural networks[J]. Sleep medicine reviews, 2012, 16(3): 251-263.

    • Reads like a review of sleep stage scoring using EEG signals. It covers most of the commonly used feature-extraction methods and separates them into three groups: time domain, frequency or time-frequency domain, and non-linear features.
    • The most useful part of this paper is the feature summary, which includes features such as the Hjorth parameters (see the sketch after this item).
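     A minimal NumPy sketch of the Hjorth parameters (activity, mobility, complexity); these are the standard definitions, not code from the review:

     ```python
     import numpy as np

     def hjorth_parameters(x):
         """Return Hjorth activity, mobility and complexity of a 1-D signal."""
         dx = np.diff(x)                       # first derivative
         ddx = np.diff(dx)                     # second derivative
         var_x, var_dx, var_ddx = np.var(x), np.var(dx), np.var(ddx)
         activity = var_x
         mobility = np.sqrt(var_dx / var_x)
         complexity = np.sqrt(var_ddx / var_dx) / mobility
         return activity, mobility, complexity

     activity, mobility, complexity = hjorth_parameters(np.random.randn(3000))
     ```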
  2. Ren Y, Wu Y. Convolutional deep belief networks for feature extraction of EEG signal[C]//Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014: 2850-2853.

    • Uses a DBN to extract features from EEG signals; it works like an auto-encoder.
    • The convolutional DBN uses max-pooling layers to reduce the dimensionality.
  3. Chapotot F, Becq G. Automated sleep–wake staging combining robust feature extraction, artificial neural network classification, and flexible decision rules[J]. International Journal of Adaptive Control and Signal Processing, 2010, 24(5): 409-423.

    • Detailed explanation and description of candidate feature extraction. It also reveals the importance scores of some features.
    • Hjorth activity & mobility, sigma relative power, beta relative power and the 95% spectral edge frequency are the top-ranked ones.
    • The spectral edge frequency could also be set to 50%; I think this would work better (a spectral-edge-frequency sketch follows this item).
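     A minimal sketch of the spectral edge frequency (the frequency below which a given fraction of the total spectral power lies), computed from a Welch PSD; the sampling rate and window length are placeholder assumptions:

     ```python
     import numpy as np
     from scipy.signal import welch

     def spectral_edge_frequency(x, fs, edge=0.95):
         """Frequency below which `edge` of the total spectral power lies."""
         freqs, psd = welch(x, fs=fs, nperseg=2 * fs)
         cum_power = np.cumsum(psd) / np.sum(psd)
         return freqs[np.searchsorted(cum_power, edge)]

     fs = 100                                  # assumed sampling rate (Hz)
     x = np.random.randn(30 * fs)              # stand-in 30 s EEG epoch
     sef95 = spectral_edge_frequency(x, fs, 0.95)
     sef50 = spectral_edge_frequency(x, fs, 0.50)
     ```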
  4. Ebrahimi F, Mikaeili M, Estrada E, et al. Automatic sleep stage classification based on EEG signals by using neural networks and wavelet packet coefficients[C]//Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE. IEEE, 2008: 1151-1154.

    • A great paper on sleep stage classification using features extracted from the energy of the wavelet packet transform of the EEG signal.
    • Discusses the selection of the wavelet and the number of decomposition levels. In particular, the paper extracts the “mean quadratic value” of the wavelet packet coefficients as the energy form (see the sketch after this item).
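     A minimal PyWavelets sketch of taking the mean quadratic value of the coefficients in each wavelet packet node as an energy feature; the wavelet, decomposition level and signal are placeholder choices, not the ones selected in the paper:

     ```python
     import numpy as np
     import pywt

     fs = 100
     x = np.random.randn(30 * fs)                          # stand-in 30 s EEG epoch

     wp = pywt.WaveletPacket(data=x, wavelet="db4", maxlevel=4)
     nodes = wp.get_level(4, order="freq")                 # leaf nodes in frequency order

     # mean quadratic value of the coefficients in each node, used as an energy feature
     energies = np.array([np.mean(np.square(node.data)) for node in nodes])
     ```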
  5. Losonczi L, Bako L, Brassai S T, et al. Hilbert-Huang transform used for EEG signal analysis[C]//The International Conference Interdisciplinarity in Engineering INTER-ENG. Editura Universitatii "Petru Maior" din Tirgu Mures, 2012: 361.

    • Applies the Hilbert-Huang Transform to the processing of EEG signals.
  6. Li Y, Yingle F, Gu L, et al. Sleep stage classification based on EEG Hilbert-Huang transform[C]//Industrial Electronics and Applications, 2009. ICIEA 2009. 4th IEEE Conference on. IEEE, 2009: 3676-3681.

    • Its main merit is the description of the energy form of the Hilbert-Huang Transform as a feature of EEG signals.
  7. Subasi A, Gursoy M I. EEG signal classification using PCA, ICA, LDA and support vector machines[J]. Expert Systems with Applications, 2010, 37(12): 8659-8666.

    • Applied PCA, ICA and LDA to reduce the dimensionality of the EEG features and used the output for seizure classification.
    • The LDA-based pipeline obtained the best score (see the sketch after this item).
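     A minimal scikit-learn sketch of that kind of pipeline (dimensionality reduction followed by an SVM); the data, dimensions and classifier settings are placeholders, not the paper's configuration:

     ```python
     import numpy as np
     from sklearn.decomposition import PCA, FastICA
     from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
     from sklearn.pipeline import make_pipeline
     from sklearn.svm import SVC

     X = np.random.randn(200, 64)              # stand-in EEG feature matrix
     y = np.random.randint(0, 2, 200)          # stand-in labels (seizure / non-seizure)

     pipelines = {
         "PCA+SVM": make_pipeline(PCA(n_components=10), SVC()),
         "ICA+SVM": make_pipeline(FastICA(n_components=10), SVC()),
         "LDA+SVM": make_pipeline(LinearDiscriminantAnalysis(n_components=1), SVC()),
     }
     for name, pipe in pipelines.items():
         pipe.fit(X, y)
         print(name, pipe.score(X, y))
     ```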
  8. Dong H, Supratak A, Pan W, et al. Mixed Neural Network Approach for Temporal Sleep Stage Classification[J]. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2017.

    • A newly published paper on applying neural-network approaches to sleep stage classification with a single-channel EEG model.
    • Applied an STFT with a 5-second moving window and 70% overlap to reduce each 30-second segment of the original EEG signal.
    • Calculated the PSD by summing the amplitude values at each frequency over the windows after the Fourier transform.
    • Extracted statistical features of the spectral power in each small window, because the authors believe that features such as the average and median can reveal useful biological information.
    • After implementing this paper's approach in my own sleep stage classification work, I found the statistical features generated from the small moving windows' PSD (where I used Welch's method with 50% overlap instead of the method used in the paper) very useful. These features carry information about K-complexes, vertex sharp waves, sleep spindles and the like (see the sketch after this item).
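     A minimal sketch of the variant described above: slide a 5 s window with 70% overlap over a 30 s epoch, compute a Welch PSD per window (50% overlap inside Welch), then take statistics of the spectral power across the windows. The window sizes and overlaps are the ones mentioned in the notes; everything else is a placeholder:

     ```python
     import numpy as np
     from scipy.signal import welch

     fs = 100                                  # assumed sampling rate (Hz)
     epoch = np.random.randn(30 * fs)          # stand-in 30 s EEG epoch

     win, step = 5 * fs, int(5 * fs * 0.3)     # 5 s window, 70% overlap
     psds = []
     for start in range(0, len(epoch) - win + 1, step):
         seg = epoch[start:start + win]
         freqs, psd = welch(seg, fs=fs, nperseg=fs, noverlap=fs // 2)   # Welch, 50% overlap
         psds.append(psd)
     psds = np.array(psds)                     # (n_windows, n_freqs)

     # statistical features of the spectral power across the moving windows
     features = np.concatenate([psds.mean(axis=0), np.median(psds, axis=0),
                                psds.std(axis=0), psds.max(axis=0)])
     ```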
  9. Nakamura T, Adjei T, Alqurashi Y, et al. Complexity science for sleep stage classification from EEG[C]//Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017: 4387-4394.

    • Improved the classification between the N1 and REM stages.
    • Applied structural complexity analysis to sleep stage classification from EEG.
    • Used the multi-scale entropy (MSE) method to measure the structural complexity of the EEG series.
    • Generated the MSE from coarse-grained time series {y(t)}: the EEG signal is divided into overlapping/non-overlapping windows of length t (the scale factor), and each window is averaged.
    • Calculated the MSE for evaluating multi-scale complexity using fuzzy entropy (FE) and permutation entropy (PE).
    • Since the epoch-based criteria often depend on the preceding and following epochs, the epoch size was enlarged from 30 seconds to 90 seconds, i.e. the current epoch is concatenated with the preceding and following 30 seconds of data for analysis.
    • Applied the 95% and 50% spectral edge frequencies and the difference between them as features (a coarse-graining sketch follows this item).
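     A minimal sketch of the coarse-graining step behind multi-scale entropy. The paper evaluates fuzzy entropy and permutation entropy at each scale; the permutation-entropy implementation below is a standard one written for illustration, not the paper's code:

     ```python
     import numpy as np
     from math import factorial, log

     def coarse_grain(x, scale):
         """Non-overlapping coarse-graining: average consecutive windows of length `scale`."""
         n = len(x) // scale
         return x[:n * scale].reshape(n, scale).mean(axis=1)

     def permutation_entropy(x, m=3):
         """Normalized permutation entropy with embedding dimension m and delay 1."""
         patterns = [tuple(np.argsort(x[i:i + m])) for i in range(len(x) - m + 1)]
         _, counts = np.unique(patterns, axis=0, return_counts=True)
         p = counts / counts.sum()
         return -np.sum(p * np.log(p)) / log(factorial(m))

     epoch = np.random.randn(9000)             # stand-in 90 s epoch at 100 Hz
     mse_curve = [permutation_entropy(coarse_grain(epoch, s)) for s in range(1, 11)]
     ```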
  10. Suzuki Y, Sato M, Shiokawa H, et al. MASC: Automatic Sleep Stage Classification Based on Brain and Myoelectric Signals[C]//Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 2017: 1489-1496.

    • Proposed a novel method that re-classifies sleep stages only for results with low confidence.
    • Considered the temporal sleep stage transitions.
  11. da Silveira T L T, Kozakevicius A J, Rodrigues C R. Single-channel EEG sleep stage classification based on a streamlined set of statistical features in wavelet domain[J]. Medical & biological engineering & computing, 2017, 55(2): 343-352.

    • Described how to extract features from EEG signals in the wavelet domain.
    • Used both the DWT and the CWT with different features.
    • Extracted the variance, kurtosis and skewness from the DWT coefficients and Rényi's entropy from the CWT (see the sketch after this item).
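     A minimal sketch of the DWT part with PyWavelets and SciPy: decompose the epoch and take the variance, kurtosis and skewness of the coefficients at each level. The wavelet and decomposition level are placeholder choices, and the Rényi entropy / CWT part is omitted:

     ```python
     import numpy as np
     import pywt
     from scipy.stats import kurtosis, skew

     fs = 100
     x = np.random.randn(30 * fs)                          # stand-in 30 s EEG epoch

     coeffs = pywt.wavedec(x, wavelet="db4", level=5)      # DWT decomposition

     # variance, kurtosis and skewness of the coefficients at each level
     features = np.array([[np.var(c), kurtosis(c), skew(c)] for c in coeffs]).ravel()
     ```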
  12. Hassan A R, Bhuiyan M I H. Computer-aided sleep staging using complete ensemble empirical mode decomposition with adaptive noise and bootstrap aggregating[J]. Biomedical Signal Processing and Control, 2016, 24: 1-10.

    • Applied Complete Ensemble Empirical Mode Decomposition with Adaptive Noise to EEG signals.
    • Compared the difference between EMD, EEMD and CEEMDAN.
    • Calculated higher-order statistical features, such as the variance, mean, skewness and kurtosis, from the transformed EEG signals.
  13. Chambon S, Galtier M, Arnal P, et al. A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series[J]. arXiv preprint arXiv:1707.03321, 2017.

    • Applied a deep learning approach (CNNs, etc.) to learn end-to-end without computing hand-crafted features.
    • Separated EEG/EOG from EMG to build multivariate models.
    • Found that using one minute of data before and after each data segment offers the best improvement.

Short-term Rainfall Nowcasting

  1. Shi X, Chen Z, Wang H, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting[C]//Advances in neural information processing systems. 2015: 802-810.

    • Proposed a new network structure that combines the convolutional network and the LSTM network, taking advantage of both. It is a good way to analyze spatio-temporal data, especially data that can be rendered as meaningful images.
    • The ConvLSTM layers are available in the Keras package; the model converges quickly and stays stable (see the sketch after this item).
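     A minimal tf.keras sketch of stacking ConvLSTM2D layers for next-frame prediction; the input shape, filter counts and toy data are placeholder assumptions, not the architecture from the paper:

     ```python
     import numpy as np
     from tensorflow import keras
     from tensorflow.keras import layers

     # stand-in radar-echo sequences: (samples, time, rows, cols, channels)
     x = np.random.rand(8, 10, 64, 64, 1).astype("float32")
     y = np.random.rand(8, 64, 64, 1).astype("float32")    # next frame to predict

     model = keras.Sequential([
         layers.Input(shape=(10, 64, 64, 1)),
         layers.ConvLSTM2D(16, kernel_size=3, padding="same", return_sequences=True),
         layers.ConvLSTM2D(16, kernel_size=3, padding="same"),
         layers.Conv2D(1, kernel_size=1, padding="same", activation="sigmoid"),
     ])
     model.compile(optimizer="adam", loss="binary_crossentropy")
     model.fit(x, y, epochs=1, batch_size=2)
     ```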

Generative Adversarial Networks

  1. A blog post on Zhihu discussing the fundamental theory of GANs, with a simple demo network.

  2. Blog posts on Zhihu [1][2] about DCGAN, with a demo on GitHub (in Keras).

  3. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]//Advances in neural information processing systems. 2014: 2672-2680.

    • The paper that introduced the GAN framework.
    • The training procedure for the generator G is to maximize the probability of the discriminator D making a mistake. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere.
    • The GAN model needs no Markov chains or unrolled approximate inference networks during either training or sample generation.
    • When training the model, G must not be trained too much without updating D, in order to avoid “the Helvetica scenario” in which G collapses too many values of z to the same value of x and loses the diversity needed to model p_data.
    • Optimizing D to completion in an inner loop of training is computationally prohibitive, and on a finite dataset would lead to overfitting. Instead, the paper alternates k steps of optimizing D with one step of optimizing G; as long as G changes slowly enough, D stays near its optimal solution (a minimal training-loop sketch follows this item).
    • Summary on WeChat Blog.
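     A minimal PyTorch sketch of that alternating scheme on toy data: k discriminator updates per generator update with binary cross-entropy losses. The architectures, data distribution and hyper-parameters are placeholders, not the paper's experimental setup:

     ```python
     import torch
     import torch.nn as nn

     z_dim, x_dim, k = 16, 2, 1
     G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
     D = nn.Sequential(nn.Linear(x_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())
     opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
     opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
     bce = nn.BCELoss()

     real = torch.randn(1024, x_dim) + 3.0             # stand-in samples from p_data

     for step in range(2000):
         for _ in range(k):                            # k steps of optimizing D ...
             x_real = real[torch.randint(0, len(real), (64,))]
             x_fake = G(torch.randn(64, z_dim)).detach()
             d_loss = (bce(D(x_real), torch.ones(64, 1))
                       + bce(D(x_fake), torch.zeros(64, 1)))
             opt_d.zero_grad(); d_loss.backward(); opt_d.step()

         x_fake = G(torch.randn(64, z_dim))            # ... then one step of optimizing G
         g_loss = bce(D(x_fake), torch.ones(64, 1))    # G tries to make D label fakes as real
         opt_g.zero_grad(); g_loss.backward(); opt_g.step()
     ```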
  4. Salimans T, Goodfellow I, Zaremba W, et al. Improved techniques for training gans[C]//Advances in Neural Information Processing Systems. 2016: 2234-2242.

    • Useful techniques for training or modifying the GAN framework.
    • Explained that training GANs requires finding a Nash equilibrium of a non-convex game with continuous, high-dimensional parameters. However, the simple gradient-based method may fail to converge.
    • Feature matching: specify a new objective for the generator that prevents it from over-training on the current discriminator; the generator is required to produce data that matches the expected value of the features on an intermediate layer of the discriminator (see the sketch after this item).
    • Minibatch discrimination: because the discriminator processes each example independently, which can lead to less diversity in the generator, the closeness between examples in a minibatch is added as extra features.
    • Historical averaging: add a term to the cost function that penalizes each player's parameters for deviating from their historical average, which helps in finding equilibria.
    • One-sided label smoothing: replace the 1 targets of the classifier with a smoothed value such as 0.9, while leaving the 0 targets unchanged.
    • Virtual batch normalization: normalize each example using statistics collected from a fixed reference batch rather than from its own minibatch.
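     A minimal PyTorch sketch of the feature-matching objective, assuming `D_features` is an intermediate layer of the discriminator; all names, shapes and data are placeholders for illustration:

     ```python
     import torch
     import torch.nn as nn

     z_dim, x_dim, f_dim = 16, 32, 64
     G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
     D_features = nn.Sequential(nn.Linear(x_dim, f_dim), nn.LeakyReLU(0.2))  # intermediate layer f(x)
     D_head = nn.Linear(f_dim, 1)                                            # final real/fake logit

     x_real = torch.randn(64, x_dim) + 2.0             # stand-in real batch
     x_fake = G(torch.randn(64, z_dim))

     # feature matching: match the expected intermediate features of real and generated data
     fm_loss = torch.mean((D_features(x_real).mean(dim=0)
                           - D_features(x_fake).mean(dim=0)) ** 2)
     fm_loss.backward()                                # gradients flow back into the generator
     ```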

Optimization Theory

  1. Sebastian Ruder's blog post / arXiv paper: an overview of gradient descent optimization algorithms.

    • A very good blog post describing the most popular optimization methods, especially the gradient-based algorithms.
  2. Blog post on norms in machine learning.

    • Discusses the L0, L1 and L2 norms in machine learning (see the sketch after this item).
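     A tiny NumPy example of the three norms and how they are typically used as regularizers (my own illustration, not from the blog post):

     ```python
     import numpy as np

     w = np.array([0.0, -3.0, 4.0, 0.0])

     l0 = np.count_nonzero(w)            # L0 "norm": number of non-zero entries -> 2
     l1 = np.sum(np.abs(w))              # L1 norm: sum of absolute values       -> 7.0
     l2 = np.sqrt(np.sum(w ** 2))        # L2 norm: Euclidean length             -> 5.0

     # typical use: add lam * l1 (promotes sparsity) or lam * l2**2 (weight decay) to a loss
     lam = 0.01
     regularized_loss = 1.0 + lam * l1
     ```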
  3. When to use Adam and when to use SGD: https://zhuanlan.zhihu.com/p/32338983

    • Explains in detail the improvements and purpose of each optimizer.
    • Discusses when and how to use Adam and SGD (an update-rule sketch follows this item).
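     For reference, the standard SGD and Adam update rules written out in NumPy (default hyper-parameters; the quadratic loss is only an assumed toy example):

     ```python
     import numpy as np

     def sgd_step(w, grad, lr=0.01):
         """Plain SGD update."""
         return w - lr * grad

     def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
         """One Adam update; m and v are running first/second moment estimates."""
         m = beta1 * m + (1 - beta1) * grad
         v = beta2 * v + (1 - beta2) * grad ** 2
         m_hat = m / (1 - beta1 ** t)              # bias correction
         v_hat = v / (1 - beta2 ** t)
         return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

     w = np.array([1.0, -2.0]); m = np.zeros_like(w); v = np.zeros_like(w)
     for t in range(1, 101):
         grad = 2 * w                              # gradient of the toy loss ||w||^2
         w, m, v = adam_step(w, grad, m, v, t)
     ```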

Computer Vision

  1. Lecture slides of CSE/EE 486. Instructor: Robert Collins of CSE Department, Penn State University.

  2. Lecture slides of CS 131 Computer Vision: Foundations and Applications, Fall 2016-2017. Instructors: Prof. Fei-Fei Li and Dr. Juan Carlos Niebles of CS Department, Stanford Univ.

  3. Lowe D G. Object recognition from local scale-invariant features[C]//Computer vision, 1999. The proceedings of the seventh IEEE international conference on. IEEE, 1999, 2: 1150-1157.

    • Proposed the Scale-Invariant Feature Transform (SIFT) algorithm for extracting local features from images.
    • A staged filtering approach: smooth the image and compute the Difference of Gaussians (DoG) by taking differences of images filtered with different sigmas (the DoG step can be replaced by the LoG).
    • Select local extrema by comparing each point's DoG value with its 8 surrounding neighbours in the same DoG image and the 9 corresponding neighbours in each adjacent-scale DoG image.
    • Compute the direction and magnitude of the gradient at each key point.
    • Generate the keypoint descriptor as a normalized 128-dimensional vector.
    • It is better to read this blog post in Chinese, with the Wikipedia article as a supplement (an OpenCV usage sketch follows this item).
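     A minimal usage sketch with OpenCV's built-in SIFT implementation (rather than re-implementing the paper); the random image is a stand-in for a real grayscale photo:

     ```python
     import cv2
     import numpy as np

     # stand-in grayscale image; in practice use cv2.imread(path, cv2.IMREAD_GRAYSCALE)
     img = (np.random.rand(256, 256) * 255).astype(np.uint8)

     sift = cv2.SIFT_create()                      # available in OpenCV >= 4.4
     keypoints, descriptors = sift.detectAndCompute(img, None)

     # each descriptor is a 128-dimensional vector (descriptors is None if no keypoints found)
     print(len(keypoints), None if descriptors is None else descriptors.shape)
     ```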
  4. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

    • Reference: Discussion about the Residual Network
    • Briefly discusses the algorithms and advantages used in ResNet.
    • Works through how the signal propagates through the residual network.
    • Shows a topological explanation of the shortcut structure in the residual network (a residual-block sketch follows this item).
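     A minimal tf.keras sketch of a basic residual block (identity shortcut plus two convolutions); the filter counts, kernel sizes and input shape are placeholders, not the exact configuration from the paper:

     ```python
     from tensorflow import keras
     from tensorflow.keras import layers

     def residual_block(x, filters=64):
         """Identity-shortcut residual block: output = relu(x + F(x))."""
         shortcut = x
         y = layers.Conv2D(filters, 3, padding="same")(x)
         y = layers.BatchNormalization()(y)
         y = layers.Activation("relu")(y)
         y = layers.Conv2D(filters, 3, padding="same")(y)
         y = layers.BatchNormalization()(y)
         y = layers.Add()([shortcut, y])           # the shortcut connection
         return layers.Activation("relu")(y)

     inputs = keras.Input(shape=(32, 32, 64))
     model = keras.Model(inputs, residual_block(inputs))
     ```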

Natural Language Processing

  1. Prabhavalkar R, Sainath T N, Li B, et al. An analysis of “attention” in sequence-to-sequence models[C]//Proc. of Interspeech. 2017.

    • Summarized the attention-based models used in automatic speech recognition (ASR).
    • Discussed the Full-Sequence attention and Limited-Sequence attention models.
    • Discussed two methods of computing attention: dot-product attention and tanh attention (see the sketch after this item).
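     A minimal NumPy sketch of the two scoring functions, with a decoder state q attending over encoder states K/V; all shapes and weight matrices are randomly initialized placeholders:

     ```python
     import numpy as np

     def softmax(s):
         e = np.exp(s - s.max())
         return e / e.sum()

     T, d, h = 6, 8, 5                     # timesteps, state size, attention hidden size
     q = np.random.randn(d)                # decoder query state
     K = np.random.randn(T, d)             # encoder states (keys)
     V = K                                 # values (here identical to the keys)

     # dot-product attention: score_t = k_t . q
     context_dot = softmax(K @ q) @ V

     # "tanh" (additive) attention: score_t = v . tanh(W_k k_t + W_q q)
     W_k, W_q, v = np.random.randn(h, d), np.random.randn(h, d), np.random.randn(h)
     context_tanh = softmax(np.tanh(K @ W_k.T + q @ W_q.T) @ v) @ V
     ```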
  2. Smoothing algorithm for the N-gram model in NLP.

Machine Learning / Deep Learning Theory

  1. Zhou Z H, Feng J. Deep forest: Towards an alternative to deep neural networks[J]. arXiv preprint arXiv:1702.08835, 2017.

    • Proposed a novel deep model: an adaptively grown cascade of random forests and completely-random forests.
    • Fewer hyper-parameters need to be tuned, and the depth is determined adaptively during training. Moreover, because back-propagation is not used (each cascade level only consumes the output of the previous one), the algorithm requires far less computational hardware.
    • Two mechanisms are combined: the cascade structure, which borrows the layer-by-layer connectivity idea from deep neural networks, and multi-grained scanning, which is analogous to convolution in CNNs.
    • A very good use of ensembling; it can be applied in many data competitions.
    • Works well on small datasets (a cascade-level sketch follows this item).
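     A minimal scikit-learn sketch of one cascade level in that spirit: each level's forests output class-probability vectors that are concatenated with the raw features and passed on. Only the cascade idea is shown (multi-grained scanning is omitted), ExtraTrees stands in for the completely-random forests, and the data and forest settings are placeholders:

     ```python
     import numpy as np
     from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
     from sklearn.model_selection import cross_val_predict

     X = np.random.randn(300, 20)              # stand-in feature matrix
     y = np.random.randint(0, 3, 300)          # stand-in labels (3 classes)

     def cascade_level(X_raw, X_in, y):
         """One cascade level: append each forest's class probabilities to the raw features."""
         forests = [RandomForestClassifier(n_estimators=100, random_state=0),
                    ExtraTreesClassifier(n_estimators=100, random_state=0)]
         probas = [cross_val_predict(f, X_in, y, cv=3, method="predict_proba")
                   for f in forests]           # out-of-fold probabilities
         return np.hstack([X_raw] + probas)

     X_level1 = cascade_level(X, X, y)         # 20 raw features + 2 forests * 3 classes
     X_level2 = cascade_level(X, X_level1, y)  # grow the cascade by one more level
     ```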
  2. Softmax function and cross entropy, https://zhuanlan.zhihu.com/p/27223959.

    • Explains maximum likelihood estimation, the softmax function, relative entropy (Kullback-Leibler divergence) and cross entropy.
    • Derives the gradients of these functions and explains their relationships (see the sketch after this item).
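     A small NumPy illustration of the relationship: the gradient of the cross-entropy of a softmax with respect to the logits is simply softmax(z) minus the one-hot target (my own worked example, consistent with the standard derivation):

     ```python
     import numpy as np

     def softmax(z):
         e = np.exp(z - z.max())              # subtract the max for numerical stability
         return e / e.sum()

     def cross_entropy(p, y_onehot):
         return -np.sum(y_onehot * np.log(p))

     z = np.array([2.0, 1.0, 0.1])            # logits
     y = np.array([1.0, 0.0, 0.0])            # one-hot target

     p = softmax(z)
     loss = cross_entropy(p, y)
     grad_z = p - y                           # gradient of CE(softmax(z)) w.r.t. z
     ```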
  3. Chen T, Guestrin C. XGBoost: A scalable tree boosting system[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 785-794.

    • The paper that describes the XGBoost system.
  4. Liu F T, Ting K M, Zhou Z H. Isolation forest[C]//Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008: 413-422.

    • Proposed a novel technique for anomaly detection using unsupervised methods.
    • The Isolation Forest relies on the noticeable difference between anomalous and normal point values; this is the distinguishing property that lets the binary trees isolate unusual points within only a few splits.
    • In my experience, this algorithm is fast and performs better than a one-class SVM (see the sketch after this item).
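     A minimal scikit-learn sketch of Isolation Forest on toy 2-D data; the data and parameters are placeholders for illustration:

     ```python
     import numpy as np
     from sklearn.ensemble import IsolationForest

     rng = np.random.RandomState(0)
     X_normal = rng.randn(200, 2)                       # stand-in normal points
     X_outliers = rng.uniform(-6, 6, size=(10, 2))      # stand-in anomalies
     X = np.vstack([X_normal, X_outliers])

     clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
     labels = clf.fit_predict(X)                        # +1 = normal, -1 = anomaly
     scores = clf.decision_function(X)                  # lower scores = more anomalous
     print((labels == -1).sum(), "points flagged as anomalies")
     ```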
  5. Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization[J]. arXiv preprint arXiv:1409.2329, 2014.

    • Discusses how to apply the dropout method to RNNs, especially LSTM models.
    • The main idea is to apply the dropout operator only to the non-recurrent connection.
    • The dropout operator corrupts the information carried by the units, forcing them to perform their intermediate computations more robustly.
    • The information is corrupted by the dropout operator exactly L+1 times and this number is independent of the number of timesteps traversed by the information.
    • By not using dropout on the recurrent connections, the LSTM can benefit from dropout regularization without sacrificing its valuable memorization ability (see the sketch after this item).
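     A minimal tf.keras sketch that roughly follows this idea: the LSTM layers use `dropout` (applied only to the non-recurrent, input connections) while `recurrent_dropout` is left at 0. The data, layer sizes and rates are placeholder assumptions:

     ```python
     import numpy as np
     from tensorflow import keras
     from tensorflow.keras import layers

     x = np.random.rand(32, 20, 8).astype("float32")    # (samples, timesteps, features)
     y = np.random.randint(0, 2, size=(32, 1))

     model = keras.Sequential([
         layers.Input(shape=(20, 8)),
         # dropout= affects only the non-recurrent (input) connections;
         # recurrent_dropout=0 keeps the recurrent connections intact
         layers.LSTM(32, dropout=0.5, recurrent_dropout=0.0, return_sequences=True),
         layers.LSTM(32, dropout=0.5, recurrent_dropout=0.0),
         layers.Dense(1, activation="sigmoid"),
     ])
     model.compile(optimizer="adam", loss="binary_crossentropy")
     model.fit(x, y, epochs=1, batch_size=8)
     ```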
  6. Backpropagation algorithm: a discussion on Zhihu.