Abstract: We present a deep convolutional recurrent neural network for speech emotion recognition based on log-Mel filterbank energies, where the convolutional layers are responsible for discriminative feature learning. Based on the hypothesis that a better understanding of the internal structure of an utterance helps reduce misclassification, we further propose a convolutional attention mechanism to learn the utterance structure relevant to the task. In addition, we quantitatively measure the performance gain contributed by each module in our model in order to characterize the nature of emotion expressed in speech. The experimental results on the eNTERFACE'05 emotion database validate our hypothesis and demonstrate an absolute improvement of 4.62% over the state-of-the-art approach.
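To make the described architecture concrete, the sketch below shows one plausible arrangement of the components named in the abstract: convolutional layers over log-Mel filterbank energies, a recurrent layer over the resulting frame-level features, and a convolutional attention mechanism that pools frames into an utterance-level representation. This is an illustrative sketch only, not the authors' exact model; all layer sizes, kernel widths, and the class `ConvRecurrentAttention` are assumptions (the six output classes match the eNTERFACE'05 emotion categories).

```python
# Minimal sketch of a convolutional recurrent network with convolutional
# attention over log-Mel filterbank energies. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRecurrentAttention(nn.Module):
    def __init__(self, n_mels=40, n_classes=6, hidden=128):
        super().__init__()
        # Convolutional layers: discriminative feature learning on the
        # time-frequency representation.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        feat_dim = 64 * (n_mels // 4)
        # Recurrent layer: models temporal structure across frames.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Convolutional attention: scores each frame from its local context.
        self.attn = nn.Conv1d(2 * hidden, 1, kernel_size=5, padding=2)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (B, 1, T, n_mels)
        h = self.conv(x)                         # (B, 64, T, n_mels // 4)
        B, C, T, M = h.shape
        h = h.permute(0, 2, 1, 3).reshape(B, T, C * M)
        h, _ = self.rnn(h)                       # (B, T, 2 * hidden)
        scores = self.attn(h.transpose(1, 2))    # (B, 1, T)
        alpha = F.softmax(scores, dim=-1)        # attention weights over frames
        context = torch.bmm(alpha, h).squeeze(1) # (B, 2 * hidden) utterance vector
        return self.fc(context)                  # class logits

# Example: a batch of 4 utterances, 300 frames, 40 log-Mel bands.
if __name__ == "__main__":
    model = ConvRecurrentAttention()
    logits = model(torch.randn(4, 1, 300, 40))
    print(logits.shape)  # torch.Size([4, 6])
```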