Gesture has recently been widely studied in multimodal research, and most studies focus on how gesture interacts with speech in communication. This paper introduces and compares several hypotheses and models of the production of gesture and speech and of the interaction between them. We find general agreement that the speech production mechanism can be explained by Levelt’s model, whereas there is no consensus on gesture production or on the interaction between gesture and speech. Most theories argue that gesture stems from visual-spatial images in working memory; some models posit an interactive relationship between gesture and speech, while others assume no interaction. Further research will address both theoretical and applied aspects.