A static video abstract can convey the semantics of dynamic events. Building on current research on event detection, this paper proposes a method for generating a summary of speaking events in video by detecting faces within moving regions. Frames containing faces are extracted together with their moving regions, and a trained face classifier is applied to them to detect speaking events. By integrating the temporal and spatial characteristics of the video, the method conveys the visual content effectively. Experimental results indicate that the generated summaries are effective.
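The detection step can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: the moving region is found by simple frame differencing, frames are grayscale images stored as nested lists, and the trained face classifier is passed in as a hypothetical callable `is_face`. The names `moving_region`, `crop`, `speaking_frames`, and the `threshold` value are illustrative only and are not the paper's implementation.

```python
from typing import Callable, List, Optional, Tuple

Frame = List[List[int]]          # grayscale image: rows of 0-255 intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), all inclusive


def moving_region(prev: Frame, curr: Frame, threshold: int = 25) -> Optional[Box]:
    """Bounding box of pixels whose intensity changed by more than `threshold`.

    Assumption: motion is approximated by absolute frame differencing.
    Returns None when no pixel changed enough.
    """
    height, width = len(curr), len(curr[0])
    changed = [
        (r, c)
        for r in range(height)
        for c in range(width)
        if abs(curr[r][c] - prev[r][c]) > threshold
    ]
    if not changed:
        return None
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    return (min(rows), min(cols), max(rows), max(cols))


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the boxed patch out of a frame."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


def speaking_frames(
    frames: List[Frame],
    is_face: Callable[[Frame], bool],  # hypothetical stand-in for the trained classifier
    threshold: int = 25,
) -> List[int]:
    """Indices of frames whose moving region is accepted by the face classifier.

    Such frames are treated here as candidate speaking-event frames.
    """
    hits = []
    for i in range(1, len(frames)):
        box = moving_region(frames[i - 1], frames[i], threshold)
        if box is not None and is_face(crop(frames[i], box)):
            hits.append(i)
    return hits
```

In this sketch, the frames returned by `speaking_frames` would be the candidates from which a static summary is assembled. A real system would replace `is_face` with an actual trained classifier and would likely use a more robust motion estimate than raw differencing.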