This paper proposes an automatic video object (VO) segmentation method that addresses two difficulties of existing region-based approaches: scenes containing multiple motions and the motion of deformable objects. The method exploits the duality of image segmentation and motion estimation, so that spatial and temporal information assist each other and jointly yield much improved segmentation results. Its key novelties are (1) scale-adaptive tensor computation, (2) spatially constrained motion mask generation without invoking dense motion-field computation, (3) rigidity analysis, (4) motion mask generation and selection, and (5) motion-constrained spatial region merging. Experimental results demonstrate that these novelties jointly contribute to much more accurate VO segmentation in both the spatial and temporal domains.