A high-level optimization methodology is applied for implementing the well-known convolutional face finder (CFF) algorithm for real-time applications on mobile phones, such as teleconferencing, advanced user interfaces, image indexing, and security access control. CFF is based on a feature extraction and classification technique which consists of a pipeline of convolutions and subsampling operations. The design of embedded systems requires a good trade-off between performance and code size due to the limited amount of available resources. The followed methodology copes with the main drawbacks of the original implementation of CFF such as floating-point computation and memory allocation, in order to allow parallelism exploitation and perform algorithm optimizations. Experimental results show that our embedded face detection system can accurately locate faces with less computational load and memory cost. It runs on a 275 MHz Starcore DSP at 35 QCIF images/s with state-of-the-art detection rates and very low false alarm rates.