This paper proposes application-specific instructions and their bit manipulation unit (BMU), which efficiently support scrambling, convolutional encoding, puncturing, interleaving, and bit stream multiplexing. The proposed DSP employs the BMU supporting parallel shift and XOR (exclusive-OR) operations and bit insertion/extraction operations on multiple data. The proposed architecture has been modeled by VHDL and synthesized using the SEC 0.18 μ m standard cell library and the gate count of the BMU is only about 1700 gates. Performance comparisons show that the number of clock cycles can be reduced about 40 % ∼ 80 % for scrambling, convolutional encoding, and interleaving compared with existing DSPs.