In biological sequence research, the positional weight matrix (PWM) is often used to search for putative transcription factor binding sites. A log-odd score is usually applied to measure the closeness of a subsequence to the PWM. However, the log-odd score is motif-length-dependent and thus there is no universally applicable threshold. In this paper, we propose an alternative scoring index (G) varying from zero, where the subsequence is not much different from the background, to one, where the subsequence fits best to the PWM. We also propose a measure evaluating the statistical expectation at each G index. We investigated the PWMs from the TRANSFAC and found that the statistical expectation is significantly ( p < 0.0001) correlated with both the length of the PWMs and the threshold G value. We applied this method to two PWMs (GCN4_C and ROX1_Q6) of yeast transcription factor binding sites and two PWMs (HIC1-02, HIC1_03) of the human tumor suppressor (HIC-1) binding sites from the TRANSFAC database. Finally, our method compares favorably with the broadly used Match method. The results indicate that our method is more flexible and can provide better confidence.
Proceedings of the International MultiConference of Engineers and Computer Scientists 2008 (IMECS'08) at the 2008 IAENG International Conference on Bioinformatics (ICB'08), March 19-21, 2008, Hong Kong, China.