Summary: | Large-scale spam emails cause a serious waste of time and resources. This paper investigates the text features of email documents and the feature noises among multi-field texts, resulting in an observation of a power law distribution of feature strings within each text field. According to the observation, we propose an efficient filtering approach including a compound weight method and a lightweight field text classification algorithm. The compound weight method considers both the historical classifying ability of each field classifier and the classifying contribution of each text field in the current classified email. The lightweight field text classification algorithm straightforwardly calculates the arithmetical average of multiple conditional probabilities predicted from feature strings according to a string-frequency index for labeled emails storing. The string-frequency index structure has a random-sampling-based compressible property owing to the power law distribution and can largely reduce the storage space. The experimental results in the TREC spam track show that the proposed approach can complete the filtering task in low space cost and high speed, whose overall performance 1-ROCA exceeds the best one among the participators at the trec07p evaluation.
|