Chinese Spam Filter Rules - Chinese_rules.cf

Quang-Anh Tran, CCERT 2004-2005

 

1. What is Chinese_rules.cf

2. Background theory of Chinese_rules.cf

3. Framework for creating and using Chinese_rules.cf

4. Matching speed of Chinese_rules.cf

5. Performance of Chinese_rules.cf

6. User statistics of Chinese_rules.cf

7. How to use Chinese_rules.cf


1. What is Chinese_rules.cf

Chinese_rules.cf is a third-party, drop-in custom rule set for SpamAssassin that catches spam written in Chinese. Because SpamAssassin previously had no rules for Chinese mail, it could not catch Chinese spam effectively; Chinese_rules.cf is the first rule set to fill that gap. It is built from a large, up-to-date Chinese spam database owned by CCERT and is updated once a week, so it can catch very recent spam.
Chinese_rules.cf is also the first rule set for Chinese spam listed on the official SpamAssassin website. A search for “Chinese spam” in Chinese on Google, Yahoo, Baidu, or MSN returns a page related to Chinese_rules.cf as the first result.

2. Background theory of Chinese_rules.cf

Chinese_rules.cf is a content-based filter rule set. Spam detection methods fall into two categories: rule-based and statistical-based. Rule-based detection looks for spam-like patterns in an email, e.g. a subject containing “Free”. Statistical-based detection, on the other hand, treats filtering as a two-class categorization problem and trains the detector on a dataset of spam and ham. Bayesian filtering is the most widely used statistical-based method.
The advantage of the rule-based method is that rules can be shared: a rule created by one person can be used by others, so knowledge about spam spreads quickly. We say that its “space characteristic” is good. However, the rules are built manually, so it is hard to keep them up to date as spam changes. We say that its “time characteristic” is bad.
With the statistical-based method, on the other hand, the detector can be retrained quickly: as long as the training dataset is updated in time, the detector keeps up with changes in spam. Its “time characteristic” is therefore very good. The disadvantage is that the knowledge learned by one detector cannot be shared among servers, so its “space characteristic” is bad.
Chinese_rules.cf combines the rule-based and statistical-based methods in what we call a statistical rule-based method: the rules are created automatically by a statistical method. This gives it the advantages of both. Because the result is a set of rules, its “space characteristic” is good; because the rules are created automatically, its “time characteristic” is good. Table 1 compares Chinese_rules.cf with the traditional methods in theory.

Table 1. Chinese_rules.cf vs. traditional methods

Method              Space characteristic   Time characteristic
Rule-based          Good                   Bad
Statistical-based   Bad                    Good
Chinese_rules.cf    Good                   Good

CCERT was founded in 1998. Its anti-spam service handles a large volume of spam reports, which gives it a large, up-to-date Chinese spam database. Chinese_rules.cf is created automatically from this database.

3. Framework for creating and using Chinese_rules.cf

Figure 1 shows the framework for creating and using Chinese_rules.cf. Using the CCERT anti-spam service and feedback from users, we maintain a large, up-to-date database of Chinese spam and ham. A released version of this database is CDSCE. A statistical method automatically creates Chinese_rules.cf from the spam and ham database. Because the database is current, Chinese_rules.cf reflects the latest knowledge of spam; in other words, its “time characteristic” is good. Chinese_rules.cf is published on the CCERT website, where users around the world can download and use it, so its “space characteristic” is good.

Figure 1. Framework for creating and using Chinese_rules.cf
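
The statistical method itself is not described in this document. As a purely illustrative sketch of the general idea, and not the CCERT implementation, the script below keeps candidate strings that occur in at least a minimum number of spam subjects and in no ham subject. The input files (one subject or one candidate per line), the function names count_hits and select_candidates, and the selection criterion are all assumptions.

```shell
#!/bin/sh
# Illustrative only: the selection criterion below is an assumption,
# not the method used to build Chinese_rules.cf.

# count_hits STRING FILE
# Print how many lines of FILE contain STRING as a literal substring.
count_hits() {
    grep -F -c -e "$1" "$2" || true
}

# select_candidates SPAM_FILE HAM_FILE CANDIDATES_FILE MIN_SPAM_HITS
# Read one candidate string per line and print "string<TAB>spam_hits"
# for every candidate found in at least MIN_SPAM_HITS spam subjects
# and in no ham subject.
select_candidates() {
    while IFS= read -r candidate; do
        [ -n "$candidate" ] || continue
        spam_hits=$(count_hits "$candidate" "$1")
        ham_hits=$(count_hits "$candidate" "$2")
        if [ "$spam_hits" -ge "$4" ] && [ "$ham_hits" -eq 0 ]; then
            printf '%s\t%s\n' "$candidate" "$spam_hits"
        fi
    done < "$3"
}
```

A call such as select_candidates spam_subjects.txt ham_subjects.txt candidates.txt 2 (hypothetical file names) would list the strings worth turning into rules. A real generator would also need to assign scores, which this sketch does not attempt.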

4. Matching speed of Chinese_rules.cf

Chinese_rules.cf usually consists of around 500 rules. A rule set of this size may raise concerns about matching speed. However, Chinese_rules.cf matches efficiently, both in theory and in experiment, for two reasons: 1) The rules in Chinese_rules.cf are simple: they are short strings without any wildcards, and such rules match much faster than complicated ones. 2) 90% of the rules in Chinese_rules.cf are subject rules and only 10% are body rules. Since subject lines are always short, matching is quite fast.
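
To make the shape of such a rule concrete, the sketch below prints a subject rule in standard SpamAssassin rule syntax (header, describe, score). The rule name CN_SUBJECT_EXAMPLE, the sample string, the description text, and the score are hypothetical and are not taken from the actual Chinese_rules.cf.

```shell
#!/bin/sh
# emit_subject_rule NAME STRING SCORE
# Print a SpamAssassin header rule that matches STRING literally in the
# Subject header. All values passed in the example call are hypothetical.
emit_subject_rule() {
    # Refuse strings containing regex metacharacters or "/", so that the
    # pattern stays a plain string without wildcards.
    if printf '%s' "$2" | grep -q '[][\\/.^$*+?(){}|]'; then
        echo "emit_subject_rule: string contains special characters" >&2
        return 1
    fi
    printf 'header   %s Subject =~ /%s/\n' "$1" "$2"
    printf 'describe %s Subject contains a known spam phrase\n' "$1"
    printf 'score    %s %s\n' "$1" "$3"
}

# Hypothetical sample phrase meaning "free invoice".
emit_subject_rule CN_SUBJECT_EXAMPLE "免费发票" 1.5
```

How the string must be encoded for the rule to match depends on the SpamAssassin version and its character set settings, so the output should be treated as a template rather than a ready-to-use rule.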
In the experiment, we used a common PC (P4 2.8G CPU) and matched the Chinese_rules.cf version updated 2004 Dec 21 against 178482 emails. Scanning an email with an average size of 5.0 KB (attachments not counted) took 0.04 seconds. This result is remarkable: at that rate, a single common PC can process about 2.16 million emails per day. In general, an email server for college students sends and receives about 0.4 million emails per day. In other words, the computing capacity of one additional common PC is enough to filter spam for a college student email server.
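
The daily capacity follows directly from the per-message scan time. A minimal sketch of the arithmetic, assuming messages are scanned one after another and using integer milliseconds to avoid floating-point rounding; the function name emails_per_day is only illustrative:

```shell
#!/bin/sh
# emails_per_day SCAN_TIME_MS
# Messages one machine can scan in 24 hours when each message takes
# SCAN_TIME_MS milliseconds and messages are scanned sequentially.
emails_per_day() {
    echo $(( 24 * 60 * 60 * 1000 / $1 ))
}

emails_per_day 40    # 0.04 s per message -> prints 2160000
```

That is 5.4 times the 0.4 million messages per day cited for a college mail server.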

5. Performance of Chinese_rules.cf

Chinese_rules.cf is updated once a week. Each version records its measured performance in comments inside the file. Table 2 shows the performance of Chinese_rules.cf.

Table 2. Performance of Chinese_rules.cf, updated 2006 Oct 1

Threshold   Spam recall (121824)   Ham error (207961)
0.5         82.8%                  1.3%
1.0         78.2%                  0.5%
1.5         74.5%                  0.2%
2.0         71.6%                  0.1%
2.5         69.4%                  0.0%
3.0         67.3%                  0.0%
3.5         64.6%                  0.0%
4.0         62.6%                  0.0%
4.5         61.6%                  0.0%

It takes 0.0165 seconds to scan an email with size 1434.4620 (P4-2.8G CPU)

These figures were measured using Chinese_rules.cf alone on a dataset containing only Chinese email. In practice, Chinese_rules.cf is always used together with the default SpamAssassin rules. Some of those default rules, which describe email-sending behavior, may also match Chinese spam, so performance in practice may be even better.
Note that for a large email server in China, e.g. one handling 0.4 million emails per day, acceptable performance means a spam recall above 90% and a ham error below 1%.
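
As a quick check of the published figures against this target, the sketch below reports, for each threshold, whether Chinese_rules.cf alone meets the recall bound and the ham-error bound. The data rows are copied from the performance table for the 2006 Oct 1 version; the function name check_targets and the output layout are assumptions.

```shell
#!/bin/sh
# check_targets
# Compare each row (threshold, spam recall %, ham error %) with the
# stated target: spam recall above 90% and ham error below 1%.
check_targets() {
    awk '{
        recall_ok = ($2 > 90) ? "yes" : "no"
        ham_ok = ($3 < 1) ? "yes" : "no"
        printf "%s recall=%s%% (%s) ham_error=%s%% (%s)\n", $1, $2, recall_ok, $3, ham_ok
    }' <<'EOF'
0.5 82.8 1.3
1.0 78.2 0.5
1.5 74.5 0.2
2.0 71.6 0.1
2.5 69.4 0.0
3.0 67.3 0.0
3.5 64.6 0.0
4.0 62.6 0.0
4.5 61.6 0.0
EOF
}

check_targets
```

Every threshold from 1.0 upward meets the ham-error bound, but no row reaches 90% recall with Chinese_rules.cf alone, which is why the contribution of the default SpamAssassin rules matters in practice.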

6. User statistics of Chinese_rules.cf

Chinese_rules.cf was first released on 2004 Sept 7. Its user statistics are shown in Figures 2 and 3. Figure 2 shows the number of IP addresses accessing Chinese_rules.cf per month. We can see that the popularity of Chinese_rules.cf is increasing.

Figure 2. Number of IP addresses accessing Chinese_rules.cf per month

Figure 3 shows the number of distinct IP addresses of Unix/Linux servers that use Chinese_rules.cf. The patterned part indicates existing users, i.e. IP addresses that also appeared in the previous month.

Figure 3. Number of distinct server IP addresses that use Chinese_rules.cf

7. How to use Chinese_rules.cf

Download Chinese_rules.cf to the SpamAssassin rule directory (usually /usr/share/spamassassin). This can be done with wget:

# wget -N -P /usr/share/spamassassin www.ccert.edu.cn/spam/sa/Chinese_rules.cf

Each time Chinese_rules.cf is updated, SpamAssassin must be restarted. One way to restart spamd is to list its processes:

# ps -ax | grep spamd
Look for the PID of the spamd process, then send it a HUP signal:
# kill -HUP PID

If you use mimedefang, you need to restart mimedefang instead. Suppose the script that restarts mimedefang is /etc/init.d/init-script; then the command line is as follows:

# /etc/init.d/init-script restart

Chinese_rules.cf is updated every week. Please note that the rules and their scores are derived from the Chinese spam database of the last 3 months. Updating Chinese_rules.cf frequently may help SpamAssassin catch Chinese spam more effectively. To automate this, put the commands that download Chinese_rules.cf and restart mimedefang in crontab. For example, to update once a month, the crontab of root should contain the following line:

0 0 1 * * wget -N -P /usr/share/spamassassin www.ccert.edu.cn/spam/sa/Chinese_rules.cf; /etc/init.d/init-script restart
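
The cron entry above restarts the filter even if the download left a broken rule file in place. A more defensive variant, sketched below, runs spamassassin --lint after the download and restarts only when the check passes. The function names run and update_chinese_rules, the DRY_RUN switch, and the reuse of the placeholder /etc/init.d/init-script are assumptions, not part of the original instructions.

```shell
#!/bin/sh
# Hypothetical wrapper around the download and restart commands.
RULE_DIR=/usr/share/spamassassin
RULE_URL=www.ccert.edu.cn/spam/sa/Chinese_rules.cf
RESTART_SCRIPT=/etc/init.d/init-script   # placeholder name, adjust locally

# run CMD...
# Execute CMD, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# update_chinese_rules
# Download the rule file, check the whole rule set, then restart.
update_chinese_rules() {
    run wget -N -P "$RULE_DIR" "$RULE_URL" || return 1
    # Restart only if the installed rule set still passes the syntax check.
    run spamassassin --lint || return 1
    run "$RESTART_SCRIPT" restart
}
```

Setting DRY_RUN=1 before calling update_chinese_rules prints the three commands without executing them, which is a convenient way to check the paths. To use it from cron, save the script with a final call to update_chinese_rules and reference that script in the crontab line.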