Chinese Spam Filter Rules - Chinese_rules.cf
Quang-Anh Tran, CCERT 2004-2005
1. What is Chinese_rules.cf
2. Background theory of Chinese_rules.cf
3. Framework for creating and using Chinese_rules.cf
4. Matching speed of Chinese_rules.cf
5. Performance of Chinese_rules.cf
6. User statistics of Chinese_rules.cf
7. How to use Chinese_rules.cf
1. What is Chinese_rules.cf
Chinese_rules.cf is a third-party, drop-in custom rule set for
SpamAssassin that catches spam written in Chinese. Because there
were previously no rules for Chinese mail, SpamAssassin could not
catch Chinese spam effectively. Chinese_rules.cf is the first
SpamAssassin rule set for catching Chinese spam. It is built from
a large, up-to-date Chinese spam database owned by CCERT and is
updated once a week, so it is able to catch very recent spam.
Chinese_rules.cf is the first rule set for catching Chinese spam
listed on the official SpamAssassin website. If a user searches
for “Chinese spam” in Chinese on Google, Yahoo, Baidu, or MSN,
the first result relates to Chinese_rules.cf.
2. Background theory of Chinese_rules.cf
Chinese_rules.cf is a content-based filter rule set. Spam
detection falls into two categories: rule-based and statistical-based.
Rule-based detection looks for a spam-like pattern in an email,
e.g. a subject that contains “Free”. Statistical-based detection,
on the other hand, treats filtering as a two-class categorization
problem: it uses a training dataset of spam and ham to train the
detector. Bayesian filtering is the most widely used
statistical-based method for spam filtering.
The advantage of the rule-based method is that rules can be
shared: a rule created by one person can be used by others, so
knowledge about spam spreads quickly. We say that its “space
characteristic” is good. The rules, however, are built manually,
so it is hard to keep them up to date as spam changes. We say
that its “time characteristic” is bad.
With the statistical-based method, on the other hand, the
detector can be retrained quickly: as long as the training
dataset is updated in time, the detector keeps up with the
changes in spam. The “time characteristic” of this method is
therefore very good. Its disadvantage is that the knowledge held
by a detector cannot be shared among servers, so the “space
characteristic” of this method is bad.
Chinese_rules.cf is built as a trade-off between the rule-based
and statistical-based methods; we call this the statistical
rule-based method. In this method, the rules are created
automatically by a statistical method. It has the advantages of
both the rule-based and the statistical-based methods: since the
result is a set of rules, its “space characteristic” is good;
since the rules are created automatically, its “time
characteristic” is good. A theoretical comparison between
Chinese_rules.cf and the traditional methods is shown in Table 1.
Table 1. Chinese_rules.cf vs. traditional methods
| Method | Space characteristic | Time characteristic |
| Rule-based | Good | Bad |
| Statistical-based | Bad | Good |
| Chinese_rules.cf | Good | Good |
CCERT was founded in 1998. Its anti-spam service handles a large
volume of spam reports, and it owns a large, up-to-date Chinese
spam database. Chinese_rules.cf is created automatically from
this database.
3. Framework for creating and using Chinese_rules.cf
Figure 1 shows the framework for creating and using Chinese_rules.cf.
Using the CCERT anti-spam service and feedback from users, we
maintain a large, up-to-date database of Chinese spam and ham.
A released version of this database is CDSCE. A statistical
method automatically creates Chinese_rules.cf from the spam and
ham database. Since the database is recent, Chinese_rules.cf
contains the latest knowledge of spam; in other words, its “time
characteristic” is good. Chinese_rules.cf is published on the
CCERT website for users around the world to download and use, so
its “space characteristic” is good.
Figure 1. Framework for creating and using Chinese_rules.cf
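The statistical method that CCERT uses is not described in this
document. Purely to illustrate the idea of deriving shareable
rules from labelled mail, here is a minimal sketch under stated
assumptions: candidate strings are character 2-grams taken from
subjects, a string is kept if it appears in at least `min_spam`
spam subjects and in no more than `max_ham` ham subjects, and its
score is a smoothed log-odds. The function names, the `CN_AUTO`
rule prefix, the thresholds, the scoring formula, and the sample
subjects are all hypothetical. Only the `header`, `describe`, and
`score` line formats are standard SpamAssassin rule syntax.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the set of character n-grams that occur in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def generate_rules(spam_subjects, ham_subjects, n=2, min_spam=2, max_ham=0):
    """Select n-grams that are common in spam subjects and rare in ham.

    Returns (ngram, score) pairs, highest score first. The selection
    criteria and the scoring formula are illustrative assumptions.
    """
    spam_counts = Counter(g for s in spam_subjects for g in ngrams(s, n))
    ham_counts = Counter(g for s in ham_subjects for g in ngrams(s, n))
    rules = []
    for gram, in_spam in spam_counts.items():
        in_ham = ham_counts[gram]  # Counter returns 0 for unseen n-grams
        if in_spam < min_spam or in_ham > max_ham:
            continue
        # Smoothed log-odds of seeing the n-gram in spam vs. ham.
        p_spam = (in_spam + 1) / (len(spam_subjects) + 2)
        p_ham = (in_ham + 1) / (len(ham_subjects) + 2)
        rules.append((gram, round(math.log(p_spam / p_ham), 2)))
    rules.sort(key=lambda rule: (-rule[1], rule[0]))
    return rules


def perl_quote(text):
    """Backslash-escape ASCII punctuation so text is a literal pattern."""
    return "".join(
        "\\" + ch if ch.isascii() and not ch.isalnum() else ch for ch in text
    )


def to_spamassassin(rules, prefix="CN_AUTO"):
    """Render (ngram, score) pairs as SpamAssassin subject rules."""
    lines = []
    for i, (gram, score) in enumerate(rules, start=1):
        name = f"{prefix}_{i:03d}"
        lines.append(f"header {name} Subject =~ /{perl_quote(gram)}/")
        lines.append(f"describe {name} Subject contains an auto-selected string")
        lines.append(f"score {name} {score}")
    return "\n".join(lines)


# Made-up sample subjects, for illustration only.
spam = ["免费发票代开", "代开发票优惠", "免费赠送礼品"]
ham = ["会议通知", "项目进度报告", "周末聚餐安排"]

rules = generate_rules(spam, ham)
print(to_spamassassin(rules))
```

Because the output is an ordinary rule file, it can be shared
between servers like any hand-written rule set, and rerunning the
generator on a fresh database refreshes the rules. How the real
Chinese_rules.cf encodes Chinese strings in its patterns is not
covered by this sketch.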
4. Matching speed of Chinese_rules.cf
Chinese_rules.cf usually consists of around 500 rules. This
number may raise concerns about matching speed. However,
Chinese_rules.cf matches efficiently, both in theory and in
experiment, for two reasons: 1) The rules in Chinese_rules.cf are
simple: they are short strings without any wildcards, and such
simple rules match much faster than complicated ones; 2) 90% of
the rules in Chinese_rules.cf are subject rules and only 10% are
body rules. Since subjects are always short, matching with
Chinese_rules.cf is quite fast.
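To illustrate why such rules are cheap to evaluate, the sketch
below scores a message with plain literal substring tests, most
of which run against the short subject line. It is not
SpamAssassin's matching engine; the `Rule` type, the rule names,
the strings, and the scores are made-up placeholders.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    field: str    # "subject" or "body"
    needle: str   # literal string, no wildcards
    score: float


def score_message(subject, body, rules):
    """Return (total score, names of matching rules) for one message."""
    fields = {"subject": subject, "body": body}
    total = 0.0
    hits = []
    for rule in rules:
        # A literal containment test needs no regex engine or backtracking,
        # and its cost is bounded by the length of the field being searched.
        if rule.needle in fields[rule.field]:
            total += rule.score
            hits.append(rule.name)
    return total, hits


# Hypothetical rules: mostly subject rules, few body rules.
demo_rules = [
    Rule("DEMO_SUBJ_FREE", "subject", "Free", 1.5),
    Rule("DEMO_SUBJ_INVOICE", "subject", "发票", 2.0),
    Rule("DEMO_BODY_CLICK", "body", "click here", 0.5),
]

total, hits = score_message("Free gift", "Please click here", demo_rules)
print(total, hits)
```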
In the experiment, we used an ordinary PC (P4 2.8 GHz CPU).
Chinese_rules.cf (updated 2004 Dec 21) was used to scan 178482
emails. Scanning took 0.04 seconds per email at an average size
of 5.0 KB (attachments not counted). This result is remarkable
because it means that a single ordinary PC can process 2.16
million emails per day. In general, an email server for college
students sends and receives about 0.4 million emails per day. In
other words, adding the computing capacity of one ordinary PC is
enough to filter spam for a college student email server.
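The daily capacity follows directly from the measured scan time.
The short calculation below reproduces it, assuming messages are
scanned one after another at a constant 0.04 seconds each and the
machine does nothing else; the function and variable names are
only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def emails_per_day(seconds_per_email):
    """Emails one machine can scan in a day at a constant per-email cost."""
    return round(SECONDS_PER_DAY / seconds_per_email)


capacity = emails_per_day(0.04)   # measured scan time from the experiment
campus_load = 400_000             # about 0.4 million emails per day
headroom = capacity / campus_load

print(capacity)   # 2160000
print(headroom)   # 5.4
```

Under these assumptions the PC can scan roughly 5.4 times the
daily volume of such a server.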
5. Performance of Chinese_rules.cf
Chinese_rules.cf is updated once a week. For each version, we
record the experimental performance of that version in its
comments. Table 2 shows the performance of Chinese_rules.cf.
Chinese_rules.cf, updated 2006 Oct 1
| Threshold | Spam recall (121824) | Ham error (207961) |
| 0.5 | 82.8% | 1.3% |
| 1.0 | 78.2% | 0.5% |
| 1.5 | 74.5% | 0.2% |
| 2.0 | 71.6% | 0.1% |
| 2.5 | 69.4% | 0.0% |
| 3.0 | 67.3% | 0.0% |
| 3.5 | 64.6% | 0.0% |
| 4.0 | 62.6% | 0.0% |
| 4.5 | 61.6% | 0.0% |
It takes 0.0165 seconds to scan an email with size 1434.4620 (P4-2.8G CPU)
These figures were computed using only Chinese_rules.cf on a
dataset that contains only Chinese email. In practice,
Chinese_rules.cf is always used together with the default rules
of SpamAssassin. Some default rules that describe email-sending
behavior may also match Chinese spam, so the performance in
practice may be even better.
Note that for a big email server in China, i.e. one handling 0.4
million emails per day, acceptable performance means a spam
recall above 90% and a ham error below 1%.
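For readers who want to reproduce this kind of table on their own
mail, the sketch below shows how spam recall and ham error can be
computed from per-message scores at a given threshold, and how a
row can be checked against the 90% / 1% target. It assumes a
message is flagged when its score is greater than or equal to the
threshold. The function names and the toy scores are made up; the
`TABLE_2` rows are copied from the table above.

```python
def recall_and_ham_error(spam_scores, ham_scores, threshold):
    """Return (spam recall, ham error) as fractions at the given threshold."""
    spam_caught = sum(score >= threshold for score in spam_scores)
    ham_flagged = sum(score >= threshold for score in ham_scores)
    return spam_caught / len(spam_scores), ham_flagged / len(ham_scores)


def meets_target(recall, ham_error):
    """Target quoted in the text: recall above 90%, ham error below 1%."""
    return recall > 0.90 and ham_error < 0.01


# Made-up scores, for illustration only.
toy_spam = [0.6, 1.2, 3.0, 0.0, 5.1]
toy_ham = [0.0, 0.0, 0.7, 0.0]
print(recall_and_ham_error(toy_spam, toy_ham, 0.5))  # (0.8, 0.25)
print(recall_and_ham_error(toy_spam, toy_ham, 1.0))  # (0.6, 0.0)

# Rows of the table above as (threshold, recall %, ham error %).
TABLE_2 = [
    (0.5, 82.8, 1.3), (1.0, 78.2, 0.5), (1.5, 74.5, 0.2),
    (2.0, 71.6, 0.1), (2.5, 69.4, 0.0), (3.0, 67.3, 0.0),
    (3.5, 64.6, 0.0), (4.0, 62.6, 0.0), (4.5, 61.6, 0.0),
]
passing = [t for t, recall, err in TABLE_2 if meets_target(recall / 100, err / 100)]
print(passing)  # []
```

Taken alone, no row of the table reaches 90% recall, which is
consistent with using Chinese_rules.cf alongside the default
SpamAssassin rules rather than alone.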
6. User statistics of Chinese_rules.cf
Chinese_rules.cf was first released on 7 September 2004. Its user
statistics are shown in Figures 2 and 3. Figure 2 shows the
number of IP addresses that access Chinese_rules.cf per month. We
can see that the popularity of Chinese_rules.cf is increasing.

Figure 2. Number of IP addresses accessing Chinese_rules.cf per month
Figure 3 shows the number of distinct IP addresses of Unix/Linux
servers that use Chinese_rules.cf. The patterned part indicates
returning users, i.e. those whose IP address also appeared in the
previous month.
Figure 3. Number of distinct server IP addresses that use Chinese_rules.cf
7. How to use Chinese_rules.cf
Download Chinese_rules.cf to the SpamAssassin rule directory
(usually /usr/share/spamassassin). This can be done with wget:
# wget -N -P /usr/share/spamassassin www.ccert.edu.cn/spam/sa/Chinese_rules.cf
Each time Chinese_rules.cf is updated, SpamAssassin must be
restarted. One way to restart spamd is to list its processes:
# ps -ax | grep spamd
Look for the PID of the spamd process, then send it a HUP signal:
# kill -HUP PID
If you use mimedefang, you need to restart mimedefang as well.
Supposing the script that restarts mimedefang is
/etc/init.d/init-script, the command line is as follows:
# /etc/init.d/init-script restart
Chinese_rules.cf is updated every week. Please note that the
rules as well as their scores are updated based on the Chinese
spam database of the last 3 months. Updating Chinese_rules.cf
frequently may help SpamAssassin catch Chinese spam more
effectively. To do that, put the commands that download
Chinese_rules.cf and restart mimedefang in crontab. For example,
to update once a month, the crontab of root should contain the
following line:
0 0 1 * * wget -N -P /usr/share/spamassassin www.ccert.edu.cn/spam/sa/Chinese_rules.cf; /etc/init.d/init-script restart
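The cron entry restarts the filter on every run, even when the
rule file has not changed. As an optional refinement, the sketch
below wraps the same two commands in a small script that restarts
only when the downloaded file differs from the installed one. It
is a sketch, not a tested deployment script: the function names
are hypothetical, the paths and URL are the ones used above, and
it relies on `wget -N` leaving the local file untouched when the
server copy is not newer.

```python
import hashlib
import subprocess
from pathlib import Path

RULE_DIR = Path("/usr/share/spamassassin")
RULE_URL = "www.ccert.edu.cn/spam/sa/Chinese_rules.cf"
RESTART_CMD = ["/etc/init.d/init-script", "restart"]


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    path = Path(path)
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def update_rules(rule_dir=RULE_DIR, url=RULE_URL,
                 restart_cmd=RESTART_CMD, run=subprocess.run):
    """Download the rule file and restart only if its content changed.

    Returns True if a restart was issued, False otherwise.
    """
    target = Path(rule_dir) / "Chinese_rules.cf"
    before = file_digest(target)
    # Same wget invocation as in the cron entry above.
    run(["wget", "-N", "-P", str(rule_dir), url], check=True)
    if file_digest(target) == before:
        return False
    run(restart_cmd, check=True)
    return True
```

Calling `update_rules()` from a script run by root's crontab
would then replace the two commands in the cron line.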