CCERT Data Sets of Chinese Emails (CDSCE)
Quang-Anh Tran, CCERT 2005
1. Overview
2. Downloads
3. Spam and Ham Collection Diagram
4. Data Sets Description
5. LICENSING TERMS
1. Overview
After the release of Chinese_rules.cf,
a third-party drop-in custom rule set for SpamAssassin
that catches spam written in Chinese, many researchers asked
to use our Chinese spam database. We therefore decided
to release part of the database as data sets of Chinese
e-mails (CDSCE) for research purposes. They should be of interest
to researchers working on the general problem of anti-spam.
2. Downloads
CDSCE is available to provide researchers with extensive
samples of spam and ham written in Chinese.
3. Spam and ham collection diagram
Figure 1 shows the process for generating the data sets.
We use a honeypot technique (SPAMPOT) to collect spam: every
email sent to xxx@ccert.edu.cn, where xxx is any string,
is collected. Ham messages are collected from Chinese public
forums. Messages are first put into a transition database
for pre-processing. We check the spam and ham messages manually
before adding them to our database of Chinese emails. The released
data sets are a part of this database.
Figure 1. Diagram for collecting spam and ham
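
The catch-all rule described above can be illustrated with a short Python sketch. It is not the SPAMPOT implementation, which is not documented here; the function name is_honeypot_recipient is hypothetical, and the only detail taken from the text is that any local part at ccert.edu.cn qualifies.

```python
from email.utils import parseaddr

# Domain taken from the description above; everything else is illustrative.
HONEYPOT_DOMAIN = "ccert.edu.cn"


def is_honeypot_recipient(recipient):
    """Return True if the recipient is <any string>@ccert.edu.cn."""
    # parseaddr strips an optional display name: "Name <a@b>" -> "a@b"
    _, addr = parseaddr(recipient)
    local, sep, domain = addr.rpartition("@")
    # Require a non-empty local part and a case-insensitive domain match.
    return bool(local) and sep == "@" and domain.lower() == HONEYPOT_DOMAIN


caught = is_honeypot_recipient("Any Name <abc123@CCERT.edu.cn>")
ignored = is_honeypot_recipient("user@example.com")
```

Messages accepted by such a rule would still go through the transition database and the manual check before entering the data sets.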
4. Data sets description
We removed all HTML tags from the body and kept only the
plain-text part of each message, but we left the 'Content-type'
header unchanged (it may be useful). As a result, the 'Content-type'
header may be inconsistent with the body. A ham message
is composed of a post (from a forum) and the raw header of a ham
message. Note that the raw headers of messages were kept unchanged,
except that all email addresses were replaced by a random address.
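
Because the 'Content-type' header may not match the body, a reader of the data sets may prefer to parse only the headers and treat the rest of the message as plain text. The following Python sketch shows one way to do this with the standard library's HeaderParser. The helper name split_message and the sample message are hypothetical, and decoding each file to text (the character set is not specified here) is left to the caller.

```python
from email.parser import HeaderParser


def split_message(raw_text):
    """Return (headers, body) without interpreting the body by Content-Type.

    HeaderParser stops after the header block, so the body is returned
    as an unparsed string even when the header claims HTML or multipart.
    """
    msg = HeaderParser().parsestr(raw_text)
    return msg, msg.get_payload()


# Hypothetical sample: the header declares HTML, the body is plain text.
sample = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "Content-Type: text/html; charset=gb2312\n"
    "\n"
    "plain text body\n"
)

headers, body = split_message(sample)
declared_type = headers.get_content_type()  # what the header claims
```

The declared type stays available through the headers object, so it can still be used as a feature even though it is not trusted for parsing.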
The 2005-Jun data set contains 25088 spam and 9272 ham messages; the
2005-Jul data set contains 20308 spam and 9042 ham messages. We used
SpamAssassin-3.0.4 and Chinese_rules.cf to score the messages.
The probability density curves of the resulting scores are shown
in Figure 2.
 
Figure 2. Probability density curves of the
2005-Jun (left) and 2005-Jul (right) data sets
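
The message counts given in this section determine the class balance of each data set, which matters when choosing evaluation metrics. The short Python sketch below computes the totals and spam fractions from those counts; the variable and function names are illustrative only.

```python
# Message counts as stated in the data set description.
counts = {
    "2005-Jun": {"spam": 25088, "ham": 9272},
    "2005-Jul": {"spam": 20308, "ham": 9042},
}


def spam_fraction(spam, ham):
    """Fraction of messages in a data set that are spam."""
    return spam / (spam + ham)


totals = {name: c["spam"] + c["ham"] for name, c in counts.items()}
fractions = {name: spam_fraction(**c) for name, c in counts.items()}

for name in counts:
    print(f"{name}: {totals[name]} messages, {fractions[name]:.1%} spam")
```

Both data sets are roughly 70% spam, so a classifier that labels every message as spam would already reach about that accuracy.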
5. LICENSING TERMS
Copyright 2005 CERNET Computer Emergency Response Team (CCERT)
The CDSCE is granted free of charge for research and education
purposes. However you must obtain a license from CCERT to
use it for commercial purposes.
Scientific results produced using the CDSCE shall
acknowledge the use of CDSCE. Please send feedback
to qa@ccert.edu.cn.
The CDSCE must not be modified and distributed without prior
permission of CCERT.
By using CDSCE you agree to the licensing terms.