
    

CCERT Data Sets of Chinese Emails (CDSCE)

Quang-Anh Tran, CCERT 2005

 

1. Overview

2. Downloads

3. Spam and Ham Collection Diagram

4. Data Sets Description

5. LICENSING TERMS


1. Overview

After the release of Chinese_rules.cf, a third-party drop-in custom rule set for SpamAssassin that catches spam written in Chinese, many researchers asked to use our Chinese spam database. We therefore decided to release part of the database as the CCERT Data Sets of Chinese Emails (CDSCE) for research purposes. The data sets should be of interest to researchers working on the general problem of anti-spam.

2. Downloads

CDSCE is available to provide researchers with extensive samples of spam and ham written in Chinese.

3. Spam and ham collection diagram

Figure 1 shows how the data sets are generated. We use a honeypot technique (SPAMPOT) to collect spam: every email sent to xxx@ccert.edu.cn, where xxx is any string, is collected. Ham messages are collected from Chinese public forums. Messages are first placed in a transition database for pre-processing. We check the spam and ham messages manually before adding them to our database of Chinese emails. The released data sets are a part of that database.

Figure 1. Diagram for collecting spam and ham
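The catch-all rule described above (any local part at ccert.edu.cn) can be illustrated with a minimal Python sketch. This is not the SPAMPOT implementation, which is not described here; the function name `is_honeypot_recipient` and the exact-domain matching are assumptions made for illustration only.

```python
from email.utils import parseaddr

# Domain named in the text; everything else in this sketch is hypothetical.
HONEYPOT_DOMAIN = "ccert.edu.cn"


def is_honeypot_recipient(recipient: str) -> bool:
    """Return True if the recipient is <any string>@ccert.edu.cn."""
    # parseaddr handles both "user@host" and "Name <user@host>" forms.
    _, addr = parseaddr(recipient)
    local, sep, domain = addr.rpartition("@")
    # Require a non-empty local part and an exact (case-insensitive) domain match.
    return bool(sep) and bool(local) and domain.lower() == HONEYPOT_DOMAIN
```

In this sketch subdomains such as mail.ccert.edu.cn are deliberately not matched, because the text only mentions addresses directly under ccert.edu.cn.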

4. Data sets description

We removed all HTML tags from the body and kept only the plain-text part of each message, but we left the 'Content-type' header unchanged because it may be useful. As a result, the 'Content-type' header may be inconsistent with the body. A ham message is composed of a post (from a forum) and the raw header of a ham message. Note that the raw headers of messages were kept unchanged, except that all email addresses were replaced by a random address.
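For readers who want to apply comparable pre-processing to their own messages, the two steps described above (stripping HTML tags from the body and replacing email addresses in the headers) might look roughly like the following. This is a sketch under assumptions, not the tooling used to build CDSCE: the function names, the address regular expression, and the `example.com` placeholder domain are all hypothetical, and the sketch generates a fresh random address for each match, which may differ from how the released data was produced.

```python
import random
import re
import string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body: str) -> str:
    """Drop all tags and keep the text between them."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Simplified address pattern (an assumption; real addresses can be more complex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def random_address(rng: random.Random) -> str:
    """Build a random placeholder address under a hypothetical domain."""
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{local}@example.com"


def anonymize_headers(raw_headers: str, rng=None) -> str:
    """Replace every email address in the raw headers, leaving the rest intact."""
    rng = rng or random.Random()
    return EMAIL_RE.sub(lambda m: random_address(rng), raw_headers)
```

Note that `strip_html` does not touch the headers, so a 'Content-type: text/html' header would remain on a body that is now plain text, which is the inconsistency mentioned above.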

The 2005-Jun data set contains 25088 spam and 9272 ham messages; the 2005-Jul data set contains 20308 spam and 9042 ham messages. We use SpamAssassin-3.0.4 with Chinese_rules.cf to score the messages. The probability density curves of the data sets are shown in Figure 2.

Figure 2. Probability density curves of the 2005-Jun (left) and 2005-Jul (right) data sets
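As a rough illustration of the figures above, the sketch below computes the class balance of each data set from the stated message counts and shows one simple way to estimate a probability density from a list of scores. The histogram estimator, the function names, and the bin width are assumptions; the method used to draw Figure 2 is not specified here, and no actual scores from the data sets are reproduced.

```python
import math
from collections import Counter

# Message counts as stated in the data set description.
DATA_SETS = {
    "2005-Jun": {"spam": 25088, "ham": 9272},
    "2005-Jul": {"spam": 20308, "ham": 9042},
}


def spam_fraction(counts: dict) -> float:
    """Fraction of messages in a data set that are spam."""
    total = counts["spam"] + counts["ham"]
    return counts["spam"] / total


def density_histogram(scores, bin_width=1.0):
    """Histogram-based density estimate of a list of scores.

    Returns {left_bin_edge: density}; the densities multiplied by
    bin_width sum to 1.
    """
    if not scores:
        return {}
    # Assign each score to the bin [k * bin_width, (k + 1) * bin_width).
    bins = Counter(math.floor(s / bin_width) for s in scores)
    n = len(scores)
    return {k * bin_width: c / (n * bin_width) for k, c in sorted(bins.items())}


if __name__ == "__main__":
    for name, counts in DATA_SETS.items():
        total = counts["spam"] + counts["ham"]
        print(f"{name}: {total} messages, spam fraction {spam_fraction(counts):.3f}")
```

To approximate curves like those in Figure 2, one would pass the SpamAssassin scores of the spam and ham messages separately to `density_histogram` and plot the results.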

5. LICENSING TERMS

Copyright 2005 CERNET Computer Emergency Response Team (CCERT)

The CDSCE is provided free of charge for research and educational purposes. However, you must obtain a license from CCERT to use it for commercial purposes.

Scientific results produced using the CDSCE shall acknowledge its use. Please send feedback to qa@ccert.edu.cn.

The CDSCE must not be modified and distributed without prior permission of CCERT.

By using CDSCE you agree to the licensing terms.