C++ is "C with classes." To explore C++ classes, as well as do a little software engineering, our second "real, quick" program will be a simple text classification tool that could be used, for example, to distinguish between "spam" (unwanted email) and "ham" (email we want). The basic idea is that texts from different classifications will have different character frequency characteristics (spam email from sex-sites might have lots of "x's" in them, for example). First, we'll train our text classifier on texts which have been marked as which class they belong too (We'll call this program ilearn, for "information-based learning"). Second, we'll test our classifier by seeing if it can distinguish between ham and spam. We'll call this program hamspam.
Here are the programs in action. We train the classifiers on files in ham and spam directories, and then test them on files in separate directories:
% cat /ham/ham.0/* | ilearn Ham > ham.txt % cat /spam/spam.0/* | ilearn Spam > spam.txt % hamspam ham.txt spam.txt /ham/ham.1/* Total: 500 Ham: 497 Spam: 3 % hamspam ham.txt spam.txt /spam/spam.1/* Total: 500 Ham: 244 Spam: 256
The spam filter is about 50% effective, and only occasionally mis-classifies a real email as a piece of spam. This is somewhat impressive for something so simple. Not impressive enough to use on a day-to-day basis, perhaps, but impressive enough.
Given the example above, a text classifier needs at least two methods: a learn method (for training the classifier) and some kind of score method that will score a text against what it's learned. It would also be useful to dump training data and read it back again. Because we're basing our classifier on single characters ("unigrams"), we'll call the class UnigramTextClassifier.
Here is the code for hamspam.cpp:
#include <iostream> #include "UnigramTextClassifier.h" using namespace std; using namespace TextClassifier; int main(int argc,char *argv[]) { UnigramTextClassifier ham = UnigramTextClassifier(); UnigramTextClassifier spam = UnigramTextClassifier(); ham.read(argv[1]); spam.read(argv[2]); int hams = 0; int spams = 0; int total = 0; for (int i=3; i<argc; i++) { total++; if (ham.score(argv[i]) > spam.score(argv[i])) hams++; else spams++; } cout << "\nTotal: " << total << " Ham: " << hams << " Spam: " << spams << endl; }
In addition to the #include <iostream> statement, there is this statement: #include "UnigramTextClassifier.h". It is standard, in writing C++ programs, to place function and class declarations in a separate file so that the program linker knows how to link object files together. When the #included file is given in double quotation marks, the definitions are looked for locally (usually in the same directory). We'll get to the definitions soon; we are seeing how these definitions are used in new programs.
The main function, in this case, has two arguments, argc and argv. The variable argc contains the number of arguments given; it is at least 1 (because the program name is considered an argument). The variable argv is an array of C-style strings which are the arguments (the zeroth string is the program name).
In our short program, which does no error checking, argv[1] is the filename for the dumped ham frequencies, and argv[2] is the filename for the dumped spam frequencies. The remaining entries in argv (that is, from 3 to argc-1), contain filenames to check. For each file, if the score of the ham classifier is greater than the spam classifier, the number of purported ham files is incremented; otherwise the number of purported spam files is incremented. After all the files are processed, the totals are printed.
The contents of a header file for a class contain the methods and data members of the class. Here's the header file for UnigramTextClassifier. There are a few things of note to start with: notice how some blocks are declared public and others private (I tend to follow the convention that data members should always be private and accessed through methods). Also notice that the class block ends in a semi-colon; leaving this out is a standard novice mistake.
#ifndef UnigramTextClassifier_H #define UnigramTextClassifier_H #include <map> #include <iostream> #include <fstream> namespace TextClassifier { typedef map<unsigned char,unsigned long> frequency_map; class UnigramTextClassifier { public: UnigramTextClassifier(); UnigramTextClassifier(const string classification); frequency_map freqs() { return my_freqs; } unsigned long corpus_total() { return my_corpus_total; } unsigned long total() { return my_total; } string classification() { return my_classification; } void setClassification(string& classification) {my_classification = classification;} void UnigramTextClassifier::learn(istream& in); void UnigramTextClassifier::learn(char* in); void UnigramTextClassifier::dump(ostream& out); void UnigramTextClassifier::dump(char* out); void UnigramTextClassifier::read(istream& in); void UnigramTextClassifier::read(char* in); float UnigramTextClassifier::score(istream& in); float UnigramTextClassifier::score(char* in) ; float UnigramTextClassifier::bits_required(unsigned char ch); float UnigramTextClassifier::bits_required(istream& in); float UnigramTextClassifier::bits_required(char* in); private: /* internal character->frequency map */ frequency_map my_freqs; /* internal total number of characters in corpus */ unsigned long my_corpus_total; /* internal total number of characters in text */ unsigned long my_total; /* internal name of classifer */ string my_classification; /* internal base-2 logarithm */ float UnigramTextClassifier::lg (float n); /* internal information value function -lg(n) */ float UnigramTextClassifier::info_value(float n); /* internal current time stream */ string UnigramTextClassifier::ctime_string(); }; } using namespace std; #endif /* UnigramTextClassifier_H_ */
Four other things of note:
Given this header file, what does the class code look like? Here's the code: it's not very trickly code.
#include <time.h> #include "UnigramTextClassifier.h" using namespace std; UnigramTextClassifier::UnigramTextClassifier() { my_total = 0; my_corpus_total = 0; my_classification = "Unknown"; } UnigramTextClassifier::UnigramTextClassifier(const string classification) { my_total = 0; my_corpus_total = 0; my_classification = classification; } void UnigramTextClassifier::learn(istream& in) { char ch; // note: in.get requires char, not unsigned char while (in.get(ch)) { my_freqs[ch]++; my_corpus_total++; } } void UnigramTextClassifier::learn(char* fname) { ifstream instr(fname); learn(instr); instr.close(); } void UnigramTextClassifier::dump(ostream& out) { frequency_map m = this->freqs(); long tot = this->corpus_total(); out << "// UnigramTextClassifier created this file from " << tot << " characters. It was created on " + ctime_string() << ".\n"; out << "Classification\t" << this->classification() << endl; out << "Total\t" << tot << endl; for (frequency_map::iterator it=m.begin(); it != m.end(); it++) { out << (unsigned int)it->first << "\t" << it->second << "\t" << info_value((float)(it->second)/tot) << endl; } } void UnigramTextClassifier::dump(char* fname) { ofstream instr(fname); dump(instr); instr.close(); } void UnigramTextClassifier::read(istream& in) { string classification; unsigned char ch; long frequency; char text[256]; in.getline(text,256,'\n'); // throw away first line in.getline(text,256,'\t'); // ignore class header in.getline(text,256,'\n'); // get class classification = (string)text; this->setClassification(classification); in.getline(text,256); // throw away 'total' line while (true) { in.getline(text,256,'\t'); if (text[0]==0) break; ch = (unsigned char)atoi(text); in.getline(text,256,'\t'); frequency = atol(text); my_freqs[ch] = frequency; my_corpus_total+= frequency; in.getline(text,256); // ignore rest of line } } void UnigramTextClassifier::read(char* fname) { ifstream instr(fname); read(instr); instr.close(); } float UnigramTextClassifier::score (istream& in) { float br = bits_required(in); long tot = total()*8; float sc = 1 - br/tot; return (sc >= 0 ? sc : 0.0) ; } float UnigramTextClassifier::score(char* fname) { ifstream instr(fname); return score(instr); instr.close(); } float UnigramTextClassifier::bits_required(unsigned char ch) { frequency_map m = this->freqs(); long tot = this->corpus_total(); long frequency = ((!m[ch]) ? 1 : m[ch]) ; return info_value((float)frequency/tot); } float UnigramTextClassifier::bits_required(istream& in) { char ch; // note: in.get requires char, not unsigned char float bits = 0; frequency_map tfreq; // count the frequencies in this text while (in.get(ch)) { tfreq[(unsigned char)ch]++; my_total++; } // iterate through the temp map adding bits * frequency for (frequency_map::iterator it=tfreq.begin(); it != tfreq.end(); it++) { bits += bits_required(it->first) * it->second; } return bits; } float UnigramTextClassifier::bits_required(char* fname) { ifstream instr(fname); return bits_required(instr); instr.close(); } float UnigramTextClassifier::lg(float n) { const float log2 = log(2.0); return log(n)/log2; } float UnigramTextClassifier::info_value(float n) { return -lg(n); } string UnigramTextClassifier::ctime_string() { time_t rawtime; struct tm * timeinfo; time (&rawtime); timeinfo = localtime(&rawtime); char tmpbuf[128]; strftime(tmpbuf,128,"%Y-%m-%d %H:%M:%S",timeinfo); return string(tmpbuf); }
Here are the tricky bits:
CC = g++ CFLAGS = -Wall -g all: ilearn hamspam doc UnigramTextClassifier.o: UnigramTextClassifier.h UnigramTextClassifier.cpp $(CC) $(CFLAGS) -c UnigramTextClassifier.cpp ilearn: UnigramTextClassifier.o ilearn.cpp $(CC) $(CFLAGS) UnigramTextClassifier.o ilearn.cpp -o ilearn hamspam: UnigramTextClassifier.o hamspam.cpp $(CC) $(CFLAGS) UnigramTextClassifier.o hamspam.cpp -o hamspam clean: rm -f core *.o *~ doc: UnigramTextClassifier.h doxygen.cfg doxygen doxygen.cfg