C++ Classes, and Real Quick software development

C++ is "C with classes." To explore C++ classes, as well as do a little software engineering, our second "real, quick" program will be a simple text classification tool that could be used, for example, to distinguish between "spam" (unwanted email) and "ham" (email we want). The basic idea is that texts from different classifications will have different character frequency characteristics (spam email from sex-sites might have lots of "x's" in them, for example). First, we'll train our text classifier on texts which have been marked as which class they belong too (We'll call this program ilearn, for "information-based learning"). Second, we'll test our classifier by seeing if it can distinguish between ham and spam. We'll call this program hamspam.

Here are the programs in action. We train the classifiers on files in ham and spam directories, and then test them on files in separate directories:

% cat /ham/ham.0/* | ilearn Ham > ham.txt
% cat /spam/spam.0/* | ilearn Spam > spam.txt
% hamspam ham.txt spam.txt /ham/ham.1/*
Total: 500 Ham: 497 Spam: 3
% hamspam ham.txt spam.txt /spam/spam.1/*
Total: 500 Ham: 244 Spam: 256

The spam filter is about 50% effective, and only occasionally mis-classifies a real email as a piece of spam. This is somewhat impressive for something so simple. Not impressive enough to use on a day-to-day basis, perhaps, but impressive enough.

A (very) little modeling

Given the example above, a text classifier needs at least two methods: a learn method (for training the classifier) and some kind of score method that will score a text against what it's learned. It would also be useful to dump training data and read it back again. Because we're basing our classifier on single characters ("unigrams"), we'll call the class UnigramTextClassifier.

Here is the code for hamspam.cpp:

#include <iostream>
#include "UnigramTextClassifier.h" 

using namespace std;
using namespace TextClassifier;

int main(int argc,char *argv[]) 
{ 
  UnigramTextClassifier ham = UnigramTextClassifier();
  UnigramTextClassifier spam = UnigramTextClassifier();
  ham.read(argv[1]);
  spam.read(argv[2]);
  int hams = 0;
  int spams = 0;
  int total = 0;
  for (int i=3; i<argc; i++)
    {
      total++; 
      if (ham.score(argv[i]) > spam.score(argv[i])) 
        hams++;
      else  
        spams++;
    }
  cout << "\nTotal: " << total << " Ham: " << hams << " Spam: " << spams << endl;
}

In addition to the #include <iostream> statement, there is this statement: #include "UnigramTextClassifier.h". It is standard, in writing C++ programs, to place function and class declarations in a separate file so that the program linker knows how to link object files together. When the #included file is given in double quotation marks, the definitions are looked for locally (usually in the same directory). We'll get to the definitions soon; we are seeing how these definitions are used in new programs.

A "C-style string" is an array of bytes, the last byte containing a 0. C++ also has, as we have seen, its own string type.

The main function, in this case, has two arguments, argc and argv. The variable argc contains the number of arguments given; it is at least 1 (because the program name is considered an argument). The variable argv is an array of C-style strings which are the arguments (the zeroth string is the program name).

In our short program, which does no error checking, argv[1] is the filename for the dumped ham frequencies, and argv[2] is the filename for the dumped spam frequencies. The remaining entries in argv (that is, from 3 to argc-1), contain filenames to check. For each file, if the score of the ham classifier is greater than the spam classifier, the number of purported ham files is incremented; otherwise the number of purported spam files is incremented. After all the files are processed, the totals are printed.

Heads will roll

The contents of a header file for a class contain the methods and data members of the class. Here's the header file for UnigramTextClassifier. There are a few things of note to start with: notice how some blocks are declared public and others private (I tend to follow the convention that data members should always be private and accessed through methods). Also notice that the class block ends in a semi-colon; leaving this out is a standard novice mistake.

#ifndef UnigramTextClassifier_H
#define UnigramTextClassifier_H
#include <map>
#include <iostream>
#include <fstream>

namespace TextClassifier 
{
  typedef map<unsigned char,unsigned long> frequency_map;
  class UnigramTextClassifier 
  {
  public:
    UnigramTextClassifier();
    UnigramTextClassifier(const string classification);

    frequency_map freqs() { return my_freqs; }
    unsigned long corpus_total() { return my_corpus_total; }
    unsigned long total() { return my_total; }
    string classification() { return my_classification; }
    void setClassification(string& classification) {my_classification = classification;}

    void UnigramTextClassifier::learn(istream& in);
    void UnigramTextClassifier::learn(char* in);
    
    void UnigramTextClassifier::dump(ostream& out);
    void UnigramTextClassifier::dump(char* out);

    void UnigramTextClassifier::read(istream& in);
    void UnigramTextClassifier::read(char* in);
  
    float UnigramTextClassifier::score(istream& in); 
    float UnigramTextClassifier::score(char* in) ;
     
    float UnigramTextClassifier::bits_required(unsigned char ch);
    float UnigramTextClassifier::bits_required(istream& in);
    float UnigramTextClassifier::bits_required(char* in);
  private:
    /* internal character->frequency map */
    frequency_map my_freqs;
    /* internal total number of characters in corpus */
    unsigned long my_corpus_total;
    /* internal total number of characters in text */
    unsigned long my_total;
    /* internal name of classifer */
    string my_classification;
    /* internal base-2 logarithm */
    float UnigramTextClassifier::lg (float n);
    /* internal information value function -lg(n) */
    float UnigramTextClassifier::info_value(float n);
    /* internal current time stream */
    string UnigramTextClassifier::ctime_string();
  };
}
using namespace std;

#endif /* UnigramTextClassifier_H_ */

Four other things of note:

The file starts with #ifndef UnigramTextClassifier_H #define UnigramTextClassifier_H. and ends in #endif. These are macro directives. In general, it's a good idea to avoid the macro system in C++ (in my opinion, anyway). But one place they are crucial is in header files, and the statements in this example are a paradigmatic case. If there's any chance that the header will be included twice, these macros ensure the text between #ifndef ("if not defined") and the end is not reread by the compiler. Similarly, the TextClassifier namespace is wrapped around the class header to declare a special namespace (which we use in the hamspam program).
Some of the methods--the really short ones--are defined right in the header. When a class method is defined in the #class block, it is defined in line, resulting in (theoretically) faster code. One standard is to define the field accessors in this way, and that is what this code does. I've used the convention of defining the private fields as starting with an "my" (for example, string my_classification.
Notice that C++ can have the same method name with different method parameters. I provide methods for a filename and a stream for the basic operations we defined above.
There isn't much documentation shown above. In fact, I recommend using Doxygen for code documentation. You'll need to see the Doxygen documentation for details, but see also the UnigramTextClassifier documentation Doxygen created for a sample.

Class walfare

Given this header file, what does the class code look like? Here's the code: it's not very trickly code.

#include <time.h>
#include "UnigramTextClassifier.h" 

using namespace std;

UnigramTextClassifier::UnigramTextClassifier() 
{
  my_total = 0;
  my_corpus_total = 0;
  my_classification = "Unknown"; 
}


UnigramTextClassifier::UnigramTextClassifier(const string classification) 
{
  my_total = 0;
  my_corpus_total = 0;
  my_classification = classification; 
}

void UnigramTextClassifier::learn(istream& in)
{
  char ch; // note: in.get requires char, not unsigned char
  while (in.get(ch))
    {
      my_freqs[ch]++;
      my_corpus_total++;
    }
}

void UnigramTextClassifier::learn(char* fname)
{
  ifstream instr(fname);
  learn(instr);
  instr.close();
}

void UnigramTextClassifier::dump(ostream& out) 
{
  frequency_map m = this->freqs();
  long tot = this->corpus_total();
  out << "// UnigramTextClassifier created this file from " << tot
       << " characters. It was created on " + ctime_string() << ".\n";
  out << "Classification\t" << this->classification() << endl;
  out << "Total\t" << tot << endl;
  for (frequency_map::iterator it=m.begin(); it != m.end(); it++) 
    {
      out << (unsigned int)it->first << "\t" << it->second << "\t" 
          << info_value((float)(it->second)/tot) << endl;
    }
}

void UnigramTextClassifier::dump(char* fname)
{
  ofstream instr(fname);
  dump(instr);
  instr.close();
}

void UnigramTextClassifier::read(istream& in)
{
  
  string classification; 
  unsigned char ch; 
  long frequency;
  char text[256];
  in.getline(text,256,'\n');  // throw away first line
  in.getline(text,256,'\t'); // ignore class header
  in.getline(text,256,'\n'); // get class
  classification = (string)text;
  this->setClassification(classification);
  in.getline(text,256); // throw away 'total' line
  
  while (true)
    {
      in.getline(text,256,'\t'); 
      if (text[0]==0) break;
      ch = (unsigned char)atoi(text);
      in.getline(text,256,'\t');
      frequency = atol(text);
      my_freqs[ch] = frequency;
      my_corpus_total+= frequency;
      in.getline(text,256); // ignore rest of line
    }
}


void UnigramTextClassifier::read(char* fname)
{
  ifstream instr(fname);
  read(instr);
  instr.close();
}

float UnigramTextClassifier::score (istream& in) 
{
  float br = bits_required(in);
  long tot = total()*8;
  float sc = 1 - br/tot;
  return (sc >= 0 ? sc : 0.0) ;
}

float UnigramTextClassifier::score(char* fname)
{
  ifstream instr(fname);
  return score(instr);
  instr.close();
}


float UnigramTextClassifier::bits_required(unsigned char ch) 
{
  frequency_map m = this->freqs();
  long tot = this->corpus_total();
  long frequency = ((!m[ch]) ? 1 : m[ch]) ;
  return info_value((float)frequency/tot);
}

float UnigramTextClassifier::bits_required(istream& in)
{
  char ch; // note: in.get requires char, not unsigned char
  float bits = 0;
  frequency_map tfreq; 
  // count the frequencies in this text
  while (in.get(ch))
    {
      tfreq[(unsigned char)ch]++;
      my_total++;
    }
  // iterate through the temp map adding bits * frequency
  for (frequency_map::iterator it=tfreq.begin(); it != tfreq.end(); it++) 
    {
      bits += bits_required(it->first) * it->second;
    }
  return bits;
}

float UnigramTextClassifier::bits_required(char* fname)
{
  ifstream instr(fname);
  return bits_required(instr);
  instr.close();
}

float UnigramTextClassifier::lg(float n) 
{
  const float log2 = log(2.0);
  return log(n)/log2;
}

float UnigramTextClassifier::info_value(float n) 
{
  return -lg(n);
}

string UnigramTextClassifier::ctime_string() 
{
  time_t rawtime;
  struct tm * timeinfo;
  time (&rawtime);
  timeinfo = localtime(&rawtime);
  char tmpbuf[128];
  strftime(tmpbuf,128,"%Y-%m-%d %H:%M:%S",timeinfo);
  return string(tmpbuf);
}

Here are the tricky bits:

Standard character input and output facilities are relatively primitive. My program wants to intrepret a character as an 8-bit, unsigned byte (which is what the type unsigned char guarantees. It's easy enough to cast a signed char to unsigned. In the data file, I put out the integer value of the character, its frequency, and the number of bits require. When reading the data back in, I ignore the bits required because they're not necessary (but they are nice to see in the output file). The ugliest pieces of code here, to my mind, is the code dealing with input and output. For example, the getline method for streams requires you to guess at how large a string you'll need to read in.
The score method essentially returns the ideal percentage of compression achievable given the frequencies in the corpus text and the text to be scored. Since it's possible for the number of bits required to encode a text to actually be larger than encoding it with just 8-bit bytes, the score method returns 0 in this case. Thus, we get a nice score that ranges from 0.0 to 1.0.
For what it's worth, when I originally wrote the bits_required method, I calculated a running bit total on each character. Later, I rewrote this to count the frequencies first, and then run through the (up to) 256 frequencies to calcuate the bit total. This was over 10 times faster--as usual, integer arithmetic is faster than floating point arithmetic. (Note also that the I assume that the frequency in the corpus is at least 1. Otherwise, it would require an infinite number of bits to encode!)

Make me an offer I can't refuse

I compiled these programs under G++, the C++ compiler from the Free Software Foundation. I also used GNU Make to do incremental compilation. A discussion of make is beyond this tutorial, but here is the Makefile I used:

CC = g++
CFLAGS = -Wall -g

all: ilearn hamspam  doc

UnigramTextClassifier.o:     UnigramTextClassifier.h UnigramTextClassifier.cpp
        $(CC) $(CFLAGS) -c UnigramTextClassifier.cpp

ilearn:      UnigramTextClassifier.o ilearn.cpp
        $(CC) $(CFLAGS) UnigramTextClassifier.o ilearn.cpp -o ilearn

hamspam:     UnigramTextClassifier.o hamspam.cpp
        $(CC) $(CFLAGS) UnigramTextClassifier.o hamspam.cpp -o hamspam

clean:
        rm -f core *.o *~

doc: UnigramTextClassifier.h doxygen.cfg
        doxygen doxygen.cfg