std::UnigramTextClassifier | A text classifier based on single characters (unigrams). The basic idea: texts from the same class tend to have similar character (byte) frequencies. In information-theoretic terms, texts from the same class should require a similar number of bits per character under an ideal encoding built from that class's frequencies. We don't actually have to construct the encoding; we only need the number of bits it would use. The basic methods are learn (read a corpus and count the frequencies), dump (save the frequencies to a stream), and read (load the frequencies from a stream) |