Quinlan J.R. - C4.5: Programs for Machine Learning :: Electronic Library of the Board of Trustees of the MSU Faculty of Mechanics and Mathematics

Despite its age, this classic is invaluable to any serious user of See5 (Windows) or C5.0 (UNIX). C4.5 (See5/C5) is a classifier system that is often used for machine learning, or as a data mining tool for discovering patterns in databases. The classifiers can take the form of either decision trees or rule sets. Like ID3, it employs a "divide and conquer" strategy, and it uses entropy (information content) to compute the gain ratio, which serves as its split criterion.
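To make the split criterion concrete, here is a minimal Python sketch of the gain ratio for a single nominal attribute. It is an illustration and not Quinlan's C code: the function names `entropy` and `gain_ratio` and the toy data are assumptions made for this example, and it leaves out numeric thresholds, unknown values, and C4.5's additional requirement that a candidate test have at least average information gain.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the nominal attribute `values`."""
    total = len(labels)
    # Group the class labels by attribute value (one subset per branch).
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus the
    # size-weighted entropy of the subsets after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the branch sizes themselves,
    # which penalizes attributes with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: the first attribute separates the classes perfectly,
# the second carries no information about the class.
labels = ["x", "x", "y", "y"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # 1.0
print(gain_ratio(["a", "b", "a", "b"], labels))  # 0.0
```

Dividing the gain by the split information is what distinguishes this criterion from ID3's plain information gain, which tends to favor attributes with many distinct values.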

C5.0 and See5 are built on C4.5, which is open source and free. However, since C5.0 and See5 are commercial products, the code and the internals of the See5/C5 algorithms are not public. That is why this book is still so valuable. The first half explains how C4.5 works and describes features such as partitioning, pruning, and windowing in detail. It also discusses how C4.5 should be used and the potential problems with overfitting and non-representative data. The second half gives a complete listing of the source code: 8,800 lines of C.

C5.0 is faster and more accurate than C4.5, and it adds features that C4.5 lacks, such as cross-validation, variable misclassification costs, and boosting. However, since even minor misuse of See5 could have cost our company tens of millions of dollars, it was important that we knew as much as possible about what we were doing, which is why this book was so valuable to us.

The reasons we did not use, for example, neural networks were:
(1) We had a lot of nominal data (in addition to numeric data)
(2) We had unknown attribute values
(3) Our data sets were typically not very large, yet we had a lot of attributes
(4) Unlike neural networks, decision trees and rule sets are human-readable, comprehensible, and can be modified manually if necessary. Since we had problems with non-representative data but understood both those problems and our system quite well, it was sometimes advantageous for us to modify the decision trees by hand.

If you are in a similar situation, I recommend See5/C5 as well as this book.