Just before Christmas I bought myself yet another data mining book (I have a few dozen). This one somehow slipped by me for 10 years, but I'm glad I finally stumbled upon it. Originally published in 1999, Dorian Pyle's "Data Preparation For Data Mining" was written when data mining was less widespread and 'Predictive Analytics' wasn't the buzzword it is today.
The only criticisms I could possibly raise are:
1) that everything has a statistical basis.
- For example, one technique I use to redistribute heavily skewed data is simple binning by count. I work in telecommunications and the behavioural data is always extremely skewed. Log functions don't work, so I often use SQL to convert variables into 100 percentile bins, where each bin holds the same number of rows (customers). That type of technique isn't in the book, but several statistically based alternatives are. I'm not convinced they would work with extremely skewed data, but they are well explained and offer useful insights.
2) that there is no mention of SQL and no step-by-step examples of data manipulation (nothing like 'before' and 'after' pictures). Ideas or examples for derived variables are lacking too.
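To make the binning-by-count idea from point 1 concrete, here is a minimal sketch. It is written in Python rather than SQL purely for illustration, and the function name percentile_bins and the sample call minutes are made up for the example. It ranks the rows by value and cuts the ranks into equal-sized groups, which is similar in spirit to what a SQL NTILE window function does.

```python
def percentile_bins(values, n_bins=100):
    """Assign each row a bin from 1..n_bins so bins hold (nearly) equal row counts.

    Hypothetical helper for illustration: bins are based on rank, not on the
    raw value, so extreme skew in the values does not affect bin sizes.
    """
    n = len(values)
    # Rank the rows by value; sorting indices keeps the original row order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto bins 1..n_bins.
        bins[i] = rank * n_bins // n + 1
    return bins


# Made-up, heavily skewed usage figures (e.g. call minutes per customer).
call_minutes = [0, 0, 1, 2, 3, 5, 8, 13, 500, 12000]
print(percentile_bins(call_minutes, n_bins=5))
# → [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Note that tied values can land in different bins, because the split is by row count alone. With real customer data you would use 100 bins and far more rows than bins.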
So far I've read through the first 275 pages and the odd additional chapter. It's surprisingly easy to read and explains the statistics well. It's definitely a book I will refer to, and well worth buying.
In February 2004 Dorian Pyle made an interesting post about things to avoid when data mining:
"This Way Failure Lies" http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=17602328