Thinking Anew

The Elephant Rider in Action

I was very excited to get my hands on a copy of Mahout in Action. Artificial Intelligence and Machine Learning are my favourite subjects in Computer Science. Back in my school days, AI and ML were used primarily in research, and banks sometimes used them for fraud detection. I am glad that they are gradually being adopted by mainstream businesses with the help of easy-to-use frameworks like Mahout, backed by commodity distributed platforms like Hadoop.

The book is broadly organized around the three main categories of algorithms currently implemented in Mahout: Recommendations, Clustering and Classification. The authors take the same approach with each category. They begin with a general introduction, the algorithms available and basic examples. Later chapters go into depth on each group of algorithms, with even more examples and ways to measure outcomes.

In many tutorial books and articles, we often find examples where the data is magically pre-formatted for use and the outcome is pretty much orchestrated. I like that the authors take the trouble to walk through obtaining and preparing real-world data, including cleaning and transformation. Learning to pre-process your data is crucial for any successful ML application. Data sources are often “dirty”. For ML algorithms to be useful, we must learn to deal with the imperfect world of data collection.
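As a rough illustration of what that kind of clean-up can look like, here is a minimal Python sketch. It is not taken from the book: the `clean_ratings` helper and the comma-separated `user,item,rating` input format are assumptions made for this example only.

```python
def clean_ratings(lines):
    """Turn raw 'user,item,rating' lines into (user, item, rating) tuples.

    Skips blank lines, malformed rows, non-numeric ratings, and
    repeated user/item pairs (the first rating seen is kept).
    """
    seen = set()
    cleaned = []
    for line in lines:
        # Split on commas and trim stray whitespace around each field.
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not all(parts):
            continue  # wrong number of fields, or an empty field
        user, item, raw_rating = parts
        try:
            rating = float(raw_rating)
        except ValueError:
            continue  # rating is not a number, e.g. "n/a"
        if (user, item) in seen:
            continue  # already have a rating for this user/item pair
        seen.add((user, item))
        cleaned.append((user, item, rating))
    return cleaned


# Hypothetical "dirty" input: stray spaces, a blank line, a bad rating,
# a duplicate row, and a row with a missing field.
raw = ["1,101,5.0", " 1 , 102 , 3 ", "", "2,101,n/a", "1,101,5.0", "3,103"]
print(clean_ratings(raw))
```

Only the two well-formed, unique rows survive. Real data sets usually need more than this (encoding fixes, ID mapping, outlier handling), but the shape of the work is similar.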

Many of the examples presented in the book will also be familiar to most people, e.g. classifying news articles, recommending products, clustering Twitter users. I find real-world examples and common knowledge a great way to learn a new subject. Some of these examples might even be useful enough to apply to existing projects. Bear in mind that many of the data sets they point you to are actual data collected from familiar websites and applications.

The authors also examine, with examples, how to use Hadoop to support your Mahout implementation. As pointed out in the text, applying ML techniques to large datasets often puts a strain on your computing resources. Mahout was designed from the ground up to use Hadoop for High Performance Computing. You could buy a dedicated book on Hadoop (e.g. Hadoop in Action), but I find the coverage here sufficient to get you started. In the examples, you will again learn to prepare your data, not only for the analytics but also in a data structure that is compatible with Hadoop.

Learning about Machine Learning can be very daunting. I sometimes find it dry and confusing, especially when books focus heavily on the Mathematics involved. I have found the authors' approach to this field of Computer Science well balanced. There is enough detail to get you started, but a whole lot more hand-holding, through examples, to demonstrate how ML can be useful in your projects.

In summary, this is another great book to add to my fairly large collection of Machine Learning textbooks. Here’s why:

  • Authors are major contributors
  • Examples examples examples
  • Practical examples
  • Two books for the price of one