Building Open Source BI Solutions with Pentaho and MySQL

Authors: Matt Casters (Chief Data Integration, Pentaho), Roland Bouman (Curriculum Developer, Sun Microsystems)

Datawarehousing and Business Intelligence are areas of ICT that have been associated with large corporate companies, such as banks, retailers and insurance companies. However, several recent developments indicate that the scope of Business Intelligence and Datawarehousing is increasing.

Most notably, the usage of the internet has increased the need for BI and DWH in areas where it was unknown and unused: - Analysis of web traffic can lead to better insight in marketing opportunities and improvement of websites -Typical web and web 2.0 companies have large data sets that can only be efficiently stored stored and analyzed using techniques and methodologies used in business intelligence and data warehousing - Efficient management of internet advertising (for example using google ads) can be better achieved using DWH's and BI.

Another important factor is the ongoing commoditization of the software required for Business Intelligence and Data Warehousing. Most notably, the free and open software world has produced a number of products that are a viable alternative to and can compete with proprietary offerings at a substantial lower cost:
- Open source databases (MySQL, PostgreSQL) - Open source BI Stacks (Pentaho, Jasper) - Open source application servers (Tomcat, JBoss, Glassfish)

In this presentation will provide a brief, to the point overview of BI and Datawarehousing and demonstrate how an effective but simple Datawarehouse can be built using only Free and Open Source software.

Introduction to Datawarehousing
- what is datawarehousing?
- why would I want a data warehouse?
- database to support business intellingce
- Software products and solutions: Open Source vs Proprietary

Designing a Datawarehouse
- case: XXX sample database
- sakila? imdb? iTunes?
- dimensional modelling and star schema

Loading a Datawarehouse
- What is ETL?
- Extracting from existing databases
- Extracting raw data from text files, spreadsheets, or the internet
- Cleaning and transforming data:
- filtering
- mapping
- (de)normalization