Introduction
A key issue in data mining is not only generating the results but also interpreting them. Unfortunately it is often difficult to interpret the results of data mining and many users are turning to Excel and its data mining add-in tool.
Requirements
You will need to have permission to access SQL Server Analysis Services. You will need to have Excel installed (preferably 2007 or later). The Microsoft SQL Server Data Mining Add-ins for Microsoft Office 2007 can be downloaded from:
This package includes two add-ins for Microsoft Office Excel 2007 (Table Analysis Tools and Data Mining Client) and one add-in for Microsoft Office Visio 2007 (Data Mining Templates). In this article, will focus on the Table Analysis Tool for Excel 2007.
Unfortunately, this add-in does not work in a 64 bit environment and Microsoft has even said we won’t get 64bit support version in the near future.
Installation
After the standard dialog box in the installation, you will get the following dialog requesting you to select your installation options.
By default, the Table Analysis Tools for Excel is selected and if you wish, you can select the other two options.
Configuration
In the start menu under Microsoft SQL 2008 Data Mining Add-ins, you will see a short cut named Server Configuration Utility which is the tool you to configure the add-in.
At the first dialog, you need to enter the analysis server name, the authentication method should be Windows Authentication as analysis services does not support SQL Server Authentication.
You need to allow the creation of the temporary mining model in SQL Server Analysis Service(SSAS). You can enable this from SQL Server Analysis Service by selecting the properties of the Analysis Service logging in to the SQL Server Analysis Services from SQL Server Management Studio.
Also, you can configure this from the installation wizard.
In the wizard, the next step is to create a SSAS database or else you can allow temporary mining models in the SSAS database.
With the installation, you will get a sample Excel file named DMAddings_SampleData.xlsx to test with. When you open the excel file, in the Analyze ribbon you will see that available for table analyze:
The below table lists the data mining algorithms used by each analysis option.
Menu Option | Data Mining Algorithm |
Analyze Key Influencers | Naïve Bayes |
Detect Categories | Clustering |
Fill form Example | Logistic Regression |
Forecast | Time Series |
Highlight Exceptions | Clustering |
Scenario Analysis (Goal Seek) | Logistic Regression |
Scenario Analysis (What If) | Logistic Regression |
Prediction Calculator | Logistic Regression |
Shopping Basket Analysis | Association Rule |
Analyze Key Influences
When you have a data set there can be several attributes. For example, in case of a customer dataset, there can be several attributes like Marital Status, Gender, Income and Occupation etc. Analyze Key Influences examines what are the influencing factors for a given attribute. In the Table Analysis Tools Sample sheet, you can find a table with customer attributes.
Let us assume that we wish to analyse the Occupation column for key factors.
1. First select Analyze Key Influences from the Analyze ribbon.
2. Next select the Occupation column for analyze:
3. Next, choose the columns for analysis by clicking Choose columns to be used for analysis
By default, all columns are selected. However, you can ignore columns such as ID. If there are derived columns you should ignore these as well. For example, there can be a bonus column which is based on a percentage of the salary column.
Press OK for this dialog and click Run for the other dialog and you will get the following results in a new sheet in the same Excel file.
The abThe above table indicates that Income range 39,050 – 71,062 is more favored to Skilled Manual Occupation.
The Relative Impact column, which is the last column, indicates the relative impact to the Occupation. This is a type of rank with strongest getting a score100 and the weakest a score of 0.
In case you want to find to out the relationship between selected factors, you can use the table filtering option in Excel. By doing this you can ignore unnecessary values. For example, if you want to find out the influencing factors for Age and Income towards Occupation, you can filter the column field and it will look as below:
Discrimination on Key Influencers
When generating a key influencing report, you can add any number of discrimination reports which compare how key factors differentiate between them.
From the following dialog box you can view a report of the discrimination between occupations of Skilled manual and Clerical occupations.
The below shows the discrimination report between Skilled Manual and Clerical. The final column was added manually and not by the add-in tool. This column is the sum of column 3 and column 4.
The above report will list factors that will influence for the selected occupations in descending order. You can add any number of discrimination reports, but you have to generate them at the instance and so you cannot add them later.
The influencers presented in the discrimination report are often different from those presented in the main report. Some influencers from the main report are not present in the discrimination report. Those are important when comparing Skilled manual and Clerical against all other occupation types.
Detect Categories
When working with large sets of data, it is much easier to deal with a smaller and more manageable sub set of the data which has similar behavior patterns. This practice is called Customer Segmentation.
The Detect Categories table analysis tools finds natural groups in the data. It analyses your data, finds the most common combinations of column values and then defines groups based on these common patterns.
As in the previous case, you can use the Table Analysis Tools Sample Excel sheet. After selecting the table in the sheet, select Detect Categories from the Analyze table which will bring up following dialog box:
As in the previous case, you need to select the attributes that need to be categorized.
Fill Form
Fill form is a prediction technique using sample or known data. If you navigate to the next sheet, Fill Form Example, you will see something like this:
In the above data set you will see High Value Customer for the first 10 rows but the rest of the rows are not filled.
Fill Form will identify the pattern for the existing data and it will fill the blank columns.
You need to click the Fill Form button from the add-in ribbon, following screen will be displayed:
In the above dialog, you need to select the column that need to predict. By clicking the link Choose columns to be used for analysis you can select the columns you want to take account for in making the predictions. For example, you can ignore the ID column.
When you click Run button, the new column High Value Customer_Extended is added to the sheet in which you will get the predictions for the high value customer. Apart from this, the new sheet will be added as follows.
The above sheet will tell you what the most and least influencing factors are when predicting the selected column.
Highlight Exceptions
When collecting data, there can be anomalies due to various reasons. Data mining techniques can be used to detect those anomalies. Highlight Exceptions is a technique available to detect anomalies.
After getting into the sheet and selecting the columns you need to analyze, click the Highlight Exception button.
The exception rows will be highlighted, along with the column which is the reason for the row to become exception.
Apart from highlighting the exception rows, there will be another sheet added to the workbook which is summarization of anomalies:
This should get you started using Excel for SQL Server data mining, in future I hope to write another articles explaining other techniques such as Forecast, Scenario Analysis, Prediction Calculator and Shopping Basket Analysis.
No comments:
Post a Comment