Microsoft Power Query (PQ) is an Excel add-in that can support self-service Business Intelligence solutions, but can be also very useful in general work with data, data gathering and data discovering. It allows you to retrieve data from many different sources like relational databases or flat files, but also from any website or even from SharePoint or ODataFeed. In addition to this also allows for integration, data cleaning and transformation the collected data. All of this can be accomplished with using the special M language or just the Excel menu
In this post I will give you an example of using the Power Query. I am going to show one of the most powerful capabilities – downloading data straight from the web with no additional code or software, but you will see most of the available features as well. I would like to show you the whole process from collecting the data through the cleaning, transforming and preparation, to their visualization. Today, we will be playing with the data about Olympic medals from the Wikipedia.
If you would like to start with PQ you need to install the add-in first. You can find the setup files on here: http://www.microsoft.com/pl-pl/download/details.aspx?id=39379 Just keep in mind that it can be install just with the Microsoft Office 2013 Professional Plus, Office 365 ProPlus or Excel 2013 Standalone.
Once you have installed the add-in you will have a new ribbon tab called Power Query which looks like the below. I am not going to describe every button in this tab, but just want to mention about one quite interesting option which is “Online Search” where you can just write in natural language what are you looking for and let to find the data for you. Be sure to try it yourself.
Today, however, we will use the second option: “From Web”. The interesting data are located in the Wikipedia page: http://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table If you look at this site you may notice that the total medals table are split into two tables: one with the countries that have already won at least one medal in the Olympic games and the second table contains the countries that was participating in the Olympic Games, but haven’t won medals. In this scenario we will skip the other tables. When you select this button you just need to paste the Url of the website where you want to your data come from and click “OK”
System will automatically analyses the structure of the web page code and will provide you detected structures. As you can see the results list contains the two interesting tables with medals.
To get the data into the worksheet we just need to click “Load” on each table. After that the data will be downloaded into the separate worksheet.
The important thing is that Excel remembers the connections to the specific sources. Actually we do not just retrieve the data into our workbook, but we are building the “process” that later can be easily reproduced. All the operations are translated into the M language and then can be modified and reused. The collected data we can transform after clicking “Edit” on the connection.
Now, you can see the new windows with the following menu. Here you can find the available transformation that you may want to use while preparing your data.
First of all we will handle the first column: Team (ISO code) which have two information: the country name and the country IOC code. To split this information into separate columns we need to choose this column and then select the “Split column” option. Notice that you can use some most popular features at the right mouse button. We are selecting the “Split Column” and then “By Delimiter…”.
We need to specify the splitting conditions. We are going to split the column by the custom char: “(” and we are going to split at the right most delimiter.
The information is divided into separate column, but we need to clean the new column which looks like: “ALB)”. Make another split operation by “(“, but now at the most right delimiter.
To make sure that we have just the country name and just the IOC code in the second column with no blank spaces before or after the values we can trim the attributes. To do them we can select “Transform” and then “Trim”.
After splitting we have one extra attribute that are actually empty. We can easily delete the unnecessary columns by selecting the option “Remove”.
To finish the preparation we are going to Rename each columns and make sure that each column have the correct data type.
We are done with the first table. Each column have an individual, uncontaminated value; each column have a correct data type and each column have the name that uniquely identify the column.
To save our work we can simply click the “Apply & Close” button.
The generated query will produce the transformed data into our workbook .
To combine the data from two different sources correctly we need to have an identical structure in each sources. Repeat the same steps for the second table and save your work.
Now, we have a prepared data in two different tables that have the same structure, but contains different informations. To combine this data together we can simply click the “Append” button.
After that we can provide which tables we want to combine. Let’s choose those tables.
Excel will automatically detect the correlation between the sources. In the case when there is no corresponding column in the second table Excel will generate NULLS for each cells. In our case the NULL values in the column can be simply replaced by 0, because the columns describes the amount of medals gained in each Olympics games type.
To replace those values we can simply click “Replace Values…”
And then we are going to replace null to 0.
At the end we can name the table and after saving the changes the new table will be added to the workbook.
Another useful option of Power Query is the possibility to merging the tables. In this demo we will merge our NOC_AllTable with some additional data about countries. At the beginning we need to get the data. Just like before let’s paste the url: http://www.geonames.org/countries/ into the Power Query and download the data. Before merging we have to make sure that we have a column which will be used as a key to join the tables. In this case we need just to make the country names upper. When we are ready we can click on the “Merge” icon. We will be able then to select the tables and the attributes key in each table. At the bottom we can see how many records Excel find in the second table were found. We can include in new table only matching records or just all. When Excel will not find the corresponding record it will put “nulls”.
After merging we will see the editor for the new table which have the primary table and one additional attribute with “table” type. To add the columns we have to click on the small icon next to the “NewColumn” name and choose the attributes that we want to add.
To finish the work with integration we need to clean the data:
- Duplicate column with continent abbreviation and in new column replace the shortcut with full name
- Replace null in all columns with the “N/A” or 0
- Replace “,” to empty string and “.” To “,” to be able to change the data type to number in the columns with population and area
- Change data types for all attributes
- Rename the attributes
That’s it. With Power Query and some simple operation we was able to get data from two different websites, clean and prepare them for analysis with no code or programming. What’s cool that all the data gained with Power Query are automatically loaded into the data model and with no extra work we can use them in the Power View or just analyze and visualize in the pivot table or charts as here:
What’s also very cool that after refreshing the workbook connections the process with all this steps will be reproduced so we will always have the most recent data.
Power Query is very powerful tool and makes the data retrieval much more easier. That’s awesome that with no programming skills or specific knowledge after some easy steps we can combine the data from many different sources and prepare them to be useful for analysis. Even though the PQ will never replace the full ETL tools it can be definitely used as a part of real ETL process to get some extra information, it can be definitely used in the self-service Business Intelligence or just used for fun and data discovering.
Do you like it? Already have any ideas for using Power Query? Share with your thoughts in the comments and be prepare for more posts about Business Intelligence, Data Mining and more!
Latest posts by sdrzymala (see all)
- WordPress blog analysis in Power BI - May 1, 2016
- Power BI Desktop. An error while changing the name in Power Query - April 30, 2016
- Microsoft SQL Server Data Types – why this is so important? - April 30, 2016