hohlick / power-query-excel-formats

A collection of M code to get various formats from Excel sheets in Power Query

License: MIT License

powerquery powerbi m workbook sheet excel

power-query-excel-formats's Introduction

README in Russian

Power-Query-Excel-Formats

Main purpose

Information stored in Excel workbooks often carries additional metadata that is important for analysis. This metadata can take various forms, mostly cell formats, number formats, colours, etc. Often a row, column or cell format is a critical element of the workbook data set.

At the moment (Aug 2017), Microsoft Power Query and the corresponding "Query Editor" in Microsoft Power BI do not natively let users read the additional information stored in Excel workbooks and spreadsheets as applied formats, except (sometimes) the data types of calculated values.

The wide range of formats and the complexity of extracting their parameters with other tools, such as Power Query, lead to the loss of a noticeable share of information. An additional problem is storing the extracted format data in Power Query for further use.

Tasks and methods

Tasks

Develop a set of functions to extract/import specific info about sheet and/or cell formats into Power Query.

In the future, develop universal functions for:

  • spreadsheet information (info about rows, columns, the sheet as a whole)
  • cell information (colors, fonts, alignment, number formats, indents, etc.)

The methods are versatile because they rely on the same tools (unzip and XML parsing) and on similar data sources. The specific kind of function result can be selected via a function argument.


Methods

Unzip

The main method is to unpack the XLSX/XLSM file as a zip archive and work with the XML documents inside. Unpacking is performed with the custom function UnZip.pq by Mike White, but any other way of unpacking zip archives in Power Query can be used.
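
As a rough illustration of working with the unpacked archive, the sketch below picks one XML part out of the unzipped table. It assumes the unzip function returns a table with FileName and Content columns (as the code quoted in the issues further down suggests). The name fnGetPart and the error reason are hypothetical, not part of this repository.

```powerquery
let
    // Hypothetical helper: returns the binary content of one part of the unzipped workbook.
    // Assumption: UnZipped has FileName (text) and Content (binary) columns.
    fnGetPart = (UnZipped as table, PartName as text) as binary =>
        let
            Matches = Table.SelectRows(UnZipped, each [FileName] = PartName),
            Part =
                if Table.IsEmpty(Matches) then
                    error Error.Record("PartNotFound", "Part is missing from the archive", PartName)
                else
                    Matches{0}[Content]
        in
            Part
in
    fnGetPart
```

With such a helper, the styles part could be fetched as fnGetPart(UnZipped, "xl/styles.xml"), where UnZipped is the result of whichever unzip function is in use.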

XML Parsing

After unzipping, the XML files (binary type) from the workbook structure become available to the (current) main function. They can be parsed with the built-in functions Xml.Tables or Xml.Document, or with any other suitable XML parsing method.

  • Main problem: cell formats are stored separately from the cells, the cells themselves are stored inside row elements, and cell addresses are stored in A1 notation (they need an additional conversion to R1C1 style or similar).
  • Additional problem: linking/mapping the extracted format info to the cell position in the Power Query table.
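
A minimal sketch of the Xml.Document route, under strong simplifying assumptions: SampleXml is a made-up, namespace-free fragment, while real sheet XML carries namespaces and many more attributes. Every cell in the sample has both an r (address) and an s (style index) attribute, which real files do not guarantee. The step names are illustrative only.

```powerquery
let
    // Hypothetical, heavily reduced worksheet XML used only for illustration
    SampleXml = "<worksheet><sheetData><row r=""1""><c r=""A1"" s=""3""><v>10</v></c><c r=""B1"" s=""0""><v>20</v></c></row></sheetData></worksheet>",
    Doc = Xml.Document(SampleXml),
    // Children of the root element, then the children of <sheetData>
    SheetData = Doc{0}[Value]{[Name = "sheetData"]}[Value],
    Rows = Table.SelectRows(SheetData, each [Name] = "row"),
    // Each row's Value is a table of its <c> elements; stack them into one table
    Cells = Table.Combine(Rows[Value]),
    // Turn every cell's attribute table into a record such as [r = "A1", s = "3"]
    AttrRecords = List.Transform(
        Cells[Attributes],
        (attrs) => Record.FromTable(Table.SelectColumns(attrs, {"Name", "Value"}))
    ),
    Result = Table.FromRecords(
        List.Transform(
            AttrRecords,
            (attr) => [Address = attr[r], StyleIndex = Number.FromText(attr[s])]
        )
    )
in
    Result
```

The StyleIndex values only become meaningful once they are looked up in the styles part of the workbook, which is the mapping problem described in this section.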
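
The A1-to-row/column conversion mentioned above can be sketched as follows. The function name fnA1ToRowCol is hypothetical, and the sketch assumes a single-cell address (not a range) made of Latin letters and digits.

```powerquery
let
    // Hypothetical helper: converts an A1-style address such as "AB12" into row and column numbers
    fnA1ToRowCol = (Address as text) as record =>
        let
            // Keep only the letters (column part) and only the digits (row part)
            Letters = Text.Upper(Text.Select(Address, {"A".."Z", "a".."z"})),
            Digits = Text.Select(Address, {"0".."9"}),
            // Treat the letters as a base-26 number where A = 1 ... Z = 26
            Column = List.Accumulate(
                Text.ToList(Letters),
                0,
                (state, ch) => state * 26 + Character.ToNumber(ch) - 64
            ),
            Row = Number.FromText(Digits)
        in
            [Row = Row, Column = Column]
in
    fnA1ToRowCol("AB12")  // [Row = 12, Column = 28]
```

Because Text.Select drops everything except letters and digits, absolute references such as "$AB$12" give the same result.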

Work plan

(released projects have hyperlinks)

  1. Sheet structure:
    • rows outline levels,
    • columns outline levels,
    • extended rows state (visibility, spans, outlines, collapsed, etc.),
    • extended columns state.
  2. Cell indents and alignment
  3. Cell number formats
  4. Cell color
  5. Adding top-left rows and columns to UsedRange/dimension (see this post about the UsedRange pitfall)
  6. Additional formats, conditional formats and further development

power-query-excel-formats's People

Contributors

hohlick

power-query-excel-formats's Issues

Add parameters

  1. Parameters
    • exclude WSRel
    • add column number (R1C1) as parameter
    • add column letter (A1) as parameter
  2. With a column parameter, pre-select the cells (not sure whether this will help performance).
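
If both the column-number and the column-letter parameter are supported, one could be normalized to the other. A minimal sketch of the number-to-letter direction, where the name fnColumnLetter is hypothetical and not part of the repository:

```powerquery
let
    // Hypothetical helper: converts a 1-based column number to its A1 letter(s), e.g. 28 -> "AB"
    fnColumnLetter = (ColumnNumber as number) as text =>
        let
            // Peel off the last letter, then recurse on what remains (bijective base 26)
            Step = (k as number, acc as text) as text =>
                if k <= 0 then
                    acc
                else
                    @Step(
                        Number.IntegerDivide(k - 1, 26),
                        Character.FromNumber(65 + Number.Mod(k - 1, 26)) & acc
                    )
        in
            Step(ColumnNumber, "")
in
    fnColumnLetter(28)  // "AB"
```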

Replace argument FullPath with FileContent

Hi Max,
I tried to use this function in one of my solutions. However, the current version cannot be used with Scheduled Refresh because Power BI can't determine the data source.

I changed the function: replaced the argument FullPath with FileContent.
So, in a query it looks like
FullPath = "C:\Temp\File.xlsx",
file = File.Contents(FullPath),
Source = fGetNumberFormats(file, "SheetName", 1, true),
... and so on ...

Scheduled refresh works fine with this structure.

BR,
Ivan

This is common part for sheet relations

Source = Excel.Workbook(File.Contents(FullPath), false, true),
// leave sheets only
FilteredSheets = Table.SelectRows(Source, each ([Kind] = "Sheet")),
// sheets in PQ initially in appearance order, i.e. sheets index (despite visibility)
AddSheetsIndex = Table.AddIndexColumn(FilteredSheets, "Index", 1, 1),
// check SheetNames parameter
SheetNames = if SheetNames is text then {SheetNames} else if SheetNames is list then SheetNames else null,
// filter sheets by name if provided
FilteredByNames = if SheetNames = null or List.IsEmpty(SheetNames) then AddSheetsIndex else Table.SelectRows(AddSheetsIndex, each List.Contains(SheetNames, [Name])),
// UnZip file
UnZipped = Table.Buffer(fnUnZip(File.Contents(FullPath))),
/*
let
Source = Folder.Files(Folder),
file = Source{[Name = FileName, #"Folder Path" = Folder & "\"]}[Content],
UnZippedFile = Table.Buffer(fnUnZip(file))
in
Table.Buffer(UnZippedFile),
*/
// relations id table for sheets
workbook =
let
Source = UnZipped,
Content = Source{[FileName ="xl/workbook.xml"]}[Content],
ImportedXML = Xml.Tables(Content,null,TextEncoding.Utf8),
sheetsTable = ImportedXML{[Name = "sheets"]}[Table],
sheetTable = sheetsTable{[Name = "sheet"]}[Table],
ExpandedRel = Table.ExpandTableColumn(sheetTable, "http://schemas.openxmlformats.org/officeDocument/2006/relationships", {"Attribute:id"}, {"Attribute:id"}),
typed = Table.TransformColumnTypes(ExpandedRel,{{"Attribute:name", type text}, {"Attribute:sheetId", Int64.Type}, {"Attribute:id", type text}})
in
typed,
// sheets relations id to XML target files
workbook_rels =
let
Source = UnZipped,
Filtered = Table.SelectRows(Source, each [FileName]="xl/_rels/workbook.xml.rels"),
GetXML = Table.TransformColumns(Filtered, {"Content", each Xml.Tables(_,null,65001)}),
XMLContent = GetXML{0}[Content]{[Name="Relationship"]}[Table],
FilteredSheetsRel = Table.SelectRows(XMLContent, each [#"Attribute:Type"] = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"),
Removed = Table.RemoveColumns(FilteredSheetsRel,{"Attribute:Type"})
in
Removed,
// merge relations id (via sheets name)
MergedRelationsID = Table.Join(FilteredByNames, {"Name"}, workbook, {"Attribute:name"}),
// join workbook relations
MergedRelationsTarget = Table.Join(MergedRelationsID,{"Attribute:id"},workbook_rels,{"Attribute:Id"}),
