Skip to main content
Version: Next

PowerBI

Certified

Important Capabilities

CapabilityStatusNotes
Asset ContainersEnabled by default
Column-level LineageDisabled by default, configured using extract_column_level_lineage.
Data ProfilingOptionally enabled via configuration profiling.enabled
DescriptionsEnabled by default
Detect Deleted EntitiesOptionally enabled via stateful_ingestion.remove_stale_metadata
Extract OwnershipEnabled by default
Extract TagsEnabled by default
Platform InstanceEnabled by default
Schema MetadataEnabled by default
Table-Level LineageEnabled by default, configured using extract_lineage.

This plugin extracts the following:

  • Power BI dashboards, tiles and datasets
  • Names, descriptions and URLs of dashboard and tile
  • Owners of dashboards

Configuration Notes

  1. Refer Microsoft AD App Creation doc to create a Microsoft AD Application. Once Microsoft AD Application is created you can configure client-credential i.e. client_id and client_secret in recipe for ingestion.

  2. Enable admin access only if you want to ingest dataset, lineage and endorsement tags. Refer section Admin Ingestion vs. Basic Ingestion for more detail.

    Login to PowerBI as Admin and from Admin API settings allow below permissions

    • Allow service principals to use read-only admin APIs
    • Enhance admin APIs responses with detailed metadata
    • Enhance admin APIs responses with DAX and mashup expressions

Concept mapping

PowerBIDatahub
DashboardDashboard
Dataset's TableDataset
TileChart
Report.webUrlChart.externalUrl
WorkspaceContainer
ReportDashboard
PaginatedReportDashboard
PageChart
AppDashboard
  • If Tile is created from report then Chart.externalUrl is set to Report.webUrl.
  • The Page is unavailable for PowerBI PaginatedReport.

Lineage

This source extract table lineage for tables present in PowerBI Datasets. Lets consider a PowerBI Dataset SALES_REPORT and a PostgreSQL database is configured as data-source in SALES_REPORT dataset.

Consider SALES_REPORT PowerBI Dataset has a table SALES_ANALYSIS which is backed by SALES_ANALYSIS_VIEW of PostgreSQL Database then in this case SALES_ANALYSIS_VIEW will appear as upstream dataset for SALES_ANALYSIS table.

You can control table lineage ingestion using extract_lineage configuration parameter, by default it is set to true.

PowerBI Source extracts the lineage information by parsing PowerBI M-Query expression.

PowerBI Source supports M-Query expression for below listed PowerBI Data Sources

  1. Snowflake
  2. Oracle
  3. PostgreSQL
  4. Microsoft SQL Server
  5. Google BigQuery
  6. Databricks

Native SQL query parsing is supported for Snowflake and Amazon Redshift data-sources.

For example refer below native SQL query. The table OPERATIONS_ANALYTICS.TRANSFORMED_PROD.V_UNIT_TARGET will be ingested as upstream table.

let
Source = Value.NativeQuery(
Snowflake.Databases(
"sdfsd788.ws-east-2.fakecomputing.com",
"operations_analytics_prod",
[Role = "OPERATIONS_ANALYTICS_MEMBER"]
){[Name = "OPERATIONS_ANALYTICS"]}[Data],
"select #(lf)UPPER(REPLACE(AGENT_NAME,\'-\',\'\')) AS Agent,#(lf)TIER,#(lf)UPPER(MANAGER),#(lf)TEAM_TYPE,#(lf)DATE_TARGET,#(lf)MONTHID,#(lf)TARGET_TEAM,#(lf)SELLER_EMAIL,#(lf)concat((UPPER(REPLACE(AGENT_NAME,\'-\',\'\'))), MONTHID) as AGENT_KEY,#(lf)UNIT_TARGET AS SME_Quota,#(lf)AMV_TARGET AS Revenue_Quota,#(lf)SERVICE_QUOTA,#(lf)BL_TARGET,#(lf)SOFTWARE_QUOTA as Software_Quota#(lf)#(lf)from OPERATIONS_ANALYTICS.TRANSFORMED_PROD.V_UNIT_TARGETS#(lf)#(lf)where YEAR_TARGET >= 2020#(lf)and TEAM_TYPE = \'foo\'#(lf)and TARGET_TEAM = \'bar\'",
null,
[EnableFolding = true]
),
#"Added Conditional Column" = Table.AddColumn(
Source,
"Has PS Software Quota?",
each
if [TIER] = "Expansion (Medium)" then
"Yes"
else if [TIER] = "Acquisition" then
"Yes"
else
"No"
)
in
#"Added Conditional Column"

Use full-table-name in from clause. For example dev.public.category

M-Query Pattern Supported For Lineage Extraction

Lets consider a M-Query which combine two PostgreSQL tables. Such M-Query can be written as per below patterns.

Pattern-1

let
Source = PostgreSQL.Database("localhost", "book_store"),
book_date = Source{[Schema="public",Item="book"]}[Data],
issue_history = Source{[Schema="public",Item="issue_history"]}[Data],
combine_result = Table.Combine({book_date, issue_history})
in
combine_result

Pattern-2

let
Source = PostgreSQL.Database("localhost", "book_store"),
combine_result = Table.Combine({Source{[Schema="public",Item="book"]}[Data], Source{[Schema="public",Item="issue_history"]}[Data]})
in
combine_result

Pattern-2 is not supported for upstream table lineage extraction as it uses nested item-selector i.e. {Source{[Schema="public",Item="book"]}[Data], Source{[Schema="public",Item="issue_history"]}[Data]} as argument to M-QUery table function i.e. Table.Combine

Pattern-1 is supported as it first assigns the table from schema to variable and then variable is used in M-Query Table function i.e. Table.Combine

Extract endorsements to tags

By default, extracting endorsement information to tags is disabled. The feature may be useful if organization uses endorsements to identify content quality.

Please note that the default implementation overwrites tags for the ingested entities, if you need to preserve existing tags, consider using a transformer with semantics: PATCH tags instead of OVERWRITE.

Profiling

The profiling implementation is done through querying DAX query endpoint. Therefore, the principal needs to have permission to query the datasets to be profiled. Usually this means that the service principal should have Contributor role for the workspace to be ingested. Profiling is done with column-based queries to be able to handle wide datasets without timeouts.

Take into account that the profiling implementation executes a fairly big number of DAX queries, and for big datasets this is a significant load to the PowerBI system.

The profiling_pattern setting may be used to limit profiling actions to only a certain set of resources in PowerBI. Both allowed and deny rules are matched against the following pattern for every table in a PowerBI Dataset: workspace_name.dataset_name.table_name. Users may limit profiling with these settings at table level, dataset level or workspace level.

Admin Ingestion vs. Basic Ingestion

PowerBI provides two sets of API i.e. Basic API and Admin API.

The Basic API returns metadata of PowerBI resources where service principal has granted access explicitly on resources, whereas Admin API returns metadata of all PowerBI resources irrespective of whether service principal has granted or doesn't grant access explicitly on resources.

The Admin Ingestion (explained below) is the recommended way to execute PowerBI ingestion as this ingestion can extract most of the metadata.

Admin Ingestion: Service Principal As Admin in Tenant Setting and Added as Member In Workspace

To grant admin access to the service principal, visit your PowerBI tenant Settings.

If you have added service principal as member in workspace and also allowed below permissions from PowerBI tenant Settings

  • Allow service principal to use read-only PowerBI Admin APIs
  • Enhance admin APIs responses with detailed metadata
  • Enhance admin APIs responses with DAX and mashup expressions

PowerBI Source would be able to ingest below listed metadata of that particular workspace

  • Lineage
  • PowerBI Dataset
  • Endorsement as tag
  • Dashboards
  • Reports
  • Dashboard Tiles
  • Report Pages
  • App

If you don't want to add a service principal as a member in your workspace, then you can enable the admin_apis_only: true in recipe to use PowerBI Admin API only.

Caveats of setting admin_apis_only to true:

  • Report's pages would not get ingested as page API is not available in PowerBI Admin API
  • PowerBI Parameters would not get resolved to actual values while processing M-Query for table lineage
  • Dataset profiling is unavailable, as it requires access to the workspace API

Basic Ingestion: Service Principal As Member In Workspace

If you have added service principal as member in workspace then PowerBI Source would be able to ingest below metadata of that particular workspace

  • Dashboards
  • Reports
  • Dashboard's Tiles
  • Report's Pages

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
type: "powerbi"
config:
# Your Power BI tenant identifier
tenant_id: a949d688-67c0-4bf1-a344-e939411c6c0a

# Azure AD Application identifier
client_id: foo
# Azure AD App client secret
client_secret: bar

# Ingest elements of below PowerBi Workspace into Datahub
workspace_id_pattern:
allow:
- 4bd10256-e999-45dd-8e56-571c77153a5f
deny:

# Enable / Disable ingestion of ownership information for dashboards
extract_ownership: true

# Enable/Disable extracting workspace information to DataHub containers
extract_workspaces_to_containers: true

# Enable / Disable ingestion of endorsements.
# Please notice that this may overwrite any existing tags defined to ingested entities!
extract_endorsements_to_tags: false

# Optional -- This mapping is optional and only required to configure platform-instance for upstream tables
# A mapping of PowerBI datasource's server i.e host[:port] to data platform instance.
# :port is optional and only needed if your datasource server is running on non-standard port.
# For Google BigQuery the datasource's server is google bigquery project name
server_to_platform_instance:
ap-south-1.snowflakecomputing.com:
platform_instance: operational_instance
env: DEV
oracle-server:1920:
platform_instance: high_performance_production_unit
env: PROD
big-query-sales-project:
platform_instance: sn-2
env: QA

# Need admin_api, only ingest workspace that are modified since...
modified_since: "2023-02-10T00:00:00.0000000Z"

ownership:
# create powerbi user as datahub corpuser, false will still extract ownership of workspace/ dashboards
create_corp_user: false
# use email to build user urn instead of powerbi user identifier
use_powerbi_email: true
# remove email suffix like @acryl.io
remove_email_suffix: true
# only ingest user with certain authority
owner_criteria: ["ReadWriteReshareExplore","Owner","Admin"]
# wrap powerbi tables (datahub dataset) under 1 powerbi dataset (datahub container)
extract_datasets_to_containers: true
# only ingest dataset that are endorsed, like "Certified"
filter_dataset_endorsements:
allow:
- Certified

# extract powerbi dashboards and tiles
extract_dashboards: false
# extract powerbi dataset table schema
extract_dataset_schema: true

# Enable PowerBI dataset profiling
profiling:
enabled: false
# Pattern to limit which resources to profile
# Matched resource format is following:
# workspace_name.dataset_name.table_name
profile_pattern:
deny:
- .*


sink:
# sink configs

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
client_id 
string
Azure app client identifier
client_secret 
string
Azure app client secret
tenant_id 
string
PowerBI tenant identifier
admin_apis_only
boolean
Retrieve metadata using PowerBI Admin API only. If this is enabled, then Report Pages will not be extracted. Admin API access is required if this setting is enabled
Default: False
convert_lineage_urns_to_lowercase
boolean
Whether to convert the urns of ingested lineage dataset to lowercase
Default: True
convert_urns_to_lowercase
boolean
Whether to convert the PowerBI assets urns to lowercase
Default: False
enable_advance_lineage_sql_construct
boolean
Whether to enable advance native sql construct for parsing like join, sub-queries. along this flag , the native_query_parsing should be enabled. By default convert_lineage_urns_to_lowercase is enabled, in-case if you have disabled it in previous ingestion execution then it may break lineageas this option generates the upstream datasets URN in lowercase.
Default: True
extract_app
boolean
Whether to ingest workspace app. Requires DataHub server 0.14.2+.
Default: False
extract_column_level_lineage
boolean
Whether to extract column level lineage. Works only if configs native_query_parsing, enable_advance_lineage_sql_construct & extract_lineage are enabled.Works for M-Query where native SQL is used for transformation.
Default: False
extract_dashboards
boolean
Whether to ingest PBI Dashboard and Tiles as Datahub Dashboard and Chart
Default: True
extract_dataset_schema
boolean
Whether to ingest PBI Dataset Table columns and measures
Default: False
extract_datasets_to_containers
boolean
PBI tables will be grouped under a Datahub Container, the container reflect a PBI Dataset
Default: False
extract_endorsements_to_tags
boolean
Whether to extract endorsements to tags, note that this may overwrite existing tags. Admin API access is required if this setting is enabled.
Default: False
extract_independent_datasets
boolean
Whether to extract datasets not used in any PowerBI visualization
Default: False
extract_lineage
boolean
Whether lineage should be ingested between X and Y. Admin API access is required if this setting is enabled
Default: True
extract_ownership
boolean
Whether ownership should be ingested. Admin API access is required if this setting is enabled. Note that enabling this may overwrite owners that you've added inside DataHub's web application.
Default: False
extract_reports
boolean
Whether reports should be ingested
Default: True
extract_workspaces_to_containers
boolean
Extract workspaces to DataHub containers
Default: True
include_workspace_name_in_dataset_urn
boolean
It is recommended to set this to true, as it helps prevent the overwriting of datasets.Read section #11560 at https://datahubproject.io/docs/how/updating-datahub/ before enabling this option.To maintain backward compatibility, this is set to False.
Default: False
m_query_parse_timeout
integer
Timeout for PowerBI M-query parsing in seconds. Table-level lineage is determined by analyzing the M-query expression. Increase this value if you encounter the 'M-Query Parsing Timeout' message in the connector report.
Default: 70
modified_since
string
Get only recently modified workspaces based on modified_since datetime '2023-02-10T00:00:00.0000000Z', excludeInActiveWorkspaces limit to last 30 days
native_query_parsing
boolean
Whether PowerBI native query should be parsed to extract lineage
Default: True
patch_metadata
boolean
Patch dashboard metadata
Default: True
platform_instance
string
The instance of the platform that all assets produced by this recipe belong to
scan_batch_size
integer
batch size for sending workspace_ids to PBI, 100 is the limit
Default: 1
scan_timeout
integer
timeout for PowerBI metadata scanning
Default: 60
workspace_id_as_urn_part
boolean
It is recommended to set this to True only if you have legacy workspaces based on Office 365 groups, as those workspaces can have identical names. To maintain backward compatibility, this is set to False which uses workspace name
Default: False
env
string
The environment that all assets produced by this connector belong to
Default: PROD
filter_dataset_endorsements
AllowDenyPattern
Filter and ingest datasets which are 'Certified' or 'Promoted' endorsement. If both are added, dataset which are 'Certified' or 'Promoted' will be ingested . Default setting allows all dataset to be ingested
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
filter_dataset_endorsements.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
filter_dataset_endorsements.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
filter_dataset_endorsements.allow.string
string
filter_dataset_endorsements.deny
array
List of regex patterns to exclude from ingestion.
Default: []
filter_dataset_endorsements.deny.string
string
ownership
OwnershipMapping
Configure how is ownership ingested
Default: {'create_corp_user': True, 'use_powerbi_email': Tr...
ownership.create_corp_user
boolean
Whether ingest PowerBI user as Datahub Corpuser
Default: True
ownership.dataset_configured_by_as_owner
boolean
Take PBI dataset configuredBy as dataset owner if exist
Default: False
ownership.remove_email_suffix
boolean
Remove PowerBI User email suffix for example, @acryl.io
Default: False
ownership.use_powerbi_email
boolean
Use PowerBI User email to ingest as corpuser, default is powerbi user identifier
Default: True
ownership.owner_criteria
array
Need to have certain authority to qualify as owner for example ['ReadWriteReshareExplore','Owner','Admin']
ownership.owner_criteria.string
string
profile_pattern
AllowDenyPattern
Regex patterns to filter tables for profiling during ingestion. Note that only tables allowed by the table_pattern will be considered. Matched format is 'workspacename.datasetname.tablename'
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
profile_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
profile_pattern.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
profile_pattern.allow.string
string
profile_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
profile_pattern.deny.string
string
server_to_platform_instance
One of map(str,union), map(str,union)
server_to_platform_instance.key.platform_instance
string
DataHub platform instance name. To generate correct urn for upstream dataset, this should match with platform instance name used in ingestion recipe of other datahub sources.
server_to_platform_instance.key.metastore 
string
Databricks Unity Catalog metastore name.
server_to_platform_instance.key.env
string
The environment that all assets produced by DataHub platform ingestion source belong to
Default: PROD
workspace_id_pattern
AllowDenyPattern
Regex patterns to filter PowerBI workspaces in ingestion. Note: This field works in conjunction with 'workspace_type_filter' and both must be considered when filtering workspaces.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
workspace_id_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
workspace_id_pattern.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
workspace_id_pattern.allow.string
string
workspace_id_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
workspace_id_pattern.deny.string
string
workspace_type_filter
array
Ingest the metadata of the workspace where the workspace type corresponds to the specified workspace_type_filter. Note: This field works in conjunction with 'workspace_id_pattern'. Both must be matched for a workspace to be processed.
Default: ['Workspace']
workspace_type_filter.enum
Enum
One of: "Workspace", "PersonalGroup", "Personal", "AdminWorkspace", "AdminInsights"
profiling
PowerBiProfilingConfig
Default: {'enabled': False}
profiling.enabled
boolean
Whether profiling of PowerBI datasets should be done
Default: False
stateful_ingestion
StatefulStaleMetadataRemovalConfig
PowerBI Stateful Ingestion Config.
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Code Coordinates

  • Class Name: datahub.ingestion.source.powerbi.powerbi.PowerBiDashboardSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for PowerBI, feel free to ping us on our Slack.