Product Architecture
We organize files and directories into two distinct buckets: core and custom. When we release new versions of GA4Dataform, the installer only updates the contents of the core directories, leaving custom untouched. We designed this approach to prevent accidental overwrites of any customizations you've made after the initial installation.
Do not add new files to the core directory or modify existing core files directly. Any changes made to core may be lost during updates. Place your custom files outside the core directory.
You can safely add new directories anywhere in the repository outside of core. We recommend placing new directories within the custom directory structure.
For example:
To add a new reporting directory, create it as: custom/reporting
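The core/custom rule can be expressed as a simple path check. The sketch below is not part of GA4Dataform: `isCorePath` and `CORE_PREFIXES` are hypothetical names, the file `example.sqlx` is made up, and the sketch assumes that both `definitions/core` and `includes/core` are the directories the installer manages.

```javascript
// Assumption: these two directories are the ones overwritten on update.
const CORE_PREFIXES = ["definitions/core/", "includes/core/"];

// Hypothetical helper: returns true when a repository path lives inside a
// core directory and should therefore not be edited by hand.
function isCorePath(filePath) {
  // Normalize Windows separators and a leading "./" before comparing.
  const normalized = filePath.replace(/\\/g, "/").replace(/^\.\//, "");
  return CORE_PREFIXES.some((prefix) => normalized.startsWith(prefix));
}

console.log(isCorePath("definitions/core/03_outputs/ga4_events.sqlx")); // true
console.log(isCorePath("definitions/custom/reporting/example.sqlx")); // false
```

A check like this could run in a pre-commit hook to flag accidental edits to core files before they are lost in an update.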
Dataform Directories
Directory | Description |
---|---|
definitions | Contains all directories and files related to building models |
definitions/core/01_sources | Contains declarations.js and (future) staging models |
definitions/core/02_intermediate | Contains intermediate models |
definitions/core/03_outputs | Contains output models that should be used for downstream queries |
definitions/core/assertions | Contains all the assertions that check the data quality of our model |
definitions/core/extra | Contains any extra files that fulfill an individual purpose |
definitions/custom | Contains all the custom models that are not part of the core package |
includes | Contains all JS files with reusable variables and functions that help manage the repository |
includes/core/documentation | Contains the JSON files of table fields and descriptions of the output tables |
includes/core/extra | Contains all extra files that fulfill an individual purpose |
includes/custom | Contains all JS files that can be used to customize your setup (config.js) |
Model Descriptions
Model | Description |
---|---|
int_ga4_sessions | GA4 intermediate sessions table that incrementally queries the ga4_events table and creates session-level dimensions and metrics |
ga4_events | GA4 output events table that incrementally queries the raw GA4 export and applies partitioning, clustering, cleaning, and several fixes |
ga4_sessions | GA4 output sessions table that adds last non-direct click attribution and can be used for further transformations or aggregations |
demo_daily_sessions_report | Demo daily session aggregate table that can be connected to Looker Studio for reporting |
demo_diagnostics | Demo diagnostics table that checks for several issues in the past 64 days |
source_categories | Materializes the core/extra/source_categories.json file into a table. Used for Default Channel Grouping |
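Since ga4_sessions is meant for further transformations or aggregations, a custom model might build on it roughly as follows. This is a minimal sketch under stated assumptions: `buildDailySessionsQuery` is a hypothetical helper, and the column name `session_date` is a guess, so check it against the field documentation in includes/core/documentation before using it.

```javascript
// Hypothetical helper: builds the SQL for a daily aggregate on top of ga4_sessions.
// `ref` is any function that turns a model name into a table reference.
function buildDailySessionsQuery(ref) {
  return [
    "SELECT",
    "  session_date, -- assumed column name, not confirmed by the docs",
    "  COUNT(*) AS sessions",
    `FROM ${ref("ga4_sessions")}`,
    "GROUP BY session_date",
  ].join("\n");
}

// Stand-in for Dataform's ref() so the sketch runs outside Dataform.
const exampleRef = (name) =>
  `\`Dataform-package.superform_outputs_123456.${name}\``;

console.log(buildDailySessionsQuery(exampleRef));

// Inside a .js file under definitions/custom, Dataform's JavaScript API
// could use the helper like this (model name is made up):
// publish("daily_sessions_example", { type: "table" })
//   .query((ctx) => buildDailySessionsQuery((name) => ctx.ref(name)));
```

Keeping such a model under definitions/custom means a GA4Dataform update leaves it untouched.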
JavaScript Files
File | Description |
---|---|
core/default_config.js | Contains default configuration options that are used as a fallback if custom/config.js is not populated |
core/helpers.js | Contains all helper functions that are used to produce SQL code for different use cases |
custom/config.js | Contains all configuration options that can be used to customize how and what data gets queried. It will always take precedence over core/default_config.js |
core/extra/source_categories.json | Contains which source category a domain should be treated as. Used for Default Channel Grouping |
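The precedence between the two configuration files can be pictured as an object merge in which custom values win. The sketch below only illustrates that behavior; it is not the actual merge code in GA4Dataform, `resolveConfig` is a hypothetical name, and `optionA`/`optionB` are placeholder keys rather than real options.

```javascript
// Placeholder keys for illustration; they are NOT the real option names.
const defaultConfig = { optionA: "default-a", optionB: "default-b" };
const customConfig = { optionB: "custom-b" };

// Hypothetical helper: custom values override defaults; missing or empty
// custom config falls back to the defaults entirely.
function resolveConfig(defaults, custom) {
  // Later spreads win, so keys present in `custom` replace the defaults.
  return { ...defaults, ...(custom || {}) };
}

const resolved = resolveConfig(defaultConfig, customConfig);
console.log(resolved.optionA); // default-a (not set in custom, so the default is used)
console.log(resolved.optionB); // custom-b (custom takes precedence)
```

Note that this is a shallow merge; if the real configuration nests objects, overriding a single nested key would need a deep merge instead.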
Dataform Repository Structure
definitions
├── core
│ ├── 01_sources
│ │ ├── declarations.js
│ ├── 02_intermediate
│ │ ├── int_ga4_sessions.sqlx
│ ├── 03_outputs
│ │ ├── ga4_events.sqlx
│ │ ├── ga4_sessions.sqlx
│ ├── assertions
│ │ ├── assertion_logs.sqlx
│ │ ├── assertions_event_id_uniqueness.sqlx
│ │ ├── assertions_session_duration_validity.sqlx
│ │ ├── assertions_session_id_uniqueness.sqlx
│ │ ├── assertions_sessions_validity.sqlx
│ │ ├── assertions_tables_timeliness.sqlx
│ │ ├── assertions_transaction_id_completeness.sqlx
│ │ ├── assertions_user_pseudo_id_completeness.sqlx
│ ├── extra
│ │ ├── ga4
│ │ │ ├── source_categories.js
├── custom
│ ├── demo_daily_sessions_report.sqlx
│ ├── demo_diagnostics.sqlx
includes
├── core
│ ├── documentation
│ │ ├── ga4_events.json
│ │ ├── ga4_sessions.json
│ ├── extra
│ │ ├── source_categories.json
│ ├── default_config.js
│ ├── helpers.js
├── custom
│ ├── config.js
.gitignore
package-lock.json
package.json
workflow_settings.yaml
BigQuery Output
GA4Dataform writes its tables to three datasets in BigQuery:
superform_outputs_123456: stores the output tables that should be used for downstream queries
superform_quality_123456: stores the quality control results (assertions)
superform_transformations_123456: stores the intermediate and staging tables that are used during the build process
If you leave the default dataset names untouched, you will see the following structure:
Dataform-package (project)
├── superform_outputs_123456 (dataset)
│ ├── demo_daily_sessions_report (tables)
│ ├── demo_diagnostics
│ ├── ga4_events
│ ├── ga4_sessions
├── superform_quality_123456
│ ├── assertion_logs
│ ├── assertions_event_id_uniqueness
│ ├── assertions_session_duration_validity
│ ├── assertions_session_id_uniqueness
│ ├── assertions_sessions_validity
│ ├── assertions_tables_timeliness
│ ├── assertions_transaction_id_completeness
│ ├── assertions_user_pseudo_id_completeness
├── superform_transformations_123456
│ ├── int_ga4_sessions
│ ├── source_categories
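For downstream queries it can help to build fully qualified table names from this layout. The sketch below assumes the default dataset names shown above and treats the trailing number as a suffix parameter; `datasetNames` and `qualifiedTable` are hypothetical helpers, not functions from GA4Dataform.

```javascript
// Hypothetical helper: derives the three default dataset names from a suffix
// (the examples above use 123456).
function datasetNames(suffix) {
  return {
    outputs: `superform_outputs_${suffix}`,
    quality: `superform_quality_${suffix}`,
    transformations: `superform_transformations_${suffix}`,
  };
}

// Hypothetical helper: joins project, dataset, and table into a BigQuery-style ID.
function qualifiedTable(project, dataset, table) {
  return `${project}.${dataset}.${table}`;
}

const names = datasetNames("123456");
console.log(qualifiedTable("Dataform-package", names.outputs, "ga4_sessions"));
// Dataform-package.superform_outputs_123456.ga4_sessions
```

Only the tables in the outputs dataset are intended for downstream use; the quality and transformations datasets hold assertion results and build-time tables.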