ETL Jobs Made Easy: Using Meltano to Extract from GitHub and Load into BigQuery
Meltano is an open-source platform that simplifies ETL jobs and provides an easy-to-use interface for data extraction, transformation, and loading.
ETL (Extract, Transform, Load) jobs are critical for data management and analysis, as they allow data to be processed and moved from various sources into target destinations. In this post, we will explore how to use Meltano to extract data from GitHub and load it into BigQuery.
Meltano
Meltano is an open-source platform for managing extract, transform, and load (ETL) jobs. It provides a user-friendly interface for creating and scheduling data pipelines, allowing users to easily extract data from various sources, transform it as needed, and load it into target destinations for analysis and visualization. Meltano is built on top of popular open-source frameworks, such as Airflow, dbt, and Singer, and provides a wide range of plugins for connecting to various data sources and destinations. With its powerful yet intuitive features, Meltano is an excellent choice for businesses and data analysts looking to manage ETL jobs with ease.
Extractors and Loaders
Extractors and loaders are two critical components of an ETL job. Extractors are responsible for retrieving data from various sources, such as APIs, databases, or files, and converting it into a standardized format that can be used for processing and analysis. In this post, we will use the tap-github extractor to pull data from the GitHub API and emit it in JSONL format.
Loaders, on the other hand, are responsible for taking the extracted data and loading it into a target destination, such as a data warehouse, database, or file. In our example, the target destination will be BigQuery.
Meltano provides a wide range of extractors and loaders, including plugins for popular data sources such as Salesforce, Google Analytics, and Stripe, as well as for popular target destinations such as Redshift, Snowflake, and PostgreSQL. This allows users to easily manage ETL jobs for a variety of data sources and destinations using a single platform.
By using extractors and loaders, we can easily extract, transform, and load data from various sources into a target destination, making it easier to manage and analyze large datasets. Meltano’s wide range of plugins and user-friendly interface make it a great choice for managing ETL jobs of any size and complexity.
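Under the hood, Meltano's extractors and loaders speak the Singer protocol: the tap writes newline-delimited JSON messages (SCHEMA, RECORD, STATE) to stdout, and the target reads them from stdin. Here is a minimal sketch of that exchange; the issues stream and its fields are illustrative, not the exact tap-github output:

```python
import json

# A Singer tap writes one JSON message per line: SCHEMA describes the
# stream's shape, RECORD carries a row, STATE checkpoints progress.
messages = [
    {"type": "SCHEMA", "stream": "issues",
     "schema": {"properties": {"id": {"type": "integer"},
                               "state": {"type": "string"}}},
     "key_properties": ["id"]},
    {"type": "RECORD", "stream": "issues",
     "record": {"id": 1, "state": "closed"}},
    {"type": "STATE", "value": {"bookmarks": {"issues": {"since": "2021-01-01"}}}},
]

# Serialize the way a tap would; a target reads these lines back in.
lines = [json.dumps(m) for m in messages]
for line in lines:
    print(line)

# A target collects the RECORD messages before loading them.
records = [json.loads(l)["record"] for l in lines
           if json.loads(l)["type"] == "RECORD"]
```

Because every tap and target agrees on this message format, any extractor can be paired with any loader.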
Setting up Meltano
Before we get started, let’s ensure that we have Meltano installed on our system. Meltano can be installed via pip, the Python package installer, using the following command:
pip install meltano
Once Meltano is installed, we can create a new Meltano project using the following command:
meltano init <project-name>
This will create a new Meltano project in a directory with the specified project name.
Extracting Data from GitHub
In this example, we will extract data from the GitHub API using the Meltano tap-github plugin. The first step is to install the plugin using the following command:
meltano add extractor tap-github
You can choose from a variety of available source extractors, e.g. for MySQL, Postgres, HubSpot, Amazon S3, Azure Blob Storage, CSV, JSON, and many more.
Next, we need to configure the extractor. In Meltano, plugin configuration lives in the project's meltano.yml file, for example:

plugins:
  extractors:
    - name: tap-github
      config:
        start_date: "2021-01-01T00:00:00Z"
        repository: "<owner>/<repository>"
        access_token: "<access-token>"

The start_date setting specifies how far back to extract, the repository setting specifies the GitHub repository (in owner/repo form) from which to extract data, and the access_token setting holds the GitHub personal access token used for authentication. The exact setting names depend on the tap-github variant you installed (run meltano config tap-github list to see them), and secrets such as the access token can also be supplied via the CLI instead of being committed to meltano.yml:

meltano config tap-github set access_token <access-token>
Once the extractor has been configured, we can run it against a simple JSONL loader. First install the loader, then run the pipeline:

meltano add loader target-jsonl
meltano elt tap-github target-jsonl --job_id=github

This will run the tap-github extractor and write the extracted data as JSONL files (one file per stream) under the project's output/ directory, target-jsonl's default destination.
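Once the run finishes, each stream's .jsonl file can be inspected with a few lines of Python. The sketch below tallies issues by their state field; the file name and field mirror what tap-github's issues stream would produce but are assumptions here, so the demo writes its own sample file:

```python
import json
from collections import Counter
from pathlib import Path

def count_by_state(jsonl_path):
    """Tally issue records in a JSONL file by their `state` field."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            counts[record.get("state", "unknown")] += 1
    return counts

# Sample rows standing in for output/issues.jsonl, which
# target-jsonl would have written for us.
sample = Path("issues_sample.jsonl")
sample.write_text(
    '{"id": 1, "state": "closed"}\n'
    '{"id": 2, "state": "open"}\n'
    '{"id": 3, "state": "closed"}\n'
)
print(count_by_state(sample))  # 2 closed, 1 open
```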
Transforming Data
After extracting the data, we may need to perform some transformations on it before loading it into BigQuery. Meltano provides a variety of transformation plugins, including ones for manipulating JSON and CSV data.
In this example, let’s say we want to filter the extracted GitHub data to only include issues that were closed in the last 30 days. We can express this as a dbt model. First, we install the dbt transformer:

meltano add transformer dbt

Next, we create a model file in the project’s transform/models directory, e.g. closed_issues_last_30_days.sql (the source and model names here are illustrative and must match your dbt sources configuration):

select *
from {{ source('tap_github', 'issues') }}
where state = 'closed'
  and closed_at >= timestamp_sub(current_timestamp(), interval 30 day)

The first condition keeps only closed issues, and the second keeps only those closed within the last 30 days. Because transformations run inside the warehouse, they execute after loading, as part of an elt run with the --transform=run option.
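As a sanity check outside the warehouse, the same "closed in the last 30 days" rule is easy to apply in plain Python against extracted records. The field names (state, closed_at) follow the GitHub issues payload, and the reference date is fixed so the example is deterministic:

```python
from datetime import datetime, timedelta, timezone

def recently_closed(records, days=30, now=None):
    """Keep issue records closed within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    kept = []
    for rec in records:
        # Skip issues that are not closed or lack a closed_at timestamp.
        if rec.get("state") != "closed" or not rec.get("closed_at"):
            continue
        closed_at = datetime.strptime(rec["closed_at"], "%Y-%m-%dT%H:%M:%SZ")
        closed_at = closed_at.replace(tzinfo=timezone.utc)
        if closed_at >= cutoff:
            kept.append(rec)
    return kept

now = datetime(2021, 2, 1, tzinfo=timezone.utc)
issues = [
    {"id": 1, "state": "closed", "closed_at": "2021-01-25T12:00:00Z"},  # kept
    {"id": 2, "state": "closed", "closed_at": "2020-11-01T12:00:00Z"},  # too old
    {"id": 3, "state": "open", "closed_at": None},                      # not closed
]
print([r["id"] for r in recently_closed(issues, now=now)])  # [1]
```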
Loading into BigQuery
After defining our transformation, we use the target-bigquery plugin to load the data into BigQuery. The plugin authenticates with Google Cloud, and we configure the project ID and dataset that the data should be loaded into. To install the plugin, run the following command:

meltano add loader target-bigquery
After installing the plugin, we configure it with the destination's location in BigQuery: the project ID and the dataset to load into (exact setting names vary between target-bigquery variants):

{
  "project_id": "my-project",
  "dataset_id": "my-dataset"
}

Note that we do not have to define the destination table's schema by hand: Singer targets create and evolve tables from the JSON schema the extractor emits for each stream. For our GitHub data, that stream schema looks roughly like this:

{
  "type": "object",
  "properties": {
    "id": {"type": "string"},
    "name": {"type": "string"},
    "description": {"type": "string"},
    "url": {"type": "string"},
    "created_at": {"type": "string", "format": "date-time"},
    "updated_at": {"type": "string", "format": "date-time"}
  }
}
In this example, we are loading our GitHub data into a table called github_data in the my-dataset dataset of the my-project project. The table has six columns: id, name, description, url, created_at, and updated_at.
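To see how a target turns such a stream schema into BigQuery columns, here is a much-simplified sketch of the type mapping (real targets also handle nullability, nested records, and more types; the mapping below is an assumption for illustration only):

```python
# Map JSON-schema property types to BigQuery column types (simplified).
JSON_TO_BQ = {"string": "STRING", "integer": "INT64",
              "number": "FLOAT64", "boolean": "BOOL"}

def bigquery_columns(json_schema):
    """Derive (column name, BigQuery type) pairs from a stream's JSON schema."""
    cols = []
    for name, prop in json_schema["properties"].items():
        if prop.get("format") == "date-time":
            bq_type = "TIMESTAMP"  # date-time strings become TIMESTAMP columns
        else:
            bq_type = JSON_TO_BQ.get(prop["type"], "STRING")
        cols.append((name, bq_type))
    return cols

schema = {"properties": {
    "id": {"type": "string"},
    "created_at": {"type": "string", "format": "date-time"},
}}
print(bigquery_columns(schema))  # [('id', 'STRING'), ('created_at', 'TIMESTAMP')]
```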
With the loader configured, we can use the meltano elt command to run the full pipeline and load the transformed data into BigQuery. Here's an example of the command:

meltano elt tap-github target-bigquery --job_id=my-github-to-bigquery-job --transform=run

In this command, we specify the source and target plugins (tap-github and target-bigquery, respectively), a job ID so Meltano can track state between runs, and --transform=run so the project's transformations execute as part of the pipeline. Once the job is complete, our GitHub data will be loaded into the specified table in BigQuery, ready for analysis and visualization.
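A quick way to confirm the load succeeded is to count rows in the destination table. The snippet below only builds the query string; actually running it requires the google-cloud-bigquery package and GCP credentials, and the project, dataset, and table names are the hypothetical ones used above:

```python
def row_count_query(project, dataset, table):
    """Build a COUNT(*) query for the table the pipeline loaded."""
    return f"SELECT COUNT(*) AS n FROM `{project}.{dataset}.{table}`"

query = row_count_query("my-project", "my-dataset", "github_data")
print(query)

# To execute it against BigQuery (requires credentials):
# from google.cloud import bigquery
# client = bigquery.Client(project="my-project")
# n = list(client.query(query).result())[0]["n"]
```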
Conclusion
In conclusion, Meltano is an excellent platform for managing ETL jobs and loading data into BigQuery. Its user-friendly interface, powerful plugins, and support for popular open-source frameworks make it a flexible and versatile tool for businesses and data analysts looking to manage their data pipelines with ease.
By using Meltano’s tap-github and target-bigquery plugins, we were able to extract data from GitHub, transform it using a custom filter, and load it into BigQuery for analysis and visualization. This process demonstrated how Meltano can simplify the ETL process and make it accessible to users of all levels of technical expertise.
Next Steps
Next steps for using Meltano could include exploring other plugins for connecting to different data sources and destinations, such as Salesforce, HubSpot, or Snowflake, and experimenting with different transformations and analysis tools. With its vast array of plugins and integrations, the possibilities for data management and analysis with Meltano are virtually endless.