3 posts tagged with "etl"

· 8 min read
Fritz Larco

Cloud Data Warehouses

In the past few years, we've seen rapid growth in the usage of cloud data warehouses (as well as the "warehouse-first" paradigm). Two popular cloud DWH platforms are BigQuery and Snowflake. Check out the chart below to see their evolution over time.

Image: Gartner via Adam Ronthal (@aronthal) on Twitter.

BigQuery, standing at #4 as of 2021, is a fully-managed, serverless data warehouse service offered by Google Cloud Platform (GCP). It enables easy and scalable analysis over petabytes of data and has long been known for its ease of use and maintenance-free nature.

Snowflake is a similar service offered by the company Snowflake Inc. One of the principal differences is that Snowflake lets you host the instance in Amazon Web Services (AWS), Microsoft Azure, or GCP. This is a great advantage if you are already established in a non-GCP environment.

Exporting and Loading the data

It is sometimes necessary or desirable to copy data from a BigQuery environment into a Snowflake environment. Since neither service offers an integrated function to do this easily, let's break down the logical steps needed to move this data properly. For the sake of our example, we will assume that our destination Snowflake environment is hosted on AWS.

Step By Step Procedure

In order to migrate data from BigQuery to Snowflake (AWS), these are the essential steps:

  1. Identify the table or query and run an EXPORT DATA OPTIONS statement to export it to Google Cloud Storage (GCS).
  2. Run a script on a VM or local machine to copy the GCS data into Snowflake's internal stage. We could also read straight from GCS with a storage integration, which may be preferable for your use case, but that involves another layer of secure access configuration.
  3. Manually generate the CREATE TABLE DDL with the correct column data types and execute it in Snowflake.
  4. Execute a COPY query in Snowflake to import the staged files.
  5. Optionally clean up (delete) the temporary data in GCS and the internal stage.

Image: Steps to manually export from BigQuery to Snowflake.

As shown above, several steps are involved, and several independent systems need to be interacted with. This can be cumbersome to automate, especially generating the correct DDL (#3) with the proper column types in the destination system (which I personally find the most burdensome; try doing it for tables with 50+ columns).
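To make the manual path concrete, here is a rough sketch of what those steps could look like from the shell, assuming the bq, gsutil and snowsql CLIs are installed and authenticated; the bucket, stage, table and column names below are placeholders for illustration only:

# 1. Export the BigQuery table (or query) to GCS as compressed CSV
#    (bucket and paths are placeholders)
bq query --use_legacy_sql=false '
  EXPORT DATA OPTIONS (
    uri = "gs://my-export-bucket/activity/*.csv.gz",
    format = "CSV",
    compression = "GZIP",
    header = true,
    overwrite = true
  ) AS
  SELECT * FROM public.activity'

# 2. Pull the files down, then push them to a Snowflake internal (user) stage
mkdir -p /tmp/export
gsutil -m cp "gs://my-export-bucket/activity/*.csv.gz" /tmp/export/
snowsql -q "PUT file:///tmp/export/*.csv.gz @~/activity"

# 3. Hand-written DDL -- the column names/types here are placeholders and
#    must mirror the BigQuery schema exactly
snowsql -q "CREATE OR REPLACE TABLE public.activity (
              id INTEGER, user_id INTEGER, event_type TEXT, created_at TIMESTAMP)"

# 4. Import the staged files
snowsql -q "COPY INTO public.activity FROM @~/activity
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 COMPRESSION = GZIP)"

# 5. Optional cleanup of the temporary data
gsutil -m rm "gs://my-export-bucket/activity/*.csv.gz"
snowsql -q "REMOVE @~/activity"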

Fortunately, there is an easier way to do this, using a nifty tool called Sling. Sling is a data integration tool which allows easy and efficient movement of data (Extract & Load) from/to databases, storage platforms and SaaS applications. There are two ways of using it: Sling CLI and Sling Cloud. We will perform the same procedure as above, but simply by providing inputs to Sling, and it will handle the intricate steps for us automatically!

Using Sling CLI

If you are a fan of the command line, Sling CLI is for you. It is built in Go (which makes it super fast) and works with files, databases and various SaaS endpoints. It can also work with Unix pipes (reading from standard input and writing to standard output). We can quickly install it from our shell:

# On Mac
brew install slingdata-io/sling/sling

# On Windows Powershell
scoop bucket add org https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# Using Python Wrapper via pip
pip install sling

Please see here for other installation options (including Linux). There is also a Python wrapper library, which is useful if you prefer interacting with Sling inside of Python.

Once installed, we should be able to run the sling command, which should give us this output:

sling - An Extract-Load tool | https://slingdata.io
Slings data from a data source to a data target.
Version 0.86.52

Usage:
  sling [conns|run|update]

Subcommands:
  conns    Manage local connections
  run      Execute an ad-hoc task
  update   Update Sling to the latest version

Flags:
     --version   Displays the program version string.
  -h --help      Displays help with available flag, subcommand, and positional value parameters.

There are many ways to configure tasks, but for the scope of this article, we first need to add connection credentials for BigQuery and Snowflake (a one-time chore). We can do this by opening the file ~/.sling/env.yaml and adding the credentials, which should look like this:

~/.sling/env.yaml

connections:
  BIGQUERY:
    type: bigquery
    project: sling-project-123
    location: US
    dataset: public
    gc_key_file: ~/.sling/sling-project-123-ce219ceaef9512.json
    gc_bucket: sling_us_bucket # this is optional but recommended for bulk export.

  SNOWFLAKE:
    type: snowflake
    username: fritz
    password: my_pass23
    account: abc123456.us-east-1
    database: sling
    schema: public

Great, now let's test our connections:

$ sling conns list
+-----------+----------------+----------------+
| CONN NAME | CONN TYPE      | SOURCE         |
+-----------+----------------+----------------+
| BIGQUERY  | DB - BigQuery  | sling env yaml |
| SNOWFLAKE | DB - Snowflake | sling env yaml |
+-----------+----------------+----------------+

$ sling conns test BIGQUERY
6:42PM INF success!

$ sling conns test SNOWFLAKE
6:42PM INF success!

Fantastic, now that we have our connections set up, we can run our task:

$ sling run --src-conn BIGQUERY --src-stream "select user.name, activity.* from public.activity join public.user on user.id = activity.user_id where user.type != 'external'" --tgt-conn SNOWFLAKE --tgt-object 'public.activity_user' --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_user
11:38AM INF created table public.activity_user
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded

Wow, that was easy! Sling automatically performed all the steps we described earlier. We can even export the Snowflake data back to our shell's stdout (in CSV format) by providing just the table identifier (public.activity_user) for the --src-stream flag, and count the lines to validate our data:

$ sling run --src-conn SNOWFLAKE --src-stream public.activity_user --stdout | wc -l
11:39AM INF connecting to source database (snowflake)
11:39AM INF reading from source database
11:39AM INF writing to target stream (stdout)
11:39AM INF wrote 77668 rows
11:39AM INF execution succeeded
77669 # CSV output includes a header row (77668 + 1)

Using Sling Cloud

Now let's do the same with the Sling Cloud app. Sling Cloud uses the same engine as Sling CLI, except that it is a fully hosted platform to run all your Extract-Load needs at a competitive price (check out our pricing page). With Sling Cloud, we can:

  • Collaborate with team members
  • Manage multiple workspaces/projects
  • Schedule Extract-Load (EL) tasks to run at an interval or at fixed times (CRON)
  • Collect and analyze logs for debugging
  • Receive error notifications via email or Slack
  • Run from worldwide regions, or use Self-Hosted Mode if desired (where Sling Cloud is the orchestrator)
  • Use an intuitive user interface (UI) for quick setup and execution

The first thing to do is sign up for a free account here. Once logged in, we can select Cloud Mode (more on Self-Hosted Mode later). Now we can perform steps similar to the ones above, but using the Sling Cloud UI:

Adding the BigQuery Connection

Steps with Screenshots

Go to Connections, click New Connection, select Big Query.

bigquery-connection-step-1

Input the name BIGQUERY, enter your credentials, and upload your Google service account JSON key file. Click Test to test connectivity, then Save.

bigquery-connection-step-2

Adding the Snowflake Connection

Steps with Screenshots

Click New Connection, select Snowflake.

snowflake-connection-step-1

Input the name SNOWFLAKE and your credentials. Click Test to test connectivity, then Save.

snowflake-connection-step-2

Create Replication

Steps with Screenshots

Go to Replications, click New Replication, select Big Query as source and Snowflake as destination. Click Next.

replication-connection-step-1

Adjust the Target Object Name Pattern if desired and click Create.

replication-connection-step-2

Create & Run Task

Steps with Screenshots

Go to the Streams tab and click New SQL Stream, since we are using a custom SQL query as the source data. Give it a name (activity_user), paste the SQL query, and click Ok.

create-task-step-1

create-task-step-2

Now that we have a stream task ready, we can hit the Play icon to trigger it.

run-task-step-1

Once the task execution completes, we can inspect the logs.

run-task-step-2

That's it! We have our task set up, and we can re-run it on demand or put it on a schedule. As you may notice, Sling Cloud handles a few more things for us and gives UI-oriented users a better experience compared to Sling CLI.

Conclusion

We are in an era where data is gold, and moving data from one platform to another shouldn't be difficult. As we have demonstrated, Sling offers a powerful alternative by reducing the friction associated with data integration. We'll cover exporting from Snowflake and loading into BigQuery in another post.

· 6 min read
Fritz Larco


Introduction to Sling CLI

Sling CLI is a command line tool which allows easy and efficient movement of data (Extract & Load) from/to databases, storage platforms and SaaS applications. It's trivial to get started: you can simply run pip install sling if you have Python's pip installed, or you can download the binary for your machine here.

Connection Credentials

In order to use Sling, we must first configure connection credentials. Sling CLI looks for them in various places, which allows a "plug & play" experience if you are already using another tool such as dbt or have connection URLs set in environment variables. It is, however, recommended to use Sling's env.yaml file, as it provides a more consistent and flexible experience.

Sling Env File

The first time you run the sling command, the .sling folder is created in the current user's home directory (~/.sling), which in turn holds a file called env.yaml. The structure of Sling's env file is simple: you put your connections' credentials under the connections key, as shown below:

connections:
  marketing_pg:
    url: 'postgres://...'
    ssh_tunnel: 'ssh://...' # optional

  # or dbt profile styled
  marketing_pg:
    type: postgres
    host: [hostname]
    user: [username]
    password: [password]
    port: [port]
    dbname: [database name]
    schema: [dbt schema]
    ssh_tunnel: 'ssh://...'

  finance_bq:
    type: bigquery
    method: service-account
    project: [GCP project id]
    dataset: [the name of your dbt dataset]
    keyfile: [/path/to/bigquery/keyfile.json]

# global variables, available to all connections at runtime (optional)
variables:
  aws_access_key: '...'
  aws_secret_key: '...'

Please see here for all the accepted connection types and the credential fields each one requires.

When using the sling conns list command with Sling Env credentials, the SOURCE column will show as sling env yaml.

Environment variables

If you’d rather use environment variables, it suffices to set them in your shell environment the usual way:

# Mac / Linux
export MY_PG='postgresql://user:mypassw@pg.host:5432/db1'
export MY_SNOWFLAKE='snowflake://user:mypassw@sf.host/db1'
export ORACLE_DB='oracle://user:mypassw@orcl.host:1521/db1'

# Windows PowerShell
$env:MY_PG = 'postgresql://user:mypassw@pg.host:5432/db1'
$env:MY_SNOWFLAKE = 'snowflake://user:mypassw@sf.host/db1'
$env:ORACLE_DB = 'oracle://user:mypassw@orcl.host:1521/db1'

When using the sling conns list command with environment variables, the SOURCE column will show as env variable.

DBT Profiles

dbt is another popular tool that many data professionals use on a daily basis, and supporting existing local profiles allows easy cross-use. The typical location for dbt credentials is the ~/.dbt/profiles.yml file. See here for more details.

If you have dbt credentials in place and use the sling conns list command, the SOURCE column will show as dbt profiles yaml.

The conns Sub-Command

Now that you have credentials set, sling offers a conns sub-command to interact with the connections. We can perform the following operations: list, test and discover.

$ sling conns -h
conns - Manage local connections

Usage:
  conns [discover|list|test]

Subcommands:
  discover   list available streams in connection
  list       list local connections detected
  test       test a local connection

Flags:
     --version   Displays the program version string.
  -h --help      Displays help with available flag, subcommand, and positional value parameters.

Listing Connections

It's convenient to list all the connections available in our environment. We can simply run the sling conns list command. Here is an example:

$ sling conns list
+---------------+-----------------+-------------------+
| CONN NAME     | CONN TYPE       | SOURCE            |
+---------------+-----------------+-------------------+
| AWS_S3        | FileSys - S3    | sling env yaml    |
| AZURE_STORAGE | FileSys - Azure | sling env yaml    |
| BIGQUERY      | DB - BigQuery   | sling env yaml    |
| BIONIC_DB1    | DB - PostgreSQL | dbt profiles yaml |
| BTD_S3        | FileSys - S3    | sling env yaml    |
| CLICKHOUSE    | DB - Clickhouse | sling env yaml    |
| DEMO_POSTGRES | DB - PostgreSQL | sling env yaml    |
| GITHUB_DBIO   | API - GitHub    | sling env yaml    |
| NOTION        | API - Notion    | sling env yaml    |
| SNOWFLAKE     | DB - Snowflake  | env variable      |
| STEAMPIPE     | DB - PostgreSQL | sling env yaml    |
+---------------+-----------------+-------------------+

Testing Connections

The Sling CLI tool also allows testing connections. Once we know the connection name, we can use the sling conns test command:

$ sling conns test -h
test - test a local connection

Usage:
  test [name]

Positional Variables:
  name   The name of the connection to test (Required)

Flags:
     --version   Displays the program version string.
  -h --help      Displays help with available flag, subcommand, and positional value parameters.

Here is an actual example:

$ sling conns test MSSQL
6:42PM INF success!

Discovering Connection Streams

This is another nifty sub-command, sling conns discover, which lets us see which data streams are available for Sling to read from a particular connection.

$ sling conns discover -h
discover - list available streams in connection

Usage:
  discover [name]

Positional Variables:
  name   The name of the connection to test (Required)

Flags:
     --version   Displays the program version string.
  -h --help      Displays help with available flag, subcommand, and positional value parameters.
  -f --filter    filter stream name by pattern (e.g. account_*)
     --folder    discover streams in a specific folder (for file connections)
     --schema    discover streams in a specific schema (for database connections)

For database connections, it will list the available tables and views. For storage connections, it will list the non-recursive file objects located in the specified source folder. For SaaS/API connections, it will list all the available objects which exist for consumption. Below are some examples.

Database Connections

$ sling conns discover CLICKHOUSE
6:57PM INF Found 68 streams:
- "default"."docker_logs"
- "default"."sling_docker_logs"
- "system"."aggregate_function_combinators"
- "system"."asynchronous_metric_log"
- "system"."asynchronous_metrics"
- "system"."build_options"
- "system"."clusters"
....

If we want to filter for a specific schema, we can do:

$ sling conns discover CLICKHOUSE --schema default
8:29PM INF Found 2 streams:
- "default"."docker_logs"
- "default"."sling_docker_logs"

SaaS Connections

$ sling conns discover NOTION
6:58PM INF Found 4 streams:
- blocks
- databases
- pages
- users

Storage Connections

$ sling conns discover AWS_S3
6:52PM INF Found 7 streams:
- s3://my-sling-bucket/logging/
- s3://my-sling-bucket/part.01.0001.csv
- s3://my-sling-bucket/sling/
- s3://my-sling-bucket/temp/
- s3://my-sling-bucket/test.fs.write/
- s3://my-sling-bucket/test/
- s3://my-sling-bucket/test_1000.csv

If we want to see the files in a sub-folder, we can do this:

$ sling conns discover AWS_S3 --folder s3://my-sling-bucket/logging/
6:55PM INF Found 1 streams:
- s3://my-sling-bucket/logging/1/1.log.gz

Running EL Tasks

Now that your connections are set, you are ready to run some Extract and Load tasks! We cover this in detail in a separate post; you can read about it here. From the command line, you can also run sling run -e, which will print a bunch of examples.
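As a quick taste, an ad-hoc task using connections like the ones listed above could look something like this (the connection names come from the earlier listing; the table names are placeholders, and the flags are the same ones shown in the BigQuery-to-Snowflake post):

# Full refresh of a table from Postgres into Snowflake
sling run \
  --src-conn DEMO_POSTGRES \
  --src-stream public.users \
  --tgt-conn SNOWFLAKE \
  --tgt-object public.users \
  --mode full-refresh

# Or dump the same table to stdout as CSV and save it to a file
sling run --src-conn DEMO_POSTGRES --src-stream public.users --stdout > users.csv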

Conclusion

All things considered, Sling CLI makes it easy to manage and interact with various types of connections from your shell. If you have any questions or suggestions, feel free to contact us here. Furthermore, make sure to check out Sling Cloud, a cloud service that handles your various Extract & Load needs with a competitive pricing model.

· One min read
Fritz Larco


Sling is a modern, Go-powered data integration tool for extracting and loading data from popular data sources to destinations with high performance and ease.

Why Sling?

  • Blazing-fast performance - The core engine is written in Go and adopts a streaming design, making it super efficient by holding minimal data in memory.
  • Replicate data quickly - Easily replicate data from a source database, file or SaaS connection to a destination database or file.
  • Transparent & low cost - Sling operates on an efficient and low-cost model. Our goal is to be transparent about the cost of using our platform.

Learn more about how Sling works
Sign up for Sling