`dbcrossbar` Guide

dbcrossbar is an open source tool that copies large, tabular datasets between many different databases and storage formats. Data can be copied from any source to any destination.

An example

If we have a CSV file my_table.csv containing data:

id,name,quantity
1,Blue widget,10
2,Red widget,50

And a file my_table.sql containing a table definition:

CREATE TABLE my_table (
    id INT NOT NULL,
    name TEXT NOT NULL,
    quantity INT NOT NULL
);

Then we can use these to create a PostgreSQL table:

dbcrossbar cp \
    --if-exists=overwrite \
    --schema=postgres-sql:my_table.sql \
    csv:my_table.csv \
    'postgres://postgres@127.0.0.1:5432/postgres#my_table'

If we want to use this data to update an existing table in BigQuery, we can upsert into BigQuery using the id column:

dbcrossbar config add temporary gs://$GS_TEMP_BUCKET
dbcrossbar config add temporary bigquery:$GCLOUD_PROJECT:temp_dataset
dbcrossbar cp \
    --if-exists=upsert-on:id \
    'postgres://postgres@127.0.0.1:5432/postgres#my_table' \
    bigquery:$GCLOUD_PROJECT:my_dataset.my_table

This will stream the data out of PostgreSQL, upload it to $GS_TEMP_BUCKET, import it into a temporary BigQuery table, and run an appropriate MERGE command. The config add temporary commands tell dbcrossbar what cloud bucket and BigQuery dataset should be used for temporary files and tables, respectively.

Notice that we don't need to specify --schema, because dbcrossbar will automatically translate the PostgreSQL column types to corresponding BigQuery types.

Credits

dbcrossbar is generously supported by Faraday and by open source contributors. Please see the credits for more information.

Features & philosophy

dbcrossbar is designed to do a few things well. Typically, dbcrossbar is used for loading raw data, and for moving data back and forth between production databases and data warehouses. It supports a few core features:

Copying tables.
Counting records in tables.
Converting between different table schema formats, including PostgreSQL CREATE TABLE statements and BigQuery schema JSON.

dbcrossbar offers a number of handy features:

A single static binary on Linux, with no dependencies.
A stream-based architecture that limits the use of RAM and requires no temporary files.
Support for appending, overwriting or upserting into existing tables.
Support for selecting records using --where.

dbcrossbar also supports a rich variety of portable column types:

Common types, including booleans, dates, timestamps, floats, integers, and text.
UUIDs.
JSON.
GeoJSON.
Arrays.

Non-features

The following features are explicitly excluded from dbcrossbar's mission:

Data cleaning and transformation.
Fixing invalid column names.
Copying multiple tables at time.
Automatically copying constraints, foreign keys, etc.

If you need these features, then take a look at tools like scrubcsv and pgloader.

Installing

Pre-built binaries for dbcrossbar are available on GitHub. These currently include:

Fully-static Linux x86_64 binaries, which should work on any modern distribution (including Alpine Linux containers).
MacOS X binaries. Note that these are unsigned, and you may need to use xattr -d com.apple.quarantine dbcrossbar and spctl --add dbcrossbar to make them runnable.

Windows binaries are not available at this time, but it may be possible to build them with a little work.

Required tools

To use the S3 and RedShift drivers, you will need to install the AWS CLI tools. We plan to replace the AWS CLI tools with native Rust libraries before the 1.0 release.

Installing using `cargo`

You can also install dbcrossbar using cargo. First, you will need to make sure you have the necessary C dependencies installed:

# Ubuntu Linux (might be incomplete).
sudo apt install build-essential

Then, you can install using cargo:

cargo install dbcrossbar

Building from source

The source code is available on GitHub. First, install the build dependencies as described above. Then run:

git clone https://github.com/dbcrossbar/dbcrossbar.git
cd dbcrossbar
cargo build --release

This will create target/release/dbcrossbar.

How it works

dbcrossbar uses pluggable input and output drivers, allowing any input to be copied to any output:

Parallel data streams

Internally, dbcrossbar uses parallel data streams. If we copy s3://example/ to csv:out/ using --max-streams=4, this will run up to 4 copies in parallel:

As soon as one stream finishes, a new one will be started:

dbcrossbar accomplishes this using a stream of CSV streams. This allows us to make extensive use of backpressure to control how data flows through the system, eliminating the need for temporary files. This makes it easier to work with 100GB+ CSV files and 1TB+ datasets.

Shortcuts

When copying between certain drivers, dbcrossbar supports "shortcuts." For example, it can load data directly from Google Cloud Storage into BigQuery.

Multi-threaded, asynchronous Rust

dbcrossbar is written using asynchronous Rust, and it makes heavy use of a multi-threaded worker pool. Internally, it works something like a set of classic Unix pipelines running in parallel. Thanks to Rust, it has been possible to get native performance and multithreading without spending too much time debugging.

CSV interchange format

Internally, dbcrossbar converts all data into CSV streams. For many standard types, all input drivers are required to provide byte-for-byte identical CSV data:

id,test_bool,test_date,test_int16,test_int32,test_int64,test_text,test_timestamp_without_time_zone,test_timestamp_with_time_zone,test_uuid,select,testCapitalized,test_enum
1,t,1969-07-20,-32768,-2147483648,-9223372036854775808,hello,1969-07-20T20:17:39,1969-07-20T20:17:39Z,084ec3bb-3193-4ffb-8b74-99a288e8432c,,,red
2,f,2001-01-01,32767,2147483647,9223372036854775807,,,,,,,green
3,,,,,,,,,,,,

For more complex types such as arrays, structs, JSON, and GeoJSON data, we embed JSON into the CSV file:

test_null,test_not_null,test_bool,test_bool_array,test_date,test_date_array,test_float32,test_float32_array,test_float64,test_float64_array,test_geojson,test_geojson_3857,test_int16,test_int16_array,test_int32,test_int32_array,test_int64,test_int64_array,test_json,test_text,test_text_array,test_timestamp_without_time_zone,test_timestamp_without_time_zone_array,test_timestamp_with_time_zone,test_timestamp_with_time_zone_array,test_uuid,test_uuid_array,test_enum
,hi,t,"[true,false]",1969-07-20,"[""1969-07-20""]",1e+37,"[1e-37,0,100.125,1e+37]",1e+37,"[1e-37,0,1000.125,1e+37]","{""type"":""Point"",""coordinates"":[-71,42]}","{""type"":""Point"",""coordinates"":[-71,42]}",16,"[-32768,0,32767]",32,"[-2147483648,0,2147483647]",64,"[""-9223372036854775808"",""0"",""9223372036854775807""]","{""x"": 1, ""y"": 2}",hello,"[""hello"",""""]",1969-07-20T20:17:39.5,"[""1969-07-20T20:17:39.5""]",1969-07-20T20:17:39.5Z,"[""1969-07-20T20:17:39.5Z""]",084ec3bb-3193-4ffb-8b74-99a288e8432c,"[""084ec3bb-3193-4ffb-8b74-99a288e8432c""]",red

Tricks for preparing CSV data

If your input CSV files use an incompatible format, there are several things that might help. If your CSV files are invalid, non-standard, or full of junk, then you may be able to use scrubcsv or xsv to fix the worst problems.

If you need to clean up your data manually, then you may want to consider using dbcrossbar to load your data into BigQuery, and set your columns to type STRING. Once this is done, you can parse and normalize your data quickly using SQL queries.

Portable table schema

Internally, dbcrossbar uses a portable table "schema" format. This provides a common ground between PostgreSQL's CREATE TABLE statements, BigQuery's JSON schemas, and equivalent formats for other databases. For more information, see:

All table schemas and column types are converted into the portable format and then into the appropriate destination format.

Normally, you won't need to work with this schema format directly, because dbcrossbar can parse BigQuery schemas, PostgreSQL CREATE TABLE statments, and several other popular schema formats. It can also read schemas directly from some databases. See the conv command for details.

Example schema

{
    "named_data_types": [
        {
            "name": "format",
            "data_type": {
                "one_of": [
                    "gif",
                    "jpeg"
                ]
            }
        }
    ],
    "tables": [
        {
            "name": "images",
            "columns": [
                {
                    "name": "id",
                    "is_nullable": false,
                    "data_type": "uuid"
                },
                {
                    "name": "url",
                    "is_nullable": false,
                    "data_type": "text"
                },
                {
                    "name": "format",
                    "is_nullable": true,
                    "data_type": {
                        "named": "format"
                    }
                },
                {
                    "name": "metadata",
                    "is_nullable": true,
                    "data_type": "json"
                },
                {
                    "name": "thumbnails",
                    "is_nullable": true,
                    "data_type": {
                        "array": {
                            "struct": [
                                {
                                    "name": "url",
                                    "is_nullable": false,
                                    "data_type": "text"
                                },
                                {
                                    "name": "width",
                                    "data_type": "float64",
                                    "is_nullable": false
                                },
                                {
                                    "name": "height",
                                    "data_type": "float64",
                                    "is_nullable": false
                                }
                            ]
                        }
                    }
                }
            ]
        }
    ]
}

Schema properties

named_data_types (experimental): Named data types. These are used to declare custom column types. They're analogous to a Postgres CREATE TYPE statement, or a TypeScript interface or type statement. Two types with different names but the same layout are considered to be different types (nominal typing). Many database drivers place restrictions on these types for now, but we hope to relax those restrictions in the future.
tables: A list of table definitions. For now, this must contain exactly one element.

Table properties

name: The name of this table. This is normally only used when serializing to schema formats that require a table name.
columns: A list of columns in the table.

Column properties

name: The name of the column.
is_nullable: Can the column contain NULL values?
data_type: The type of data stored in the column.

Data types

The data_type field can contain any of:

{ "array": element_type }: An array of element_type values.
"bool": A boolean value.
"date": A date, with no associated time value.
"decimal": A decimal integer (can represent currency, etc., without rounding errors).
"float32": A 32-bit floating point number.
"float64": A 64-bit floating point number.
{ "geo_json": srid }: Geodata in GeoJSON format, using the specified SRID, to specify the spatial reference system.
"int16": A 16-bit signed integer.
"int32": A 32-bit signed integer.
"int64": A 64-bit signed integer.
"json": An arbitrary JSON value.
{ "named": name }: A named type from the named_types list in this schema.
{ "one_of": string_list }: One of the specified string values. This may be represented a string enumeration by certain drivers, or as a "categorical" value in machine-learning systems (as opposed to a free-form textual value). NOTE: In many cases, it would be more efficient to have a separate table with an (id, value) entry for each enum value, and to refer to it using a foreign key. one_of is most useful when importing cleaned data, and when working with databases that support efficient string enumerations.
{ "struct": fields }: A structure with a list of specific, named fields. Each field has the following properties:
- name: The name of the field.
- is_nullable: Can the field contain NULL values?
- data_type: The type of data stored in the field.
"text": A string.
"timestamp_without_time_zone": A date and time without an associated timezone.
"timestamp_with_time_zone": A date and time with an associated timezone.
"uuid": A UUID value.

Configuring `dbcrossbar`

dbcrossbar can read information from a configuration directory. By default, this can be found at:

Linux: ~/.config
MacOS: ~/Library/Preferences

To override this default location, you can set DBCROSSBAR_CONFIG_DIR to point to an alternate configuration directory.

If a file dbcrossbar.toml appears in this directory, dbcrossbar will read its configuration from that file. Other files may be placed in this directory, including certain local credential files.

Modifying the configuration file

You can modify the dbcrossbar.toml file from the command line using the config subcommand. For example:

dbcrossbar config add temporary s3://example/temp/
dbcrossbar config rm temporary s3://example/temp/

Using config add temporary allows you to specify default values for --temporary flags. You can still override specific defaults by passing --temporary to commands that use it.

Commands

dbcrossbar supports four main subcommands:

dbcrossbar cp: Copy tabular data.
dbcrossbar count: Count records.
dbcrossbar schema conv: Convert table schemas between databases.

For more information, type dbcrossbar --help or dbcrossbar $CMD --help.

Not all drivers support all the features of each command. To see the available drivers and what commands they support, run dbcrossbar features and dbcrossbar features $DRIVER_NAME.

`cp`: Copying tables

The cp command copies tabular data from a source location to a destination location. For example, we can copy a CSV file into PostgreSQL, replacing any existing table:

dbcrossbar cp \
    --if-exists=overwrite \
    --schema=postgres-sql:my_table.sql \
    csv:my_table.csv \
    'postgres://postgres@127.0.0.1:5432/postgres#my_table'

Or we copy data from PostgreSQL and upsert it into a BigQuery table:

dbcrossbar config add temporary gs://$GS_TEMP_BUCKET
dbcrossbar config add temporary bigquery:$GCLOUD_PROJECT:temp_dataset
dbcrossbar cp \
    --if-exists=upsert-on:id \
    'postgres://postgres@127.0.0.1:5432/postgres#my_table' \
    bigquery:$GCLOUD_PROJECT:my_dataset.my_table

Command-line help


Usage: dbcrossbar cp [OPTIONS] <FROM_LOCATOR> <TO_LOCATOR>

Arguments:
  <FROM_LOCATOR>  The input table
  <TO_LOCATOR>    The output table

Options:
      --if-exists <IF_EXISTS>      One of `error`, `overwrite`, `append` or `upsert-on:COL` [default:
                                   error]
      --schema <SCHEMA>            The schema to use (defaults to input table schema)
      --temporary <TEMPORARIES>    Temporary directories, cloud storage buckets, datasets to use during
                                   transfer (can be repeated)
      --stream-size <STREAM_SIZE>  Specify the approximate size of the CSV streams manipulated by
                                   `dbcrossbar`. This can be used to split a large input into multiple
                                   smaller outputs. Actual data streams may be bigger or smaller
                                   depending on a number of factors. Examples: "100000", "1Gb"
      --from-arg <FROM_ARGS>       Pass an extra argument of the form `key=value` to the source driver
      --from-format <FROM_FORMAT>  For directory- and file-like data sources, the format to assume. If
                                   not specified, `dbcrossbar` will use the file extension to guess the
                                   format
      --to-arg <TO_ARGS>           Pass an extra argument of the form `key=value` to the destination
                                   driver
  -F, --to-format <TO_FORMAT>      For directory-like data destinations, the format to use. If not
                                   specified, `dbcrossbar` will use the destination file extension (if
                                   provided) or `csv`
      --where <WHERE_CLAUSE>       SQL where clause specifying rows to use
  -J, --max-streams <MAX_STREAMS>  How many data streams should we attempt to copy in parallel? [default:
                                   4]
      --display-output-locators    Display where we wrote our output data
  -h, --help                       Print help

EXAMPLE LOCATORS:
    postgres://localhost:5432/db#table
    bigquery:project:dataset.table

Flags

Not all command-line options are supported by all drivers. See the chapter on each driver for details.

`--where`

Specify a WHERE clause to include in the SQL query. This can be used to select a subset of the source rows.

`--from-arg`

This can be used to specify driver-specific options for the source driver. See the chapter for that driver.

`--if-exists=error`

If the destination location already contains data, exit with an error.

`--if-exists=append`

If the destination location already contains data, append the new data.

`--if-exists=overwrite`

If the destination location already contains data, replace it with the new data.

`--if-exists=upset-on:COL1,..`

For every row in the new data:

If a row with a matching col1, col2, etc., exists, use the new data to update that row.
If no row matching col1, col2, etc., exists, then insert the new row.

The columns col1, col2, etc., must be marked as NOT NULL.

`--schema`

By default, dbcrossbar will use the schema of the source table. But when this can't be inferred automatically, --schema can be used to specify a table schema:

--schema=postgres-sql:my_table.sql: A PostgreSQL CREATE TABLE statement.
--schema=bigquery-schema:my_table.json: A BigQuery JSON schema.
--schema=dbcrossbar-schema:my_table.json: An internal dbcrossbar schema.

It's also possible to use a schema from an existing database table:

--schema=postgres://localhost:5432/db#table
--schema=bigquery:project:dataset.table

Note that it's possible to create a BigQuery table using a PostgreSQL schema, or vice versa. Internally, all schemes are first converted to the internal schema format.

`--temporary`

Specify temporary storage, which is required by certain drivers. Typical values include:

--temporary=s3://$S3_TEMP_BUCKET
--temporary=gs://$GS_TEMP_BUCKET
--temporary=bigquery:$GCLOUD_PROJECT:temp_dataset

`--to-arg`

This can be used to specify driver-specific options for the destination driver. See the chapter for that driver.

count: Counting records

This command mostly works like the cp command, except that it prints out a number of rows. Check your driver to see if it supports count.

Command-line help


Usage: dbcrossbar count [OPTIONS] <LOCATOR>

Arguments:
  <LOCATOR>  The locator specifying the records to count

Options:
      --schema <SCHEMA>            The schema to use (defaults to input table schema)
      --temporary <TEMPORARIES>    Temporary directories, cloud storage buckets, datasets to use during
                                   transfer (can be repeated)
      --from-arg <FROM_ARGS>       Pass an extra argument of the form `key=value` to the source driver
      --from-format <FROM_FORMAT>  For directory- and file-like data sources, the format to assume. If
                                   not specified, `dbcrossbar` will use the file extension to guess the
                                   format
      --where <WHERE_CLAUSE>       SQL where clause specifying rows to use
  -h, --help                       Print help

EXAMPLE LOCATORS:
    postgres://localhost:5432/db#table
    bigquery:project:dataset.table

`schema conv`: Transforming schemas

The schema conv command can be used to convert between different database schemas. To convert from a PostgreSQL CREATE TABLE statement to a BigQuery schema, run:

dbcrossbar schema conv postgres-sql:table.sql bigquery-schema:table.json

As a handy trick, you can also use a CSV source, which will generate a CREATE TABLE where all columns have the type TEXT:

dbcrossbar schema conv csv:data.csv postgres-sql:table.sql

This can then be edited to specify appropriate column types.

Command-line help


Usage: dbcrossbar schema conv [OPTIONS] <FROM_LOCATOR> <TO_LOCATOR>

Arguments:
  <FROM_LOCATOR>  The input schema
  <TO_LOCATOR>    The output schema

Options:
      --if-exists <IF_EXISTS>  One of `error`, `overrwrite` or `append` [default: error]
      --from-arg <FROM_ARGS>   Pass an extra argument of the form `key=value` to the source driver
      --to-arg <TO_ARGS>       Pass an extra argument of the form `key=value` to the destination driver
  -h, --help                   Print help

EXAMPLE LOCATORS:
    postgres-sql:table.sql
    postgres://localhost:5432/db#table
    bigquery-schema:table.json

Drivers

dbcrossbar uses built-in "drivers" to read and write CSV data and table schemas.

BigML

BigML is a hosted machine-learning service, with support for many common algorithms and server-side batch scripts.

Example locators

Source locators:

bigml:dataset/$ID: Read data from a BigML dataset.

Destination locators:

bigml:source: Create a single BigML "source" resource from the input data.
bigml:sources: Create multiple BigML "source" resources from the input data.
bigml:dataset: Create a single BigML "dataset" resource from the input data.
bigml:datasets: Create multiple BigML "dataset" resources from the input data.

If you use BigML as a destination, dbcrossbar will automatically activate --display-output-locators, and it will print locators for all the created resources on standard output. Column types on created "source" resources will be set something appropriate (but see optype_for_text below.)

Configuration & authentication

The BigML driver requires more configuration than most.

You'll need to set the following environment variables:

BIGML_USERNAME: Set this to your BigML username.
BIGML_API_KEY: Set this to your BigML API key.
BIGML_DOMAIN (optional): Set this to the domain name of your BigML instance, if it's not located at the standard address.
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: Set these to your AWS credentials when using BigML as a destination. Do not set AWS_SESSION_TOKEN; it will not work with BigML.

You'll also need to pass the following on the command line when using:

--temporary=s3://$S3_TEMP_BUCKET: Specify where to stage files for loading into BigML. This is not needed when using BigML as a source.

You can also specify the following --to-arg values:

name: The human-readable name of the resource to create.
optype_for_text: The BigML optype to use for text fields. This defaults to text. You may want to set it to categorical if your text fields contain a small set of fixed strings. But you should probably use dbcrossbar's { "one_of": string_list } types instead, which will always map to categorical.
tag: This may be specified repeatedly to attach tags to the created resources.

Supported features

bigml features:
- conv FROM
- cp FROM:
- cp TO:
  --to-arg=$NAME=$VALUE
  --if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col

Note that --if-exists is simply ignored, because BigML will always create new resources.

BigQuery

Google's BigQuery is a extremely scalable data warehouse that supports rich SQL queries and petabytes of data. If you need to transform or analyze huge data sets, it's an excellent tool.

When loading data into BigQuery, or extracting it, we always go via Google Cloud Storage. This is considerably faster than the load and extract functionality supplied by tools like bq.

Example locators

bigquery:$PROJECT:$DATASET.$TABLE: A BigQuery table.
bigquery-test-fixture:$PROJECT:$DATASET.$TABLE: If you only need a tiny, read-only "table" for testing purposes, you may want to try the bigquery-test-fixture: locator. It currently uses tables.insert to pass a table.view.query with all the table data inlined into the VIEW SQL. This runs about 20 times faster than bigquery:, at the expense of not creating a regular table. Note that the implementation details of this method may change, if we discover a faster or better way to create a small, read-only table.

Configuration & authentication

See the Cloud Storage driver for authentication details.

The following command-line options will usually need to be specified for both sources and destinations:

--temporary=gs://$GS_TEMP_BUCKET: A Google Cloud Storage bucket to use for staging data in both directions.
--temporary=bigquery:$GCLOUD_PROJECT:temp_dataset

You can also specify Google Cloud resource labels to apply to all BigQuery jobs. Labels are often used to track query costs.

--from-arg=job_labels[department]=marketing
--to-arg=job_labels[project]=project1

You can pick a project to run jobs in, for example for billing purposes.

--from-arg=job_project_id=my-gcp-billing-project
--to-arg=job_project_id=my-gcp-billing-project

Supported features

bigquery features:
- conv FROM
- count
  --from-arg=$NAME=$VALUE --where=$SQL_EXPR
- cp FROM:
  --from-arg=$NAME=$VALUE --where=$SQL_EXPR
- cp TO:
  --to-arg=$NAME=$VALUE
  --if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col

CSV

dbcrossbar works with valid CSV files in our CSV interchange format. For invalid CSV files, take a look at scrubcsv. For CSV files which need further transformation and parsing, considering loading them into BigQuery and cleaning them up using SQL. This works very well even for large datasets.

Example locators

The following locators can be used for both input and output:

csv:file.csv: A single CSV file.
csv:dir/: A directory tree containing CSV files.
csv:-: Read from standard input, or write to standard output.

To concatenate CSV files, use:

dbcrossbar cp csv:input/ csv:merged.csv

To split a CSV file, use --stream-size:

dbcrossbar cp --stream-size="100Mb" csv:giant.csv csv:split/

Configuration & authentication

None.

Supported features

csv features:
- conv FROM
- cp FROM:
- cp TO:
  --if-exists=error --if-exists=overwrite

Files

dbcrossbar can read and/or write files in a number of formats, including:

Our CSV interchange format. For invalid CSV files, take a look at scrubcsv. For CSV files which need further transformation and parsing, considering loading them into BigQuery and cleaning them up using SQL. This works very well even for large datasets.
JSON Lines (input only).

Example locators

The following locators can be used for both input and output (if supported by the format):

file:file.csv: A single CSV file.
file:dir/: A directory tree containing CSV files.
file:-: Read from standard input, or write to standard output.

To concatenate CSV files, use:

dbcrossbar cp csv:input/ csv:merged.csv

To split a CSV file, use --stream-size:

dbcrossbar cp --stream-size="100Mb" csv:giant.csv csv:split/

Configuration & authentication

None.

Supported features

file features:
- cp FROM:
  --format=$FORMAT
- cp TO:
  --format=$FORMAT
  --if-exists=error --if-exists=overwrite

Google Cloud Storage

Google Cloud Storage is a bucket-based storage system similar to Amazon's S3. It's frequently used in connection with BigQuery and other Google Cloud services.

Example locators

Source locators:

gs://bucket/dir/file.csv
gs://bucket/dir/

Destination locators:

gs://bucket/dir/

At this point, we do not support single-file output to a cloud bucket. This is relatively easy to add, but has not yet been implemented.

Configuration & authentication

You can authenticate using either a client secret or a service key, which you can create using the console credentials page.

Client secrets can be stored in $DBCROSSBAR_CONFIG_DIR/gcloud_client_secret.json or in GCLOUD_CLIENT_SECRET. These are strongly recommended for interactive use.
Service account keys can be stored in $DBCROSSBAR_CONFIG_DIR/gcloud_service_account_key.json or in GCLOUD_SERVICE_ACCOUNT_KEY. These are recommended for server and container use.

For more information on DBCROSSBAR_CONFIG_DIR, see Configuration.

For a service account, you can use the following permissions:

Storage Object Admin (Cloud Storage and BigQuery drivers)
BigQuery Data Editor (BigQuery driver only)
BigQuery Job User (BigQuery driver only)
BigQuery User (BigQuery driver only)

There's probably a more limited set of permissions which will work if you set them up manually.

Supported features

gs features:
- cp FROM:
- cp TO:
  --if-exists=overwrite

PostgreSQL

PostgreSQL is an excellent general-purpose SQL database.

Example locators

dbcrossbar supports standard PostgreSQL locators followed by #table_name:

postgres://postgres:$PASSWORD@127.0.0.1:5432/postgres#my_table

Note that PostgreSQL sources will currently output all data as a single stream. This can be split into multiple streams using the --stream-size option if desired.

Configuration & authentication

Authentication is currently handled using standard postgres://user:pass@... syntax, similar to psql. We may add alternative mechanisms at some point to avoid passing credentials on the command-line.

Supported features

postgres features:
- conv FROM
- count
  --where=$SQL_EXPR
- cp FROM:
  --where=$SQL_EXPR
- cp TO:
  --if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col

RedShift

Amazon's Redshift is a cloud-based data warehouse designed to support analytical queries. This driver receives less testing than our BigQuery driver, because the cheapest possible RedShift test system costs over $100/month. Sponsors are welcome!

Example locators

These are identical to PostgreSQL locators, except that postgres is replaced by redshift:

redshift://postgres:$PASSWORD@127.0.0.1:5432/postgres#my_table

Configuration & authentication

Authentication is currently handled using the redshift://user:pass@... syntax. We may add alternative mechanisms at some point to avoid passing credentials on the command-line.

The following environment variables are required.

AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: Set these to your AWS credentials.
AWS_SESSION_TOKEN (optional): This should work, but it hasn't been tested.

The following --temporary flag is required:

--temporary=s3://$S3_TEMP_BUCKET: Specify where to stage files for loading or unloading data.

Authentication credentials for COPY may be passed using --to-arg. For example:

--to-arg=iam_role=$ROLE
--to-arg=region=$REGION

This may require some experimentation.

If you need to generate "-- partner:" SQL comments for an AWS RedShift partner program, you can do it as follows:

--to-arg=partner="myapp v1.0"

Supported features

redshift features:
- conv FROM
- cp FROM:
  --from-arg=$NAME=$VALUE --where=$SQL_EXPR
- cp TO:
  --to-arg=$NAME=$VALUE
  --if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col

S3

Amazon's S3 is a bucket-based system for storing data in the cloud.

Example locators

Source locators:

s3://bucket/dir/file.csv
s3://bucket/dir/

Destination locators:

s3://bucket/dir/

At this point, we do not support single-file output to a cloud bucket. This is relatively easy to add, but has not yet been implemented.

Configuration & authentication

The following environment variables are used to authenticate:

AWS_ACCESS_KEY_ID (required): The ID for your AWS credentials.
AWS_SECRET_ACCESS_KEY (required): The secret part of your AWS credentials.
AWS_SESSION_TOKEN (optional): Set this to use temporary AWS crdentials.
AWS_DEFAULT_REGION (required): Set this to your AWS region.

Supported features

s3 features:
- cp FROM:
- cp TO:
  --if-exists=overwrite

Shopify (UNSTABLE)

WARNING: This is highly experimental and will likely be removed in a future release in favor of the new JSON Lines support. To use it for now, you must enable it using the --enable-unstable flag.

Shopify is an online e-commerce platform with a REST API for fetching data.

Example locators

Locators look just like Shopify REST API URLs, but with https: replaced with shopify:

shopify://$SHOP/admin/api/2020-04/orders.json?status=any

For a schema, download shopify.ts, and refer to it as follows:

--schema="dbcrossbar-ts:shopify.ts#Order"

We do not currently include a default Shopify schema in dbcrossbar itself, because it's still undergoing significant changes.

Configuration & authentication

The following environment variables are required:

SHOPIFY_AUTH_TOKEN: The Shopify authorization token to use. (We don't yet support password authentication, but it would be easy enough to add.)

Supported features

shopify features:
- cp FROM:

This driver is UNSTABLE and may change without warning.

Schema drivers

dbcrossbar allows you to specify a table's column names and types in a number of different ways. You can use Postgres CREATE TABLE statements, or BigQuery schema JSON, or dbcrossbar's internal schema format.

These schema formats are typically used in one of two ways:

As a --schema argument to the cp subcommand.

dbcrossbar cp \
  --if-exists=overwrite \
  --schema=postgres-sql:my_table.sql \
  csv:my_table.csv \
  'postgres://postgres@127.0.0.1:5432/postgres#my_table'

As an argument to the conv subcommand, which allows you to convert between different schema formats.
```
dbcrossbar schema conv postgres-sql:table.sql bigquery-schema:table.json
```

Postgres `CREATE TABLE` statements

To specify the column names and types for table in SQL format, use:

--schema postgres-sql:my_table.sql

The file my_table.sql can contain a single CREATE TABLE statement using a subset of PostgreSQL's syntax:

CREATE TABLE my_table (
    id INT NOT NULL,
    name TEXT NOT NULL,
    quantity INT NOT NULL
);

Limitations

This schema format offers support for singly-nested array types, and it doesn't support structure types at all.

BigQuery JSON schemas

To specify the column names and types for table in BigQuery JSON format, use:

--schema bigquery-schema:my_table.json

The file my_table.json should be a BigQuery JSON schema file:

[
  {
    "name": "id",
    "type": "INT64",
    "mode": "REQUIRED"
  },
  {
    "name": "name",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "quantity",
    "type": "INT64",
    "mode": "REQUIRED"
  }
]

Limitations

This schema format supports a small number of general types. For example, all integer types are represented as INT64, all floating-point types are represented as FLOAT64, and both JSON values and UUIDs are represented as STRING.

Native `dbcrossbar` schemas

dbcrossbar supports a native schema format that exactly represents all types supported by dbcrossbar. It can be used as follows:

--schema dbcrossbar-schema:my_table.json

For more details and example, see the chaper on portable table schemas.

Typical uses

This format is cumbersome to edit by hand, but it is fairly useful in a number of circumstances:

Specifying column types that can't be exactly represented by other schema formats.
Reading or editing schemas using scripts.

TypeScript schemas (UNSTABLE)

WARNING: This is highly experimental and subject to change. To use it, you must enable it using the --enable-unstable flag.

To specify the column names and types for table using a subset of TypeScript, use:

--schema "dbcrossbar-ts:my_table.ts#MyTable"

The file my_table.ts can contain one or more interface definitions:

interface MyTable {
    id: string,
    name: string,
    quantity: number,
}

"Magic" types

Certain dbcrossbar types can be specified by adding the following declarations to a TypeScript file:

// Decimal numbers which can exactly represent
// currency values with no rounding.
type decimal = number | string;

// Integers of various sizes.
type int16 = number | string;
type int32 = number | string;
type int64 = number | string;

These may then be used as follows:

interface OrderItem {
    id: int64,
    sku: string,
    unit_price: decimal,
    quantity: int16,
}

When the TypeScript schema is converted to a portable dbcrossbar schema, the "magic" types will be replaced with the corresponding portable type.

Advanced features

We also support nullable values, arrays and nested structures:

type decimal = number | string;
type int16 = number | string;
type int32 = number | string;
type int64 = number | string;

interface Order {
    id: int64,
    line_items: OrderItem[],
    note: string | null,
}

interface OrderItem {
    id: int64,
    sku: string,
    unit_price: decimal,
    quantity: int16,
}

Nested arrays and structs will translate to appropriate database-specific types, such as BigQuery ARRAY and STRUCT types.

Limitations

This schema format has a number of limitations:

There's no way to convert other schema formats into this one (yet).
Some portable dbcrossbar types can't be represented in this format.
Only a small subset of TypeScript is supported (but we try to give good error messages).

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Using dbcrossbar