dbcrossbar
Guide
dbcrossbar
is an open source tool that copies large, tabular datasets between many different databases and storage formats. Data can be copied from any source to any destination.
An example
If we have a CSV file my_table.csv
containing data:
id,name,quantity
1,Blue widget,10
2,Red widget,50
And a file my_table.sql
containing a table definition:
CREATE TABLE my_table (
id INT NOT NULL,
name TEXT NOT NULL,
quantity INT NOT NULL
);
Then we can use these to create a PostgreSQL table:
dbcrossbar cp \
--if-exists=overwrite \
--schema=postgres-sql:my_table.sql \
csv:my_table.csv \
'postgres://postgres@127.0.0.1:5432/postgres#my_table'
If we want to use this data to update an existing table in BigQuery, we can upsert into BigQuery using the id
column:
dbcrossbar config add temporary gs://$GS_TEMP_BUCKET
dbcrossbar config add temporary bigquery:$GCLOUD_PROJECT:temp_dataset
dbcrossbar cp \
--if-exists=upsert-on:id \
'postgres://postgres@127.0.0.1:5432/postgres#my_table' \
bigquery:$GCLOUD_PROJECT:my_dataset.my_table
This will stream the data out of PostgreSQL, upload it to $GS_TEMP_BUCKET
, import it into a temporary BigQuery table, and run an appropriate MERGE
command. The config add temporary
commands tell dbcrossbar
what cloud bucket and BigQuery dataset should be used for temporary files and tables, respectively.
Notice that we don't need to specify --schema
, because dbcrossbar
will automatically translate the PostgreSQL column types to corresponding BigQuery types.
Credits
dbcrossbar
is generously supported by Faraday and by open source contributors. Please see the credits for more information.
Features & philosophy
dbcrossbar
is designed to do a few things well. Typically, dbcrossbar
is used for loading raw data, and for moving data back and forth between production databases and data warehouses. It supports a few core features:
- Copying tables.
- Counting records in tables.
- Converting between different table schema formats, including PostgreSQL
CREATE TABLE
statements and BigQuery schema JSON.
dbcrossbar
offers a number of handy features:
- A single static binary on Linux, with no dependencies.
- A stream-based architecture that limits the use of RAM and requires no temporary files.
- Support for appending, overwriting or upserting into existing tables.
- Support for selecting records using
--where
.
dbcrossbar
also supports a rich variety of portable column types:
- Common types, including booleans, dates, timestamps, floats, integers, and text.
- UUIDs.
- JSON.
- GeoJSON.
- Arrays.
Non-features
The following features are explicitly excluded from dbcrossbar
's mission:
- Data cleaning and transformation.
- Fixing invalid column names.
- Copying multiple tables at a time.
- Automatically copying constraints, foreign keys, etc.
If you need these features, then take a look at tools like scrubcsv
and pgloader
.
Installing
Pre-built binaries for dbcrossbar
are available on GitHub. These currently include:
- Fully-static Linux x86_64 binaries, which should work on any modern distribution (including Alpine Linux containers).
- MacOS X binaries. Note that these are unsigned, and you may need to use
xattr -d com.apple.quarantine dbcrossbar
and spctl --add dbcrossbar
to make them runnable.
Windows binaries are not available at this time, but it may be possible to build them with a little work.
Required tools
To use the S3 and RedShift drivers, you will need to install the AWS CLI tools. We plan to replace the AWS CLI tools with native Rust libraries before the 1.0 release.
Installing using cargo
You can also install dbcrossbar
using cargo
. First, you will need to make sure you have the necessary C dependencies installed:
# Ubuntu Linux (might be incomplete).
sudo apt install build-essential
Then, you can install using cargo
:
cargo install dbcrossbar
Building from source
The source code is available on GitHub. First, install the build dependencies as described above. Then run:
git clone https://github.com/dbcrossbar/dbcrossbar.git
cd dbcrossbar
cargo build --release
This will create target/release/dbcrossbar
.
How it works
dbcrossbar
uses pluggable input and output drivers, allowing any input to be copied to any output.
Parallel data streams
Internally, dbcrossbar
uses parallel data streams. If we copy s3://example/
to csv:out/
using --max-streams=4
, this will run up to 4 copies in parallel.
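For example, assuming the S3 driver's AWS environment variables are already configured, the command line for such a copy might look like this (the bucket path is illustrative):
dbcrossbar cp \
--max-streams=4 \
s3://example/ \
csv:out/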
As soon as one stream finishes, a new one will be started.
dbcrossbar
accomplishes this using a stream of CSV streams. This allows us to make extensive use of backpressure to control how data flows through the system, eliminating the need for temporary files. This makes it easier to work with 100GB+ CSV files and 1TB+ datasets.
Shortcuts
When copying between certain drivers, dbcrossbar
supports "shortcuts." For example, it can load data directly from Google Cloud Storage into BigQuery.
Multi-threaded, asynchronous Rust
dbcrossbar
is written using asynchronous Rust, and it makes heavy use of a multi-threaded worker pool. Internally, it works something like a set of classic Unix pipelines running in parallel. Thanks to Rust, it has been possible to get native performance and multithreading without spending too much time debugging.
CSV interchange format
Internally, dbcrossbar
converts all data into CSV streams. For many standard types, all input drivers are required to provide byte-for-byte identical CSV data:
id,test_bool,test_date,test_int16,test_int32,test_int64,test_text,test_timestamp_without_time_zone,test_timestamp_with_time_zone,test_uuid,select,testCapitalized,test_enum
1,t,1969-07-20,-32768,-2147483648,-9223372036854775808,hello,1969-07-20T20:17:39,1969-07-20T20:17:39Z,084ec3bb-3193-4ffb-8b74-99a288e8432c,,,red
2,f,2001-01-01,32767,2147483647,9223372036854775807,,,,,,,green
3,,,,,,,,,,,,
For more complex types such as arrays, structs, JSON, and GeoJSON data, we embed JSON into the CSV file:
test_null,test_not_null,test_bool,test_bool_array,test_date,test_date_array,test_float32,test_float32_array,test_float64,test_float64_array,test_geojson,test_geojson_3857,test_int16,test_int16_array,test_int32,test_int32_array,test_int64,test_int64_array,test_json,test_text,test_text_array,test_timestamp_without_time_zone,test_timestamp_without_time_zone_array,test_timestamp_with_time_zone,test_timestamp_with_time_zone_array,test_uuid,test_uuid_array,test_enum
,hi,t,"[true,false]",1969-07-20,"[""1969-07-20""]",1e+37,"[1e-37,0,100.125,1e+37]",1e+37,"[1e-37,0,1000.125,1e+37]","{""type"":""Point"",""coordinates"":[-71,42]}","{""type"":""Point"",""coordinates"":[-71,42]}",16,"[-32768,0,32767]",32,"[-2147483648,0,2147483647]",64,"[""-9223372036854775808"",""0"",""9223372036854775807""]","{""x"": 1, ""y"": 2}",hello,"[""hello"",""""]",1969-07-20T20:17:39.5,"[""1969-07-20T20:17:39.5""]",1969-07-20T20:17:39.5Z,"[""1969-07-20T20:17:39.5Z""]",084ec3bb-3193-4ffb-8b74-99a288e8432c,"[""084ec3bb-3193-4ffb-8b74-99a288e8432c""]",red
Tricks for preparing CSV data
If your input CSV files use an incompatible format, there are several things that might help. If your CSV files are invalid, non-standard, or full of junk, then you may be able to use scrubcsv
or xsv
to fix the worst problems.
If you need to clean up your data manually, then you may want to consider using dbcrossbar
to load your data into BigQuery, and set your columns to type STRING
. Once this is done, you can parse and normalize your data quickly using SQL queries.
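As a sketch of this approach, the command below assumes a hypothetical messy.csv file and a hand-written raw_table.json BigQuery schema that declares every column as STRING; the bucket, project, and dataset names are placeholders:
dbcrossbar cp \
--if-exists=overwrite \
--temporary=gs://$GS_TEMP_BUCKET \
--temporary=bigquery:$GCLOUD_PROJECT:temp_dataset \
--schema=bigquery-schema:raw_table.json \
csv:messy.csv \
bigquery:$GCLOUD_PROJECT:raw_dataset.messy
Once the raw data is loaded, it can be parsed and normalized with ordinary BigQuery SQL.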
Portable table schema
Internally, dbcrossbar
uses a portable table "schema" format. This provides a common ground between PostgreSQL's CREATE TABLE
statements, BigQuery's JSON schemas, and equivalent formats for other databases.
All table schemas and column types are converted into the portable format and then into the appropriate destination format.
Normally, you won't need to work with this schema format directly, because dbcrossbar
can parse BigQuery schemas, PostgreSQL CREATE TABLE
statements, and several other popular schema formats. It can also read schemas directly from some databases. See the schema conv
command for details.
Example schema
{
"named_data_types": [
{
"name": "format",
"data_type": {
"one_of": [
"gif",
"jpeg"
]
}
}
],
"tables": [
{
"name": "images",
"columns": [
{
"name": "id",
"is_nullable": false,
"data_type": "uuid"
},
{
"name": "url",
"is_nullable": false,
"data_type": "text"
},
{
"name": "format",
"is_nullable": true,
"data_type": {
"named": "format"
}
},
{
"name": "metadata",
"is_nullable": true,
"data_type": "json"
},
{
"name": "thumbnails",
"is_nullable": true,
"data_type": {
"array": {
"struct": [
{
"name": "url",
"is_nullable": false,
"data_type": "text"
},
{
"name": "width",
"data_type": "float64",
"is_nullable": false
},
{
"name": "height",
"data_type": "float64",
"is_nullable": false
}
]
}
}
}
]
}
]
}
Schema properties
- named_data_types (experimental): Named data types. These are used to declare custom column types. They're analogous to a Postgres CREATE TYPE statement, or a TypeScript interface or type statement. Two types with different names but the same layout are considered to be different types (nominal typing). Many database drivers place restrictions on these types for now, but we hope to relax those restrictions in the future.
- tables: A list of table definitions. For now, this must contain exactly one element.
Table properties
- name: The name of this table. This is normally only used when serializing to schema formats that require a table name.
- columns: A list of columns in the table.
Column properties
- name: The name of the column.
- is_nullable: Can the column contain NULL values?
- data_type: The type of data stored in the column.
Data types
The data_type
field can contain any of:
{ "array": element_type }
: An array ofelement_type
values."bool"
: A boolean value."date"
: A date, with no associated time value."decimal"
: A decimal integer (can represent currency, etc., without rounding errors)."float32"
: A 32-bit floating point number."float64"
: A 64-bit floating point number.{ "geo_json": srid }
: Geodata in GeoJSON format, using the specified SRID, to specify the spatial reference system."int16"
: A 16-bit signed integer."int32"
: A 32-bit signed integer."int64"
: A 64-bit signed integer."json"
: An arbitrary JSON value.{ "named": name }
: A named type from thenamed_types
list in this schema.{ "one_of": string_list }
: One of the specified string values. This may be represented a string enumeration by certain drivers, or as a "categorical" value in machine-learning systems (as opposed to a free-form textual value). NOTE: In many cases, it would be more efficient to have a separate table with an(id, value)
entry for each enum value, and to refer to it using a foreign key.one_of
is most useful when importing cleaned data, and when working with databases that support efficient string enumerations.{ "struct": fields }
: A structure with a list of specific, named fields. Each field has the following properties:name
: The name of the field.is_nullable
: Can the field containNULL
values?data_type
: The type of data stored in the field.
"text"
: A string."timestamp_without_time_zone"
: A date and time without an associated timezone."timestamp_with_time_zone"
: A date and time with an associated timezone."uuid"
: A UUID value.
Configuring dbcrossbar
dbcrossbar
can read information from a configuration directory. By default, this can be found at:
- Linux:
~/.config
- MacOS:
~/Library/Preferences
To override this default location, you can set DBCROSSBAR_CONFIG_DIR
to point to an alternate configuration directory.
If a file dbcrossbar.toml
appears in this directory, dbcrossbar
will read its configuration from that file. Other files may be placed in this directory, including certain local credential files.
Modifying the configuration file
You can modify the dbcrossbar.toml
file from the command line using the config
subcommand. For example:
dbcrossbar config add temporary s3://example/temp/
dbcrossbar config rm temporary s3://example/temp/
Using config add temporary
allows you to specify default values for --temporary
flags. You can still override specific defaults by passing --temporary
to commands that use it.
Commands
dbcrossbar
supports the following main subcommands:
- dbcrossbar cp: Copy tabular data.
- dbcrossbar count: Count records.
- dbcrossbar schema conv: Convert table schemas between databases.
For more information, type dbcrossbar --help
or dbcrossbar $CMD --help
.
Not all drivers support all the features of each command. To see the available drivers and what commands they support, run dbcrossbar features
and dbcrossbar features $DRIVER_NAME
.
cp
: Copying tables
The cp
command copies tabular data from a source location to a destination location. For example, we can copy a CSV file into PostgreSQL, replacing any existing table:
dbcrossbar cp \
--if-exists=overwrite \
--schema=postgres-sql:my_table.sql \
csv:my_table.csv \
'postgres://postgres@127.0.0.1:5432/postgres#my_table'
Or we can copy data from PostgreSQL and upsert it into a BigQuery table:
dbcrossbar config add temporary gs://$GS_TEMP_BUCKET
dbcrossbar config add temporary bigquery:$GCLOUD_PROJECT:temp_dataset
dbcrossbar cp \
--if-exists=upsert-on:id \
'postgres://postgres@127.0.0.1:5432/postgres#my_table' \
bigquery:$GCLOUD_PROJECT:my_dataset.my_table
Command-line help
Usage: dbcrossbar cp [OPTIONS] <FROM_LOCATOR> <TO_LOCATOR>
Arguments:
<FROM_LOCATOR> The input table
<TO_LOCATOR> The output table
Options:
--if-exists <IF_EXISTS> One of `error`, `overwrite`, `append` or `upsert-on:COL` [default:
error]
--schema <SCHEMA> The schema to use (defaults to input table schema)
--temporary <TEMPORARIES> Temporary directories, cloud storage buckets, datasets to use during
transfer (can be repeated)
--stream-size <STREAM_SIZE> Specify the approximate size of the CSV streams manipulated by
`dbcrossbar`. This can be used to split a large input into multiple
smaller outputs. Actual data streams may be bigger or smaller
depending on a number of factors. Examples: "100000", "1Gb"
--from-arg <FROM_ARGS> Pass an extra argument of the form `key=value` to the source driver
--from-format <FROM_FORMAT> For directory- and file-like data sources, the format to assume. If
not specified, `dbcrossbar` will use the file extension to guess the
format
--to-arg <TO_ARGS> Pass an extra argument of the form `key=value` to the destination
driver
-F, --to-format <TO_FORMAT> For directory-like data destinations, the format to use. If not
specified, `dbcrossbar` will use the destination file extension (if
provided) or `csv`
--where <WHERE_CLAUSE> SQL where clause specifying rows to use
-J, --max-streams <MAX_STREAMS> How many data streams should we attempt to copy in parallel? [default:
4]
--display-output-locators Display where we wrote our output data
-h, --help Print help
EXAMPLE LOCATORS:
postgres://localhost:5432/db#table
bigquery:project:dataset.table
Flags
Not all command-line options are supported by all drivers. See the chapter on each driver for details.
--where
Specify a WHERE
clause to include in the SQL query. This can be used to select a subset of the source rows.
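For example, a copy that only extracts recent rows from PostgreSQL might look like the sketch below; the created_at column and output file name are illustrative:
dbcrossbar cp \
--where="created_at >= '2024-01-01'" \
'postgres://postgres@127.0.0.1:5432/postgres#my_table' \
csv:recent_rows.csv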
--from-arg
This can be used to specify driver-specific options for the source driver. See the chapter for that driver.
--if-exists=error
If the destination location already contains data, exit with an error.
--if-exists=append
If the destination location already contains data, append the new data.
--if-exists=overwrite
If the destination location already contains data, replace it with the new data.
--if-exists=upsert-on:COL1,...
For every row in the new data:
- If a row with a matching col1, col2, etc., exists, use the new data to update that row.
- If no row matching col1, col2, etc., exists, then insert the new row.
The columns col1, col2, etc., must be marked as NOT NULL.
--schema
By default, dbcrossbar
will use the schema of the source table. But when this can't be inferred automatically, --schema
can be used to specify a table schema:
- --schema=postgres-sql:my_table.sql: A PostgreSQL CREATE TABLE statement.
- --schema=bigquery-schema:my_table.json: A BigQuery JSON schema.
- --schema=dbcrossbar-schema:my_table.json: An internal dbcrossbar schema.
It's also possible to use a schema from an existing database table:
--schema=postgres://localhost:5432/db#table
--schema=bigquery:project:dataset.table
Note that it's possible to create a BigQuery table using a PostgreSQL schema, or vice versa. Internally, all schemas are first converted to the internal schema format.
--temporary
Specify temporary storage, which is required by certain drivers. Typical values include:
--temporary=s3://$S3_TEMP_BUCKET
--temporary=gs://$GS_TEMP_BUCKET
--temporary=bigquery:$GCLOUD_PROJECT:temp_dataset
--to-arg
This can be used to specify driver-specific options for the destination driver. See the chapter for that driver.
count: Counting records
This command mostly works like the cp
command, except that it prints out a number of rows. Check your driver to see if it supports count
.
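For example, counting the rows of the PostgreSQL table used earlier (the WHERE clause is illustrative and optional):
dbcrossbar count \
--where="quantity > 0" \
'postgres://postgres@127.0.0.1:5432/postgres#my_table'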
Command-line help
Usage: dbcrossbar count [OPTIONS] <LOCATOR>
Arguments:
<LOCATOR> The locator specifying the records to count
Options:
--schema <SCHEMA> The schema to use (defaults to input table schema)
--temporary <TEMPORARIES> Temporary directories, cloud storage buckets, datasets to use during
transfer (can be repeated)
--from-arg <FROM_ARGS> Pass an extra argument of the form `key=value` to the source driver
--from-format <FROM_FORMAT> For directory- and file-like data sources, the format to assume. If
not specified, `dbcrossbar` will use the file extension to guess the
format
--where <WHERE_CLAUSE> SQL where clause specifying rows to use
-h, --help Print help
EXAMPLE LOCATORS:
postgres://localhost:5432/db#table
bigquery:project:dataset.table
schema conv
: Transforming schemas
The schema conv
command can be used to convert between different database schemas. To convert from a PostgreSQL CREATE TABLE
statement to a BigQuery schema, run:
dbcrossbar schema conv postgres-sql:table.sql bigquery-schema:table.json
As a handy trick, you can also use a CSV source, which will generate a CREATE TABLE
where all columns have the type TEXT
:
dbcrossbar schema conv csv:data.csv postgres-sql:table.sql
This can then be edited to specify appropriate column types.
Command-line help
Usage: dbcrossbar schema conv [OPTIONS] <FROM_LOCATOR> <TO_LOCATOR>
Arguments:
<FROM_LOCATOR> The input schema
<TO_LOCATOR> The output schema
Options:
--if-exists <IF_EXISTS> One of `error`, `overwrite` or `append` [default: error]
--from-arg <FROM_ARGS> Pass an extra argument of the form `key=value` to the source driver
--to-arg <TO_ARGS> Pass an extra argument of the form `key=value` to the destination driver
-h, --help Print help
EXAMPLE LOCATORS:
postgres-sql:table.sql
postgres://localhost:5432/db#table
bigquery-schema:table.json
Drivers
dbcrossbar
uses built-in "drivers" to read and write CSV data and table schemas.
BigML
BigML is a hosted machine-learning service, with support for many common algorithms and server-side batch scripts.
Example locators
Source locators:
- bigml:dataset/$ID: Read data from a BigML dataset.
Destination locators:
- bigml:source: Create a single BigML "source" resource from the input data.
- bigml:sources: Create multiple BigML "source" resources from the input data.
- bigml:dataset: Create a single BigML "dataset" resource from the input data.
- bigml:datasets: Create multiple BigML "dataset" resources from the input data.
If you use BigML as a destination, dbcrossbar
will automatically activate --display-output-locators
, and it will print locators for all the created resources on standard output. Column types on created "source" resources will be set to something appropriate (but see optype_for_text below).
Configuration & authentication
The BigML driver requires more configuration than most.
You'll need to set the following environment variables:
- BIGML_USERNAME: Set this to your BigML username.
- BIGML_API_KEY: Set this to your BigML API key.
- BIGML_DOMAIN (optional): Set this to the domain name of your BigML instance, if it's not located at the standard address.
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: Set these to your AWS credentials when using BigML as a destination. Do not set AWS_SESSION_TOKEN; it will not work with BigML.
You'll also need to pass the following on the command line when using BigML as a destination:
- --temporary=s3://$S3_TEMP_BUCKET: Specify where to stage files for loading into BigML. This is not needed when using BigML as a source.
You can also specify the following --to-arg
values:
- name: The human-readable name of the resource to create.
- optype_for_text: The BigML optype to use for text fields. This defaults to text. You may want to set it to categorical if your text fields contain a small set of fixed strings. But you should probably use dbcrossbar's { "one_of": string_list } types instead, which will always map to categorical.
- tag: This may be specified repeatedly to attach tags to the created resources.
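Putting these pieces together, a minimal sketch of loading a CSV file into a new BigML dataset might look like this (assuming BIGML_USERNAME, BIGML_API_KEY, and the AWS variables above are set; the bucket and resource name are placeholders):
dbcrossbar cp \
--temporary=s3://$S3_TEMP_BUCKET \
--to-arg=name="My example dataset" \
csv:my_table.csv \
bigml:dataset
The locator of the created dataset will be printed on standard output.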
Supported features
bigml features:
- conv FROM
- cp FROM:
- cp TO:
--to-arg=$NAME=$VALUE
--if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col
Note that --if-exists
is simply ignored, because BigML will always create new resources.
BigQuery
Google's BigQuery is an extremely scalable data warehouse that supports rich SQL queries and petabytes of data. If you need to transform or analyze huge data sets, it's an excellent tool.
When loading data into BigQuery, or extracting it, we always go via Google Cloud Storage. This is considerably faster than the load and extract functionality supplied by tools like bq
.
Example locators
- bigquery:$PROJECT:$DATASET.$TABLE: A BigQuery table.
- bigquery-test-fixture:$PROJECT:$DATASET.$TABLE: If you only need a tiny, read-only "table" for testing purposes, you may want to try the bigquery-test-fixture: locator. It currently uses tables.insert to pass a table.view.query with all the table data inlined into the VIEW SQL. This runs about 20 times faster than bigquery:, at the expense of not creating a regular table. Note that the implementation details of this method may change, if we discover a faster or better way to create a small, read-only table.
Configuration & authentication
See the Cloud Storage driver for authentication details.
The following command-line options will usually need to be specified for both sources and destinations:
- --temporary=gs://$GS_TEMP_BUCKET: A Google Cloud Storage bucket to use for staging data in both directions.
- --temporary=bigquery:$GCLOUD_PROJECT:temp_dataset: A BigQuery dataset to use for temporary tables.
You can also specify Google Cloud resource labels to apply to all BigQuery jobs. Labels are often used to track query costs.
--from-arg=job_labels[department]=marketing
--to-arg=job_labels[project]=project1
You can pick a project to run jobs in, for example for billing purposes.
--from-arg=job_project_id=my-gcp-billing-project
--to-arg=job_project_id=my-gcp-billing-project
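Putting it together, a sketch of appending a CSV file to a BigQuery table while labeling and billing the jobs might look like this (all label values and project names are illustrative):
dbcrossbar cp \
--if-exists=append \
--temporary=gs://$GS_TEMP_BUCKET \
--temporary=bigquery:$GCLOUD_PROJECT:temp_dataset \
--to-arg=job_labels[department]=marketing \
--to-arg=job_project_id=my-gcp-billing-project \
csv:my_table.csv \
bigquery:$GCLOUD_PROJECT:my_dataset.my_table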
Supported features
bigquery features:
- conv FROM
- count
--from-arg=$NAME=$VALUE --where=$SQL_EXPR
- cp FROM:
--from-arg=$NAME=$VALUE --where=$SQL_EXPR
- cp TO:
--to-arg=$NAME=$VALUE
--if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col
CSV
dbcrossbar
works with valid CSV files in our CSV interchange format. For invalid CSV files, take a look at scrubcsv
. For CSV files which need further transformation and parsing, consider loading them into BigQuery and cleaning them up using SQL. This works very well even for large datasets.
Example locators
The following locators can be used for both input and output:
- csv:file.csv: A single CSV file.
- csv:dir/: A directory tree containing CSV files.
- csv:-: Read from standard input, or write to standard output.
To concatenate CSV files, use:
dbcrossbar cp csv:input/ csv:merged.csv
To split a CSV file, use --stream-size
:
dbcrossbar cp --stream-size="100Mb" csv:giant.csv csv:split/
Configuration & authentication
None.
Supported features
csv features:
- conv FROM
- cp FROM:
- cp TO:
--if-exists=error --if-exists=overwrite
Files
dbcrossbar
can read and/or write files in a number of formats, including:
- Our CSV interchange format. For invalid CSV files, take a look at scrubcsv. For CSV files which need further transformation and parsing, consider loading them into BigQuery and cleaning them up using SQL. This works very well even for large datasets.
- JSON Lines (input only).
Example locators
The following locators can be used for both input and output (if supported by the format):
- file:file.csv: A single CSV file.
- file:dir/: A directory tree containing CSV files.
- file:-: Read from standard input, or write to standard output.
To concatenate CSV files, use:
dbcrossbar cp csv:input/ csv:merged.csv
To split a CSV file, use --stream-size
:
dbcrossbar cp --stream-size="100Mb" csv:giant.csv csv:split/
Configuration & authentication
None.
Supported features
file features:
- cp FROM:
--format=$FORMAT
- cp TO:
--format=$FORMAT
--if-exists=error --if-exists=overwrite
Google Cloud Storage
Google Cloud Storage is a bucket-based storage system similar to Amazon's S3. It's frequently used in connection with BigQuery and other Google Cloud services.
Example locators
Source locators:
gs://bucket/dir/file.csv
gs://bucket/dir/
Destination locators:
gs://bucket/dir/
At this point, we do not support single-file output to a cloud bucket. This is relatively easy to add, but has not yet been implemented.
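For example, a sketch of copying a local CSV file into a bucket directory (the bucket name is a placeholder):
dbcrossbar cp \
--if-exists=overwrite \
csv:my_table.csv \
gs://$GS_BUCKET/my_table/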
Configuration & authentication
You can authenticate using either a client secret or a service key, which you can create using the console credentials page.
- Client secrets can be stored in $DBCROSSBAR_CONFIG_DIR/gcloud_client_secret.json or in GCLOUD_CLIENT_SECRET. These are strongly recommended for interactive use.
- Service account keys can be stored in $DBCROSSBAR_CONFIG_DIR/gcloud_service_account_key.json or in GCLOUD_SERVICE_ACCOUNT_KEY. These are recommended for server and container use.
For more information on DBCROSSBAR_CONFIG_DIR
, see Configuration.
For a service account, you can use the following permissions:
- Storage Object Admin (Cloud Storage and BigQuery drivers)
- BigQuery Data Editor (BigQuery driver only)
- BigQuery Job User (BigQuery driver only)
- BigQuery User (BigQuery driver only)
There's probably a more limited set of permissions which will work if you set them up manually.
Supported features
gs features:
- cp FROM:
- cp TO:
--if-exists=overwrite
PostgreSQL
PostgreSQL is an excellent general-purpose SQL database.
Example locators
dbcrossbar
supports standard PostgreSQL locators followed by #table_name
:
postgres://postgres:$PASSWORD@127.0.0.1:5432/postgres#my_table
Note that PostgreSQL sources will currently output all data as a single stream. This can be split into multiple streams using the --stream-size
option if desired.
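For example, a sketch of exporting a table as multiple CSV files of roughly 500 MB each (the size is illustrative):
dbcrossbar cp \
--stream-size="500Mb" \
'postgres://postgres@127.0.0.1:5432/postgres#my_table' \
csv:out/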
Configuration & authentication
Authentication is currently handled using standard postgres://user:pass@...
syntax, similar to psql
. We may add alternative mechanisms at some point to avoid passing credentials on the command-line.
Supported features
postgres features:
- conv FROM
- count
--where=$SQL_EXPR
- cp FROM:
--where=$SQL_EXPR
- cp TO:
--if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col
RedShift
Amazon's Redshift is a cloud-based data warehouse designed to support analytical queries. This driver receives less testing than our BigQuery driver, because the cheapest possible RedShift test system costs over $100/month. Sponsors are welcome!
Example locators
These are identical to PostgreSQL locators, except that postgres
is replaced by redshift
:
redshift://postgres:$PASSWORD@127.0.0.1:5432/postgres#my_table
Configuration & authentication
Authentication is currently handled using the redshift://user:pass@...
syntax. We may add alternative mechanisms at some point to avoid passing credentials on the command-line.
The following environment variables are required:
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: Set these to your AWS credentials.
- AWS_SESSION_TOKEN (optional): This should work, but it hasn't been tested.
The following --temporary
flag is required:
--temporary=s3://$S3_TEMP_BUCKET
: Specify where to stage files for loading or unloading data.
Authentication credentials for COPY
may be passed using --to-arg
. For example:
--to-arg=iam_role=$ROLE
--to-arg=region=$REGION
This may require some experimentation.
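As a rough sketch, an upsert into RedShift might look like the following; the cluster hostname, database, and IAM role are placeholders:
dbcrossbar cp \
--if-exists=upsert-on:id \
--temporary=s3://$S3_TEMP_BUCKET \
--to-arg=iam_role=$ROLE \
csv:my_table.csv \
'redshift://username:$PASSWORD@example-cluster.example.com:5439/dev#my_table'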
If you need to generate "-- partner:" SQL comments for an AWS RedShift partner program, you can do it as follows:
--to-arg=partner="myapp v1.0"
Supported features
redshift features:
- conv FROM
- cp FROM:
--from-arg=$NAME=$VALUE --where=$SQL_EXPR
- cp TO:
--to-arg=$NAME=$VALUE
--if-exists=error --if-exists=append --if-exists=overwrite --if-exists=upsert-on:col
S3
Amazon's S3 is a bucket-based system for storing data in the cloud.
Example locators
Source locators:
s3://bucket/dir/file.csv
s3://bucket/dir/
Destination locators:
s3://bucket/dir/
At this point, we do not support single-file output to a cloud bucket. This is relatively easy to add, but has not yet been implemented.
Configuration & authentication
The following environment variables are used to authenticate:
- AWS_ACCESS_KEY_ID (required): The ID for your AWS credentials.
- AWS_SECRET_ACCESS_KEY (required): The secret part of your AWS credentials.
- AWS_SESSION_TOKEN (optional): Set this to use temporary AWS credentials.
- AWS_DEFAULT_REGION (required): Set this to your AWS region.
Supported features
s3 features:
- cp FROM:
- cp TO:
--if-exists=overwrite
Shopify (UNSTABLE)
WARNING: This is highly experimental and will likely be removed in a future release in favor of the new JSON Lines support. To use it for now, you must enable it using the --enable-unstable
flag.
Shopify is an online e-commerce platform with a REST API for fetching data.
Example locators
Locators look just like Shopify REST API URLs, but with the https: scheme replaced by shopify:. For example:
shopify://$SHOP/admin/api/2020-04/orders.json?status=any
For a schema, download shopify.ts, and refer to it as follows:
--schema="dbcrossbar-ts:shopify.ts#Order"
We do not currently include a default Shopify schema in dbcrossbar
itself, because it's still undergoing significant changes.
Configuration & authentication
The following environment variables are required:
SHOPIFY_AUTH_TOKEN
: The Shopify authorization token to use. (We don't yet support password authentication, but it would be easy enough to add.)
Supported features
shopify features:
- cp FROM:
This driver is UNSTABLE and may change without warning.
Schema drivers
dbcrossbar
allows you to specify a table's column names and types in a number of different ways. You can use Postgres CREATE TABLE
statements, or BigQuery schema JSON, or dbcrossbar
's internal schema format.
These schema formats are typically used in one of two ways:
- As a --schema argument to the cp subcommand:
dbcrossbar cp \
--if-exists=overwrite \
--schema=postgres-sql:my_table.sql \
csv:my_table.csv \
'postgres://postgres@127.0.0.1:5432/postgres#my_table'
- As an argument to the schema conv subcommand, which allows you to convert between different schema formats:
dbcrossbar schema conv postgres-sql:table.sql bigquery-schema:table.json
Postgres CREATE TABLE
statements
To specify the column names and types for a table in SQL format, use:
--schema postgres-sql:my_table.sql
The file my_table.sql
can contain a single CREATE TABLE
statement using a subset of PostgreSQL's syntax:
CREATE TABLE my_table (
id INT NOT NULL,
name TEXT NOT NULL,
quantity INT NOT NULL
);
Limitations
This schema format only supports singly-nested array types, and it doesn't support structure types at all.
BigQuery JSON schemas
To specify the column names and types for a table in BigQuery JSON format, use:
--schema bigquery-schema:my_table.json
The file my_table.json
should be a BigQuery JSON schema file:
[
{
"name": "id",
"type": "INT64",
"mode": "REQUIRED"
},
{
"name": "name",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "quantity",
"type": "INT64",
"mode": "REQUIRED"
}
]
Limitations
This schema format supports a small number of general types. For example, all integer types are represented as INT64
, all floating-point types are represented as FLOAT64
, and both JSON values and UUIDs are represented as STRING
.
Native dbcrossbar
schemas
dbcrossbar
supports a native schema format that exactly represents all types supported by dbcrossbar
. It can be used as follows:
--schema dbcrossbar-schema:my_table.json
For more details and examples, see the chapter on portable table schemas.
Typical uses
This format is cumbersome to edit by hand, but it is fairly useful in a number of circumstances:
- Specifying column types that can't be exactly represented by other schema formats.
- Reading or editing schemas using scripts, as in the sketch below.
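For example, a script could dump an existing PostgreSQL CREATE TABLE statement as a native schema, modify the JSON, and feed it back in:
dbcrossbar schema conv postgres-sql:my_table.sql dbcrossbar-schema:my_table.json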
TypeScript schemas (UNSTABLE)
WARNING: This is highly experimental and subject to change. To use it, you must enable it using the --enable-unstable
flag.
To specify the column names and types for table using a subset of TypeScript, use:
--schema "dbcrossbar-ts:my_table.ts#MyTable"
The file my_table.ts
can contain one or more interface
definitions:
interface MyTable {
id: string,
name: string,
quantity: number,
}
"Magic" types
Certain dbcrossbar
types can be specified by adding the following declarations to a TypeScript file:
// Decimal numbers which can exactly represent
// currency values with no rounding.
type decimal = number | string;
// Integers of various sizes.
type int16 = number | string;
type int32 = number | string;
type int64 = number | string;
These may then be used as follows:
interface OrderItem {
id: int64,
sku: string,
unit_price: decimal,
quantity: int16,
}
When the TypeScript schema is converted to a portable dbcrossbar
schema, the "magic" types will be replaced with the corresponding portable type.
Advanced features
We also support nullable values, arrays and nested structures:
type decimal = number | string;
type int16 = number | string;
type int32 = number | string;
type int64 = number | string;
interface Order {
id: int64,
line_items: OrderItem[],
note: string | null,
}
interface OrderItem {
id: int64,
sku: string,
unit_price: decimal,
quantity: int16,
}
Nested arrays and structs will translate to appropriate database-specific types, such as BigQuery ARRAY
and STRUCT
types.
Limitations
This schema format has a number of limitations:
- There's no way to convert other schema formats into this one (yet).
- Some portable
dbcrossbar
types can't be represented in this format. - Only a small subset of TypeScript is supported (but we try to give good error messages).
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
[0.5.8-pre.3] - 2024-11-01
Fixed
bigquery-test-fixture
: Retry 403 errors from Google API rate limiter.
[0.5.8-pre.2] - 2024-10-29
Fixed
bigquery-test-fixture:
: Retry 503 errors affecting CREATE VIEW calls.
[0.5.8-pre.1] - 2024-10-22
Added
- Honor
AWS_ENDPOINT
even if awscli
does not support it. This is needed for use with some S3 replacements.
Changed
- redshift: Text columns are now represented as
VARCHAR(MAX)
, which has unlimited size, instead of TEXT
, which is apparently limited to 255 bytes. This may be a breaking change for some use cases, but RedShift is our least supported driver and we have minimal test infrastructure.
Security
- Library updates for several advisories.
[0.5.7] - 2024-01-26
Added
- gcloud: Allow passing
extra_scopes
to Google OAuth2.
Security
- Fix RUSTSEC-2024-0003, a Denial of Service vulnerability which could apparently allow a hostile HTTP server to cause out-of-memory errors
in a client. Since
dbcrossbar
is unlikely to connect to many hostile HTTP servers, this is probably a low-priority fix for almost all users.
[0.5.6] - 2024-01-08
Added
- bigquery: Users can now specify an alternate job project, for example for billing purposes. See docs for details.
[0.5.5] - 2023-09-14
Added
- Update to latest
opinionated_telemetry
. This should not affect users.
[0.5.4] - 2023-09-08
Added
- Output in JSON Lines format.
- Experimental new support for Prometheus and Google Cloud Trace via
opinionated_telemetry
.
Removed
dbcrossbarlib
is no longer provided separately. As before, we recommend calling out to dbcrossbar
using the CLI, rather than trying to link against its relatively complex dependencies.
[0.5.3] - 2023-08-01
Changed
- The
csv:
driver has been superseded by a more generic file:
driver. The old CSV driver will continue to work, perhaps indefinitely, but it may be increasingly hidden. By default, the file:
driver determines the file format using the --from-format
command line argument, or the file extension, falling back to CSV format as a last resort. I think this should be fully backwards-compatible. But if you encounter breaking changes, try adding --from-format=csv
to your command line.
Fixed
- Autogenerated
--help
summaries in the docs are now up to date.
- Fixed spelling of
geo_json
in help. Thanks, Tom!
- Fixed handling of empty arrays in
bigquery-test-fixture:
. Thanks, Dave!
[0.5.2] - 2022-11-03
Fixed
- Fix CI build and binary releases broken by new warnings in latest Rust update.
[0.5.1] - 2022-11-03
Fixed
- bigquery-test-fixture: Add support for single-column inputs. Many thanks to Dave Shirley!
- bigquery-test-fixture: Allow larger outputs before switching to fallback mode. Thanks again to Dave Shirley!
[0.5.0] - 2022-10-22
Added
- New
bigquery-test-fixture:
schema creates small, read-only "tables" very quickly. Many thanks to Dave Shirley for the proof of concept and for pairing on the final implementation.
Fixed
- Updated a few dependencies with (likely non-exploitable) security advisories. We have not updated all our dependencies, because there are at least two regressions affecting supposedly semver-compatible dependencies. This is actually very rare in the Rust ecosystem, but we were unlucky this time.
[0.5.0-beta.4] - 2022-01-01
Fixed
- gs: Retry
delete
operations after 403 Forbidden errors. This appears to be part of the same hypothesized permissions race that affected extract
in v0.5.0-beta.3.
[0.5.0-beta.3] - 2021-12-31
Fixed
- bigquery: Retry
extract
operations after accessDenied
errors. These errors sometimes occur even when BigQuery should have the necessary access permissions. This may be caused by a permissions race condition somewhere in Google Cloud?
[0.5.0-beta.2] - 2021-12-19
Changed
- The format for printing backtraces has changed. We now use the standard
anyhow
format.
- Our logging format has changed considerably, thanks to switching from
slog
to the much more popular tracing
. Some logging of complex async streams may be less well-organized, but there's more of it, especially at RUST_LOG=dbcrossbarlib=trace,dbcrossbar=trace,warn
level.
Fixed
- We no longer rely on the unmaintained
failure
, or the less popular slog
ecosystem.
Removed
- The
--log-format
and --log-extra
options have been removed.
[0.5.0-beta.1] - 2021-12-15
Added
- We provide MacOS X binaries for the new M1 processors. These are unsigned, like our existing Apple Intel binaries. So you'll need to continue to use
xattr -d com.apple.quarantine dbcrossbar
or a similar technique to run them. Or you could build your own binaries.
Changed
- Our downloadable
*.zip
files follow a new naming convention, allowing us to distinguish between Intel and M1 Macs.
- OpenSSL has been completely removed from
dbcrossbar
, allowing us to support more platforms. This will also allow us to eventually centralize TLS configuration across all dbcrossbar
drivers.
Removed
- We no longer support hosted Citus from Citus Data, because their TLS certificates do not include
subjectAltName
, which is required by the rustls
library. Citus Data will be shutting down shortly, so we recommend keeping around an older dbcrossbar
for a few more weeks if you need to talk to them.
[0.5.0-alpha.3] - 2021-12-14
Added
- gs: Allow single-file CSV output. This involves copying out of Google Cloud Storage, concatenating, and copying back. But it's handy when you need it.
Changed
- gs: Copying to a
gs://bucket/dir/
URL with --if-exists=overwrite --display-output-locators
will print out a list of files in the destination bucket directory when we're done. Before, it just printed out the destination locator, which was technically allowed, but useless.
Fixed
- Updated many dependencies, fixing several CVEs (none known to be meaningfully exploitable in typical use cases), and possibly some library bugs.
[0.5.0-alpha.2] - 2021-04-27
Fixed
- bigml: Always map columns with a
one-of
type (aka CREATE ENUM
) to BigML categorical
columns.
[0.5.0-alpha.1] - 2021-03-04
This release contains a breaking change to the dbcrossbar-schema
output format to enable supporting named types and enumeration types. See below.
Added
-
(EXPERIMENTAL) postgres: The
postgres-sql
and postgres
drivers nowCREATE TYPE
statements (but only in the"public"
schema). These can be used as follows:CREATE TYPE "format" AS ENUM ('gif', 'jpeg'); CREATE TABLE "images" ( "id" uuid NOT NULL, "url" text NOT NULL, "image_format" "format", "metadata" jsonb );
This change also requires some changes to the
dbcrossbar-schema
format, which are described below. -
(EXPERIMENTAL) The native
dbcrossbar-schema
format now supports a set ofnamed_types
definitions. This allows named types to be defined once, and to then be referred to elsewhere using{ "named": "my_custom_type" }
. -
(EXPERIMENTAL) The native
dbcrossbar-schema
format also supports string enumeration types using a{ "one_of": ["red", "green", "blue"] }
syntax.
Changed
- BREAKING: The
dbcrossbar-schema
output format has changed! It now has top levelnamed_types
and tables
members, and the old top-level table definition is now available as.tables[0]
. See the manual for more details. However,dbcrossbar
can still read the old input format with no problems, so this only affects other programs that parse nativedbcrossbar
schema.
Fixed
- The suggested fixes for RUSTSEC-2020-0146, RUSTSEC-2021-0020 and RUSTSEC-2021-0023 have been applied.
0.4.2-beta.11 - 2021-02-03
Fixed
- gcloud: Retry OAuth2 failures for service accounts.
0.4.2-beta.10 - 2021-02-02
Fixed
- Build fixes for recently-added clippy warnings.
0.4.2-beta.9 - 2021-01-14
Changed
- Update many dependencies, including
tokio
and our many network-related libraries. Tests pass, but this affects almost everything, in one fashion or another.
0.4.2-beta.8 - 2020-10-16
Fixed
- Linux: Fix Linux binary builds by updating to latest
rust-musl-builder
release, which has the new cargo-deny
.
0.4.2-beta.7 - 2020-10-14
Added
- shopify: Added a "partner" argument which can be used to include a "-- partner:" comment in all generated RedShift SQL for use by RedShift partners.
0.4.2-beta.6 - 2020-09-15
Fixed
- shopify: Retry failed downloads a few times. We've been seeing some intermittent failures.
0.4.2-beta.5 - 2020-08-01
Fixed
- gcloud: We now print more useful error messages when Google doesn't send JSON-formatted errors.
- gcloud: We now retry Google Cloud GET requests automatically a few times if it looks like it might help. We'd also love to retry POST requests, but that will require the ability to try to restart streams.
0.4.2-beta.4 - 2020-07-07
Changed
- Update dependencies. The latest
bigml
release contains tweaks to error retry behavior.
0.4.2-beta.3 - 2020-07-07
Changed
- postgres: Our last
diesel
code has been removed, and replaced with tokio-postgres
(which we use elsewhere).
Fixed
- postgres: Fixed #148 to improve support for PostGIS under PostgreSQL 12.
Removed
- The experimental
citus
-related APIs have been removed from dbcrossbarlib
, because they used diesel
. This is technically a breaking change for dbcrossbarlib
, but we don't claim to honor semver for dbcrossbarlib
0.x.y releases.
0.4.2-beta.2 - 2020-06-28
Added
- redshift: Support
--if-exists=upsert-on:key1,key2
.
- redshift: Enable
--if-exists=error
.
Changed
- postgres: Temporary tables now use the same schema (i.e. namespace) as the tables they're linked to. This shouldn't be a breaking change unless you've set up your database permissions to forbid it.
Fixed
- postgres: Fixed likely bug upserting into tables with a non-"public" schema.
- postgres: Verify that upsert columns are NOT NULL to prevent possible incorrect upserts. This may be a breaking change, but it also prevents a possible bug.
0.4.2-beta.1 - 2020-06-23
Changed
- Mac: Move configuration directory from
~/Library/Preferences/dbcrossbar
to ~/Library/Application Support/dbcrossbar
. If we detect a config directory in the old location, we should print a deprecation warning and use it.
- Many dependencies have been updated.
Fixed
- We should now handle multiple sets of Google Cloud OAuth2 credentials correctly.
0.4.1 - 2020-06-16
A bug fix to gs
, and other minor improvements.
Changed
- Replace deprecated
tempdir
with tempfile
.
Fixed
- gs: Correctly pass
page_token
when listing. This prevents an infinite loop in large directories.
- Fix new Rust 1.44.0 warnings.
0.4.0 - 2020-06-02
This is a summary of all the major changes since the 0.3.3 release. For more details and minor changes, see the individual CHANGELOG entries for the 0.4.0 prereleases.
Added
- dbcrossbar now supports "struct" types, which have a fixed set of named fields. These will be automatically translated to BigQuery STRUCT types or to JSON columns, depending on the destination database.
- We now support a CLI-editable config file using commands like dbcrossbar config add temporary s3://example/temp/.
- Parsing-related error messages should now include context.
- bigquery: Users can now specify billing labels for jobs.
- dbcrossbar license will display the licences for all dependencies.
- Unstable features can now be hidden behind the --enable-unstable flag, including two new drivers:
- UNSTABLE: We now support specifying schemas using a subset of TypeScript.
- UNSTABLE: We now support reading data from Shopify's REST API. This is a testbed for new struct and JSON-related features.
Changed
dbcrossbar conv
is nowdbcrossbar schema conv
.
- Because of the new STRUCT support, some corner cases involving struct types and JSON may have changed subtly.
- We replaced
gcloud auth
,gsutil
andbq
with native Rust. This simplifies installation and configuration substantially, and fixes a number of BigQuery-related issues. - AWS credentials must now always be passed via
AWS_ACCESS_KEY_ID
,AWS_SECRET_ACCESS_KEY
,AWS_SESSION_TOKEN
(optional) andAWS_DEFAULT_REGION
(required). This lays the groundwork for replacing theaws
CLI tool with native Rust code, so that we will someday be able to remove our last CLI dependency.
Fixed
- Lots of issues.
Removed
- The data type
{ "other": string }
has been removed from the portable schema format. It was not actually generated by any of our drivers. - bigquery: We now export
ARRAY<STRUCT<...>>
as{ "array": { "struct": ... } }
, instead of exporting it as as single"json"
value.
0.4.0-rc.2 - 2020-06-01
Changed
- postgres: We now transform the portable types
{ "array": "json" }
and{ "array": { "struct": fields } }
into jsonb[]
, instead of automatically changing it to plain jsonb
in an effort to help our users.
0.4.0-rc.1 - 2020-05-31
This is a release candidate for v0.4.0. If no issues are discovered, this will be published as 0.4.0.
This release contains a last few breaking changes that we want to include before we publicize dbcrossbar
more widely. When migrating, pay particular attention to the conv
subcommand and AWS_DEFAULT_REGION
below, which have significant breaking changes.
Changed
- Rename
dbcrossbar conv
to dbcrossbar schema conv
.
- s3: Require
AWS_DEFAULT_REGION
instead of optionally using AWS_REGION
. This is more compatible with the aws
CLI command, and it doesn't rely on undocumented region defaults or aws
configuration files.
Added
- Document our portable schema format.
- Document schema-only drivers.
- Improve the documentation in other minor ways.
Removed
- Remove
DataType::Other(String)
, which was not actually used by any of our drivers.
0.4.0-beta.1 - 2020-05-28
We're finally ready to start preparing for an 0.4.0 release! This beta will be deployed to several production systems to help verify that there are no surprising regressions.
Changed
- gs: We now verify CRC32C checksums when uploading.
- gs: We specify
isGenerationMatch
on many operations to make sure that nothing has been created or overridden that we didn't expect.
0.4.0-alpha.7 - 2020-05-26
This release adds support for labeling BigQuery jobs.
Added
- bigquery: Optionally specify billing labels for jobs. See the manual for details.
- Allow driver argument names to be specified as either
x.y
or x[y]
, interchangeably. This makes job_labels
look nicer.
- Hide URL passwords from (most) logs using a dedicated wrapper type.
Changed
- We now have test cases that make sure we catch duplicate driver arguments and raise an error.
- redshift: Authentication argument names may no longer include
-
characters. I'm not even sure whether these are valid, but they won't work with the new scheme for parsing driver arguments.
- DriverArguments::from_cli_args
now takes an iterator instead of a slice.
0.4.0-alpha.6 - 2020-05-22
This release improves the example shopify.ts
schema, and adds new features to dbcrossbar-ts
to parse it.
Added
- dbcrossbar-ts:
- Parse
/* */
comments. - Allow
Date
to be used as a type. This requires the date to be a string in ISO 8601 format, including a time zone. - Allow
decimal
,int16
,int32
andint64
to be defined as any ofnumber
,string
,number | string
orstring | number
. This allows the schema to more accurately represent what appears on the wire. It allowsdecimal
values to be represented as a mix of floats and strings, which is seen in Shopify.
- Parse
- postgres-sql: Use new format for parse errors.
Fixed
- shopify: The example
shopify.ts
schema has been updated to useDate
andint64
in many places.Address
andCustomerAddress
are now distinct types, and several other minor issues have been fixed.
0.4.0-alpha.5 - 2020-05-21
Added
- BigQuery: Support
--if-exists=error
.
Changed
- Require
--enable-unstable
to usedbcrossbar-ts
orshopify
locators, which are unstable. - AWS credentials must now always be passed via
AWS_ACCESS_KEY_ID
,AWS_SECRET_ACCESS_KEY
,AWS_SESSION_TOKEN
(optional) andAWS_REGION
(optional). This lays the groundwork for replacing theaws
CLI tool with native Rust code, so that we will someday be able to remove our last CLI dependency.
0.4.0-alpha.4 - 2020-05-19
Added
- BigQuery now imports and exports decimal (aka NUMERIC) values everywhere.
- The
dbcrossbar-ts
driver now supports magic type aliases that will convert to the correspondingdbcrossbar
types:type decimal = string;
type int16 = number | string;
type int32 = number | string;
type int64 = number | string;
Changed
- The sample
shopify.ts
schema now usesdecimal
instead ofstring
when appropriate. It does not attempt to useint64
yet.
0.4.0-alpha.3 - 2020-05-19 [YANKED]
This release was yanked because it was missing several things it should have included.
0.4.0-alpha.2 - 2020-05-19
This is a significant release, with support for "struct" types.
Added
- The portable schema now supports a
DataType::Struct(fields)
type that can be used to represent BigQuery STRUCT values (as long as they have unique, named fields) and JSON objects with known keys. - The BigQuery driver now supports importing and exporting
STRUCT
fields using the newDataType::Struct(fields)
type. - EXPERIMENTAL: Schemas can now be specified using the
dbcrossbar-ts
driver, which supports subset of TypeScript type declarations. This is useful for specifying complex, nested structs. This can be used as--schema="dbcrossbar-ts:shopify.ts#Order"
, whereOrder
is the name of the type within the*.ts
file to use as the table's type. - EXPERIMENTAL: We now support a Shopify input driver that uses the Shopify REST API. See the manual for details.
- We now have support for fancy parser error messages, which we use with the
dbcrossbar-ts
parser. - We now support a CLI-editable config file using commands like
dbcrossbar config add temporary s3://example/temp/
.
Changed
- BREAKING: Some corner cases involving struct types and JSON may have changed subtly.
- We've upgraded to the latest
rust-peg
parser syntax everywhere.
Fixed
--if-exists=overwrite
now overwrites when writing to local files (instead of appending).- We automatically create
~/.local/share
if it does not exist. - More
clippy
warnings have been fixed, and unsafe code has been forbidden. - Various obsolete casting libraries have been removed.
0.4.0-alpha.1 - 2020-04-07
Changed
- Replace
gcloud auth
,gsutil
andbq
with native Rust. This changes how we authenticate to Google Cloud. In particular, we now supportGCLOUD_CLIENT_SECRET
,~/.config/dbcrossbar/gcloud_client_secret.json
,GCLOUD_SERVICE_ACCOUNT_KEY
or~/.config/dbcrossbar/gcloud_service_account_key.json
, as explained in the manual. We no longer usegcloud auth
, and the Google Cloud SDK tools are no longer required. In the current alpha version, uploads and deletions are probably slower than before.
Fixed
- gs: Avoid download stalls when backpressure is applied (#103).
- bigquery: Display error messages more reliably (#110).
- bigquery: Detect "`" quotes in the CLI form of table names, and report an error.
0.3.3 - 2020-03-30
Added
- BigML: Honor BIGML_DOMAIN, allowing the user to point the BigML driver to a custom VPC instance of BigML.
0.3.2 - 2020-03-30
Fixed
- Correctly quote BigQuery column names again (which regressed in 0.3.0), and added test cases to prevent further regressions.
- Fix an error that caused
bigquery_upsert
test to fail.
0.3.1 - 2020-03-29
Added
- Write a new manual!
Changed
- Encapsulate all calls to
bq
andgsutil
- Improve performance of
--stream-size
Fixed
- BigQuery: Honor NOT NULL on import (fixes #45)
0.3.0 - 2020-03-26
Added
- Use
cargo deny
to enforce license and duplicate dependency policies - Add notes about license and contribution policies
Changed
- Update to tokio 0.2 and the latest stable Rust
- Replace
wkb
withpostgis
for licensing reasons - BigML: Fail immediately if no S3 temporary bucket provided (fixes #101)
Fixed
- BigQuery: Handle mixed-case column names using BigQuery semantics (fixes #84)
- PostgreSQL: Fix upserts with mixed-case column names
- BigQuery: Correctly output NULL values in Boolean columns (#104)
Removed
- BREAKING: BigQuery: Remove code that tried to rename column names to make them valid (fixes #84)
Credits and contributors
The development of dbcrossbar
has been generously supported by Faraday, which provides data science and AI products for B2C companies.
Ongoing development of dbcrossbar
is also supported by Kidd Software LLC, which creates custom software for businesses around the world.
Contributors
dbcrossbar
is primarily maintained by Eric Kidd. Other contributors include:
- Bill Morris
- Forrest Wallace
- Prithaj Nath
- Seamus Abshere
We have also received very helpful bug reports and feature requests from our users. Thank you!