How it works

dbcrossbar uses pluggable input and output drivers, allowing any input to be copied to any output:

[Diagram: inputs (CSV, PostgreSQL, BigQuery, S3, ...) flow into dbcrossbar, which copies them out to any supported output (CSV, PostgreSQL, BigQuery, S3, ...).]

Parallel data streams

Internally, dbcrossbar uses parallel data streams. If we copy s3://example/ to csv:out/ using --max-streams=4, it will run up to 4 copies in parallel:

[Diagram: four parallel streams, copying s3://example/file_1.csv through s3://example/file_4.csv to csv:out/file_1.csv through csv:out/file_4.csv.]

As soon as one stream finishes, a new one will be started:

[Diagram: a fifth stream, copying s3://example/file_5.csv to csv:out/file_5.csv, starts as soon as an earlier stream finishes.]
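The scheduling above can be sketched with a plain thread pool: a fixed number of workers pull file names from a shared queue, so a new copy starts the moment a worker becomes free. This is only an illustration (the file names are hypothetical, and dbcrossbar itself uses async streams rather than a thread pool):

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Take the next pending file, releasing the lock before returning.
fn next_job(rx: &Mutex<mpsc::Receiver<String>>) -> Option<String> {
    rx.lock().unwrap().recv().ok()
}

/// "Copy" `files` using at most `max_streams` concurrent workers,
/// returning the file names in completion order.
fn copy_in_parallel(files: Vec<String>, max_streams: usize) -> Vec<String> {
    let (work_tx, work_rx) = mpsc::channel::<String>();
    for f in files {
        work_tx.send(f).unwrap();
    }
    drop(work_tx); // Close the queue so workers exit once it is drained.

    let work_rx = Arc::new(Mutex::new(work_rx));
    let (done_tx, done_rx) = mpsc::channel::<String>();
    let handles: Vec<_> = (0..max_streams)
        .map(|_| {
            let work_rx = Arc::clone(&work_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || {
                // As soon as one copy finishes, this worker immediately
                // picks up the next pending file, as with --max-streams.
                while let Some(file) = next_job(&work_rx) {
                    // A real driver would stream bytes here.
                    done_tx.send(file).unwrap();
                }
            })
        })
        .collect();
    drop(done_tx);
    for h in handles {
        h.join().unwrap();
    }
    done_rx.iter().collect()
}

fn main() {
    let files: Vec<String> = (1..=5).map(|i| format!("file_{}.csv", i)).collect();
    let copied = copy_in_parallel(files, 4);
    println!("copied {} files", copied.len());
}
```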

dbcrossbar accomplishes this using a stream of CSV streams. This allows us to make extensive use of backpressure to control how data flows through the system, eliminating the need for temporary files. This makes it easier to work with 100GB+ CSV files and 1TB+ datasets.
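The backpressure idea can be shown with a bounded channel: the producer blocks once the buffer is full, so a slow consumer automatically throttles the reader and no data needs to be spilled to temporary files. A minimal sketch, assuming hypothetical row data (dbcrossbar's real implementation uses async streams of CSV streams, not threads):

```rust
use std::sync::mpsc;
use std::thread;

/// Stream rows through a bounded channel of size `buffer`.
/// `send` blocks when the buffer is full, pausing the producer
/// until the consumer catches up -- that blocking *is* backpressure.
fn stream_csv_rows(rows: Vec<String>, buffer: usize) -> Vec<String> {
    let (tx, rx) = mpsc::sync_channel::<String>(buffer);
    let producer = thread::spawn(move || {
        for row in rows {
            // Blocks once `buffer` rows are in flight.
            tx.send(row).unwrap();
        }
    });
    let consumed: Vec<String> = rx.iter().collect();
    producer.join().unwrap();
    consumed
}

fn main() {
    let rows: Vec<String> = (0..10).map(|i| format!("row {}", i)).collect();
    let out = stream_csv_rows(rows, 2);
    println!("consumed {} rows", out.len());
}
```

Because the buffer here is only 2 rows deep, memory use stays constant no matter how large the input is, which is the same property that lets dbcrossbar handle 100GB+ files.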

Shortcuts

When copying between certain drivers, dbcrossbar supports "shortcuts." For example, it can load data directly from Google Cloud Storage into BigQuery.

Multi-threaded, asynchronous Rust

dbcrossbar is written using asynchronous Rust, and it makes heavy use of a multi-threaded worker pool. Internally, it works something like a set of classic Unix pipelines running in parallel. Thanks to Rust, it has been possible to get native performance and multithreading without spending too much time debugging.
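The "Unix pipelines" analogy can be made concrete: each stage runs on its own thread and hands records to the next stage over a channel, much like `grep | sort` in a shell. This is a simplified sketch with made-up stages (filter, then uppercase), not dbcrossbar's actual pipeline:

```rust
use std::sync::mpsc;
use std::thread;

/// Two pipeline stages connected by bounded channels, each on its
/// own thread, analogous to a Unix shell pipeline.
fn run_pipeline(input: Vec<String>) -> Vec<String> {
    let (tx1, rx1) = mpsc::sync_channel::<String>(16);
    let (tx2, rx2) = mpsc::sync_channel::<String>(16);

    // Stage 1: drop comment lines (stands in for parsing/filtering).
    let stage1 = thread::spawn(move || {
        for line in input {
            if !line.starts_with('#') {
                tx1.send(line).unwrap();
            }
        }
    });
    // Stage 2: normalize to uppercase (stands in for a transform).
    let stage2 = thread::spawn(move || {
        for line in rx1 {
            tx2.send(line.to_uppercase()).unwrap();
        }
    });

    let out: Vec<String> = rx2.iter().collect();
    stage1.join().unwrap();
    stage2.join().unwrap();
    out
}

fn main() {
    let input = vec!["# header".to_string(), "a,1".to_string(), "b,2".to_string()];
    println!("{:?}", run_pipeline(input));
}
```

Because the channels are bounded, every stage is subject to backpressure, and several such pipelines can run side by side on a worker pool.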