A library and CLI tool to find differences between a primary and secondary data source

arturom/datadiff

datadiff

Datadiff is a library and CLI tool to find differences between two data sources. This is useful when there is a primary data source and a secondary data source and they both need to contain the same records.

This tool considers two data sources to be equal if they contain the same numeric IDs. It does not compare the values of any other field.
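
As a minimal sketch of this notion of equality, two lists of IDs can be compared as sets. The helper name `diffIDs` is hypothetical and is not part of datadiff's API; it only illustrates what "contain the same numeric IDs" means.

```go
package main

import (
	"fmt"
	"sort"
)

// diffIDs is a hypothetical helper, not part of datadiff's API.
// It returns the IDs found only in primary and the IDs found only in secondary.
func diffIDs(primary, secondary []int64) (onlyPrimary, onlySecondary []int64) {
	inPrimary := make(map[int64]bool, len(primary))
	for _, id := range primary {
		inPrimary[id] = true
	}
	inSecondary := make(map[int64]bool, len(secondary))
	for _, id := range secondary {
		inSecondary[id] = true
	}
	for id := range inPrimary {
		if !inSecondary[id] {
			onlyPrimary = append(onlyPrimary, id)
		}
	}
	for id := range inSecondary {
		if !inPrimary[id] {
			onlySecondary = append(onlySecondary, id)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Slice(onlyPrimary, func(i, j int) bool { return onlyPrimary[i] < onlyPrimary[j] })
	sort.Slice(onlySecondary, func(i, j int) bool { return onlySecondary[i] < onlySecondary[j] })
	return
}

func main() {
	onlyPrimary, onlySecondary := diffIDs([]int64{1, 2, 3, 4}, []int64{2, 3, 5})
	fmt.Println("only in primary:", onlyPrimary)     // only in primary: [1 4]
	fmt.Println("only in secondary:", onlySecondary) // only in secondary: [5]
}
```

The two sources are equal in this sense exactly when both returned slices are empty.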

Strategy

Rather than comparing record by record, this library compares the histograms of the numeric IDs from both sources. These are the steps taken:

  • Create a histogram of the numeric IDs from the primary data source.
  • Create a histogram of the numeric IDs from the secondary data source.
  • Merge and compare the histograms.
  • If a bin is full in both histograms (its count equals the bin size), mark that range as resolved.
  • Fetch histograms of the unresolved bins using a smaller bin size.
  • Merge and compare the histograms.
  • Fetch the IDs of the remaining unresolved bins.
  • Compare the numeric IDs of the unresolved bins and output the results.
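
The histogram comparison in the steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not datadiff's actual code: the `histogram` type and the functions `buildHistogram` and `unresolvedBins` are hypothetical, IDs are assumed to be unique and non-negative, and the IDs are held in memory here, whereas the real tool fetches histograms from the data sources.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram maps the lower bound of each bin to the number of IDs in that bin.
// Illustrative type only; datadiff's own types may differ.
type histogram map[int64]int64

// buildHistogram buckets IDs into bins of width interval.
// Assumes IDs are unique and non-negative.
func buildHistogram(ids []int64, interval int64) histogram {
	h := histogram{}
	for _, id := range ids {
		h[(id/interval)*interval]++
	}
	return h
}

// unresolvedBins returns the lower bounds of bins that need a closer look.
// A bin is treated as resolved only when it is full on both sides
// (count == interval), because then every ID in the range exists in both.
// Bins that are absent from both histograms never appear in the result.
func unresolvedBins(primary, secondary histogram, interval int64) []int64 {
	seen := map[int64]bool{}
	var out []int64
	for _, h := range []histogram{primary, secondary} {
		for start := range h {
			if seen[start] {
				continue
			}
			seen[start] = true
			full := primary[start] == interval && secondary[start] == interval
			if !full {
				out = append(out, start)
			}
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	var primary, secondary []int64
	// IDs 0..9 exist in both sources, so the bin [0, 10) is full on both sides.
	for id := int64(0); id < 10; id++ {
		primary = append(primary, id)
		secondary = append(secondary, id)
	}
	primary = append(primary, 10, 11, 12)
	secondary = append(secondary, 10, 12, 25)

	const interval = 10
	p := buildHistogram(primary, interval)
	s := buildHistogram(secondary, interval)
	fmt.Println("unresolved bins start at:", unresolvedBins(p, s, interval)) // unresolved bins start at: [10 20]
}
```

In this sketch the drill-down would repeat the same two functions on the unresolved ranges with a smaller interval, and fall back to comparing individual IDs once the ranges are small enough.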

Supported Data Sources

  • mysql
  • elasticsearch

Usage

Run datadiff -h to get usage information

$ ./datadiff -h
Usage of ./datadiff:
  -interval int
        Initial histogram interval (default 1000)
  -mconf string
        Primary configuration string (default "{}")
  -mconn string
        Primary connection string
  -mdriver string
        Primary driver [elasticsearch|mysql]
  -sconf string
        Secondary configuration string (default "{}")
  -sconn string
        Secondary connection string
  -sdriver string
        Secondary driver [elasticsearch|mysql]

Sample Command Line Usage

 datadiff -interval 200 \
 -mdriver 'mysql' \
 -mconn 'root:root@(localhost:3306)/my_db_name?charset=utf8' \
 -mconf '{"table_name":"my_table_name", "field_name":"my_id_field_name", "conditions":["`active` = 1", "`user_id` = 100"]}' \
 -sdriver 'elasticsearch' \
 -sconn 'http://localhost:9200' \
 -sconf '{"index":"my_index_name", "type":"my_type_name", "field":"my_id_field_path"}'
mysql://root:root@localhost:3306/dbname?table=tablename&field=id
es://http://localhost:9200?index=indexname&field=id
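
For readers using the JSON configuration strings from Go, the following sketch decodes a value shaped like the sample `-mconf` argument. The struct `mysqlConf` and the function `parseMySQLConf` are hypothetical names chosen for illustration; only the JSON keys (`table_name`, `field_name`, `conditions`) come from the sample above, and datadiff's internal types may differ. The backticks around column names are omitted so the JSON fits in a Go raw string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mysqlConf mirrors the keys used in the sample -mconf value.
// Hypothetical struct; not taken from datadiff's source.
type mysqlConf struct {
	TableName  string   `json:"table_name"`
	FieldName  string   `json:"field_name"`
	Conditions []string `json:"conditions"`
}

// parseMySQLConf decodes a JSON configuration string into mysqlConf.
func parseMySQLConf(raw string) (mysqlConf, error) {
	var c mysqlConf
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	raw := `{"table_name":"my_table_name", "field_name":"my_id_field_name", "conditions":["active = 1", "user_id = 100"]}`
	conf, err := parseMySQLConf(raw)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Println("table:", conf.TableName)
	fmt.Println("field:", conf.FieldName)
	fmt.Println("conditions:", len(conf.Conditions))
}
```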
