What's the fastest way to compare two large CSVs against each other?

So here's the nut. Imagine you've got two huge CSV files, snapshots from the same database taken on different days.

They share a common identifier, and always have the same set of fields. But there are amendments, omissions and additions made between snapshots that can only be detected by comparing records against each other. What's the fastest way to loop through a fresh snapshot and compare it against the previous snapshot for changes, additions and omissions?

Below is a rough Python routine I've written with a fake data set. It uses the unique ID as the key to a dictionary holding what amounts to a csv.DictReader pull from two imaginary CSV files: one for the older "existing" snapshot, another for the newer "fresh" snapshot.

It seems to work okay in testing, but when you run it over a large data set, it goes pretty darn slow. I'm curious whether anyone here knows a way I can make it run quicker.

existing_dict = {
    'A1': {'name': 'John', 'gender': 'M', 'value': 10},
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 10},
}

fresh_dict = {
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 5},
    'A7': {'name': 'Jessica', 'gender': 'F', 'value': 10},
}

def compare():
    """
    Compares the two snapshot dicts and tallies the differences.
    """
    # Set some counters to report the outcome
    nochanges, amendments, omissions = 0, 0, 0

    # Loop through the existing data...
    for id_, existing_data in existing_dict.items():
        # Try to find the corresponding record in the fresh data
        fresh_data = fresh_dict.get(id_)
        # If it's there...
        if fresh_data is not None:
            # Determine whether any field changed
            if is_diff(existing_data, fresh_data):
                amendments += 1
            else:
                nochanges += 1
        else:
            omissions += 1
    # Fresh records with no existing counterpart are additions.
    # Counting them here, rather than deleting matched keys as we go,
    # avoids destroying fresh_dict as a side effect.
    additions = sum(1 for id_ in fresh_dict if id_ not in existing_dict)
    return nochanges, amendments, omissions, additions

def is_diff(existing_row, fresh_row):
    # The rows differ if any field's value differs between snapshots
    return any(existing_row.get(field) != fresh_row.get(field)
               for field in existing_row)

if __name__ == '__main__':
    print("No change: %s; Amendments: %s; Omissions: %s; Additions: %s" % compare())
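One idea I've toyed with, though I haven't benchmarked it so treat it as a sketch: let set arithmetic partition the IDs up front, then compare only the rows both snapshots share. Since Python dicts compare by value with ==, the per-field loop isn't strictly needed either:

```python
def compare_sets(existing, fresh):
    """Partition the ids with set arithmetic, then diff the shared rows."""
    existing_ids = set(existing)
    fresh_ids = set(fresh)
    omissions = len(existing_ids - fresh_ids)   # only in the old snapshot
    additions = len(fresh_ids - existing_ids)   # only in the new snapshot
    common = existing_ids & fresh_ids
    # Dicts compare by value, so row inequality is a one-liner
    amendments = sum(1 for id_ in common if existing[id_] != fresh[id_])
    nochanges = len(common) - amendments
    return nochanges, amendments, omissions, additions
```

On the sample data above this gives the same tallies as compare() — 4 unchanged, 1 amendment (A6), 1 omission (A1), 1 addition (A7) — and it leaves fresh_dict intact.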

That's pretty much it. Here's what the imaginary CSV files might look like, if it helps.

existing.csv

id,name,gender,value
A1,John,M,10
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,10

fresh.csv

id,name,gender,value
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,5
A7,Jessica,F,10
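For completeness, this is roughly how I'd load files like those into the dicts above with csv.DictReader (the paths are placeholders). The id stays inside each row here, unlike my hand-typed dicts, but since it's identical on both sides it doesn't affect the diff. Note that DictReader keeps every field as a string ('10' rather than 10), which is fine for the comparison as long as both snapshots are loaded the same way:

```python
import csv

def load_snapshot(path):
    """Read a CSV snapshot into a dict keyed on its 'id' column."""
    with open(path, newline='') as f:
        return {row['id']: row for row in csv.DictReader(f)}

# e.g.
# existing_dict = load_snapshot('existing.csv')
# fresh_dict = load_snapshot('fresh.csv')
```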
