Tumblelog by Soup.io
Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

January 28 2011

16:44

Ruby screen scraping tutorials

Mark Chapman has been busy translating our Python web scraping tutorials into Ruby.

They now cover three tutorials on how to write basic screen scrapers, plus extra ones on using .ASPX pages, Excel files and CSV files.

We’ve also installed some extra Ruby modules – spreadsheet and FastCSV – to make them possible.

These Ruby scraping tutorials are made using ScraperWiki, so you can of course do them from your browser without installing anything.

Thanks Mark!


June 29 2010

23:01

What's the fastest way to compare two large CSVs against each other?

So here's the nut. Imagine you've got two huge CSV files, snapshots from the same database taken on different days.

They share a common identifier, and always have the same set of fields. But there are amendments, omissions and additions made between snapshots that can only be detected by comparing records against each other. What's the fastest way to loop through a fresh snapshot and compare it against the previous snapshot for changes, additions and omissions?

Below is a roughed out Python routine I've written with a fake data set. Basically, it sets the unique ID as the key to a dictionary that contains what amounts to a CSV DictReader pull from two imaginary CSV files: one with the older "existing" snapshot; and another with the newer "fresh" snapshot.

It seems to work okay in testing, but when you run it over a large data set, it goes pretty darn slow. I'm curious whether anyone here knows a way I can make it run quicker.

existing_dict = {
    'A1': {'name': 'John', 'gender': 'M', 'value': 10},
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 10},
}

fresh_dict = {
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 5},
    'A7': {'name': 'Jessica', 'gender': 'F', 'value': 10},
}

def compare():
    """
    Compares two data dicts against each other.
    """
    # Set some counters to report outcome
    nochanges, amendments, omissions = 0,0,0

    # Loop through the existing data...
    for id_, existing_data in existing_dict.items():
        # Try to find the corresponding record in the fresh data
        fresh_data = fresh_dict.get(id_, None)
        # If it's there...
        if fresh_data:
            # Determine if there are any changes
            if is_diff(existing_data, fresh_data):
                amendments += 1
            else:
                nochanges += 1
            del fresh_dict[id_]
        else:
            omissions += 1
    additions = len(fresh_dict.keys())
    return nochanges, amendments, omissions, additions

def is_diff(existing_row, fresh_row):
    change_list = [field for field in existing_row.keys()
        if existing_row.get(field) != fresh_row.get(field)]
    if change_list:
        return True
    return False

if __name__ == '__main__':
    print "No change:%s; Amendments:%s; Omissions:%s; Additions:%s;" % compare()

That's pretty much it. Here's what the imaginary CSV files might look like, if it helps.

existing.csv

id,name,gender,value
A1,John,M,10
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,10

fresh.csv

id,name,gender,value
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,5
A7,Jessica,F,10

April 20 2010

19:43

Joomla to Mediawiki help

My programming skills are near nonexistent, so I need some serious help here.

I have some pages (about 150) in a Joomla install that I want to open up to collaborative editing. I figure the best way to do this is to put the content of those Joomla pages into a wiki. Since both are based on MySQL databases, I figure exporting, converting and importing is my best bet.

(I'm figuring I'll do the export and import using CSV files and phpMyAdmin, because I know how to do that and learning the database commands seems more difficult than it's worth here. Ditto for massive regexing, plus the fact that I don't know all the tags that might be lurking in the content.)

Perl seems to have the best combination of capabilities, but the CSV import extension returns an array of hashes. I know that data types can nest somewhat in Perl, but I have no idea how to traverse the hashes in the array to get the two fields out of the Joomla database export. Concatenating the fields should be simple, and the HTML-to-Wiki conversion extension looks reasonably easy. But again, I don't know to create the CSV to import into the Wiki database.

PHP has a function to read CSV a line at a time, making the variable much less complex, but I can't find a converter from HTML to MediaWiki.

Python has been a bust so far as finding anything.

So, any direction? Things I have missed or not considered?

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!

Schweinderl