Tech Overview

September 7, 2012

This overview documents the architecture of the project.

The project has three independent modules: a scrubbing engine, a web interface, and a statistical tool.

Scrubbing engine:

This is the hardest part of the project. Ruby/Mechanize is so far the best tool I have found for scrubbing a web page, and Ruby/MySQL can talk to a back-end data storage server. At the beginning, only one server with 3 processes was running, and it could scan ~20,000 receipts in a single day. After the site went online, the number of servers was increased to 4, with 18 concurrent processes; today, more than 100,000 receipts can be scanned in a single day.
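As a rough sanity check on those figures, the per-process throughput can be worked out directly. This is only a back-of-the-envelope sketch: the method name `receipts_per_process` is made up for illustration, and the "current" figure is a lower bound because the post says "more than 100,000".

```ruby
# Receipts scanned per process per day, using the figures quoted in the post.
# `receipts_per_process` is a hypothetical helper, not part of the project.
def receipts_per_process(receipts_per_day, processes)
  raise ArgumentError, "processes must be positive" unless processes.positive?
  (receipts_per_day.to_f / processes).round
end

initial = receipts_per_process(20_000, 3)    # 1 server, 3 processes
current = receipts_per_process(100_000, 18)  # 4 servers, 18 processes (lower bound)

puts "initial: ~#{initial} receipts per process per day"
puts "current: at least ~#{current} receipts per process per day"
```

If these numbers hold, each process handles a broadly similar load before and after the expansion, which suggests the extra capacity came mainly from running more processes rather than from faster ones.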

Statistical tool:

There is good and bad about R. R is a free, easy-to-configure statistics tool that runs in both Windows and Unix environments. It is also my best choice because it generates pretty graphs using “ggplot2”. In the current configuration, the web server sends requests to a completely independent server running only R and retrieves the graphs from it. The R server also communicates with the database server to retrieve the data. This setup is somewhat inefficient (there are also geographic concerns), but it ensures that R plots FAST!
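The post does not say how the web server and the R server talk to each other. Purely as an illustration, and assuming the R server were reachable over HTTP, a plot request could be built like this. The host name, path, plot name, and parameter names are all hypothetical.

```ruby
require "uri"

# Build the URL for a plot request to the R server.
# ASSUMPTION: HTTP transport, a "/plots/<name>.png" path, and date-range
# parameters. None of these come from the post; they are placeholders.
def plot_request_url(host, plot_name, params = {})
  query = params.empty? ? nil : URI.encode_www_form(params)
  URI::HTTP.build(host: host, path: "/plots/#{plot_name}.png", query: query).to_s
end

url = plot_request_url("r-server.example", "receipts_per_day",
                       from: "2012-09-01", to: "2012-09-07")
puts url
```

Keeping the request down to a plot name and a few parameters means the web server never touches the raw data; the R server fetches it from the database itself, as described above.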

Web interface:

Web design is my weakness. PHP/MySQL is the only scripting combination I am able to program in. I also need someone talented at HTML/CSS, maybe HTML5?

One Comment
  1. Huayue

    you mean ‘Scraping’ Engine?

