One of the requirements for my post-graduate Application Project is to obtain an updated copy of the Philippine Standard Geographic Code. There are many reasons why the PSGC gets updated; you can read all about them in the
PSGC Interactive - Updating Procedures. So, I created a crawler to get the necessary data from
NSCB's PSGC Directory, starting from the
List of Regions down to the detailed listings of Provinces, Cities, Municipalities, and Barangays. My crawl finished at around 2:30 PM (Philippine Time), I think.
The crawler was really not that complicated to build. I made it using Java and some external classes from the
jsoup project. To store the gathered data I used MySQL (naive programmers' favorite choice) with the help of its
connector for Java. The most notable obstacle I encountered while making the crawler was the old-school markup of the pages. Most of them are laid out using <table> tags, so traversing the elements to get to the right data can be a little tricky.
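To give an idea of what that traversal looks like, here is a minimal sketch of fetching one listing page with jsoup and saving its rows through the MySQL connector. The URL, the table selector, the column order, and the database schema below are only placeholders, not the actual structure of the NSCB directory pages.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PsgcTableCrawler {

    public static void main(String[] args) throws Exception {
        // Hypothetical listing page; the real directory URLs and table
        // layout differ, so the selector below is only illustrative.
        String url = "http://example.org/psgc/listreg.asp";

        // Fetch and parse the page with jsoup.
        Document doc = Jsoup.connect(url).get();

        // Older MySQL connectors may need Class.forName("com.mysql.jdbc.Driver")
        // before getConnection(); newer ones register themselves automatically.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/psgc", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO locations (code, name) VALUES (?, ?)")) {

            // The old pages lay everything out in <table>s, so we walk the
            // rows and pick the cells we need by position.
            Elements rows = doc.select("table tr");
            for (Element row : rows) {
                Elements cells = row.select("td");
                if (cells.size() < 2) {
                    continue; // skip header and spacer rows
                }
                String name = cells.get(0).text().trim();
                String code = cells.get(1).text().trim();

                stmt.setString(1, code);
                stmt.setString(2, name);
                stmt.executeUpdate();
            }
        }
    }
}
```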
Even though I only used a not-so-powerful computer, I was able to finish the crawl with ease. The crawler was designed to make good use of the processor cores, which made the crawling much faster. The computer I used is just a laptop with only two cores, but the crawler spawns one fewer worker thread than the number of cores, leaving the remaining core for the main process, which does the same set of jobs as the threads. Just some basic parallel processing. My previous professor even called it "Naive Parallel Programming". lol
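Here is a rough sketch of that thread setup: the worker count is one less than the available cores, and the main thread does the same crawl work instead of sitting idle. The URLs and the crawl() method are placeholders for the actual fetch-parse-store routine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlScheduler {

    public static void main(String[] args) throws InterruptedException {
        // One worker thread fewer than the number of cores, so the main
        // thread keeps a core to itself while doing the same kind of work.
        int cores = Runtime.getRuntime().availableProcessors();
        int workers = Math.max(1, cores - 1);

        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Hypothetical detail pages; the real URLs come from the
        // region and province listings gathered earlier.
        List<String> pages = Arrays.asList(
                "http://example.org/provinces.asp",
                "http://example.org/cities.asp");

        for (String page : pages) {
            pool.submit(() -> crawl(page));
        }

        // The main thread also crawls instead of just waiting.
        crawl("http://example.org/regions.asp");

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void crawl(String url) {
        // Placeholder for the fetch-parse-store work described above.
        System.out.println("Crawling " + url);
    }
}
```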
I was so happy when I checked the PSGC's
most updated summary (as of March 2013), because the figures matched the data I gathered from the crawl.
For now, the next thing I'm probably going to do is refine the data.
I hope God will give me the strength to do everything that needs to be done. :-)