We developed the crawler and parsers for this project. The system crawls the web and stores the fetched content in a multi-terabyte storage system. The parsers then mine the HTML content for the relevant pieces of information, aggregate them, and generate a node for publishing.
A few examples of the extraction work we have done in this domain:
- Address Extraction
- Country Guessing for a URL
- Company Name Extraction
- Phone/Email Extraction
- Image Extraction
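
To give a rough idea of what two of these tasks can look like, here is a minimal Python sketch of phone/email extraction and country guessing for a URL. It is an illustration only, not the project's actual parsers: the function names, the regular expressions, and the small `CCTLD_TO_COUNTRY` table are assumptions made for this example, and a production system would need far more robust rules (for instance, many sites on `.com` domains cannot be placed by their domain suffix alone).

```python
import re
from urllib.parse import urlparse

# Deliberately simple patterns (assumptions for this sketch); real-world
# email and phone formats vary far more than these cover.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical, tiny lookup table from country-code TLD to country name.
CCTLD_TO_COUNTRY = {
    "de": "Germany",
    "fr": "France",
    "in": "India",
    "uk": "United Kingdom",
}


def extract_emails(html):
    """Return the unique email-like strings found in the HTML, sorted."""
    return sorted(set(EMAIL_RE.findall(html)))


def extract_phones(html):
    """Return phone-like digit runs found in the HTML, in document order."""
    return [match.group().strip() for match in PHONE_RE.finditer(html)]


def guess_country(url):
    """Guess a country from the URL's top-level domain, or None if unknown."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


if __name__ == "__main__":
    page = (
        '<p>Contact: sales@example.com or +49 30 1234567</p>'
        '<a href="mailto:sales@example.com">Mail us</a>'
    )
    print(extract_emails(page))
    print(extract_phones(page))
    print(guess_country("http://www.example.de/about"))
```

The TLD lookup is only the cheapest possible signal for country guessing; in practice it would likely be combined with other evidence from the page, such as the extracted addresses and phone prefixes.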

