.Net HTML Parser - Parse your HTML content with speed and ease
What is HTMLParser.Net?
HTMLParser.Net is a .Net library built on the codebase of the popular Java-based
HTMLParser project available on sourceforge.net. If you are building
applications that involve screen scraping of HTML pages or data extraction
from web sites, then you definitely want to have a tool like HTMLParser.Net
in your arsenal. Parsing a page is as simple as writing 4 lines of code, and
you are on your way. And if you want to be a little more creative with your
parsing and querying of results, the API offers more advanced features that
are easy to use.
Community Edition
We offer a community edition of the library as a free download. This edition has all the features of the
professional edition except support for MIME types like PDF, MS Office documents, XML, etc. and
multithreaded crawling capabilities. But if your needs are limited to the text/html MIME type, then this is a great
library to keep in your tool chest.
Features
The feature list of the API includes:
- You can use it with any .Net language (C#, VB.Net, J#, etc.).
- Parses almost all HTML tags and lets you search by tag type, attribute
value, or regular expression match in the content. Some tags that were not
supported by the Java-based HTMLParser project are included in this release.
- A set of extensible filters that lets you filter out content that you do not
want to include in your analysis.
- High-level APIs that answer common questions such as: What are the outbound
links on the page? What images are on the page? What tables are on the page?
Are there any broken links on the page? And much more.
- A configuration-file-based Http protocol engine that fetches the content from
the URL that you specify. The crawler follows the instructions in the site's
robots.txt file and does not fetch the content if the site blocks that page.
- The Http protocol engine is fully capable of handling compressed responses sent
from any site. It accepts the gzip, x-zip and deflate content encodings.
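The robots.txt check and the handling of compressed responses described in the feature list can be illustrated in a language-neutral way. The sketch below uses only the Python standard library; it is not HTMLParser.Net code and does not reflect its API. The function names `decode_body` and `allowed_by_robots` are hypothetical, and the sketch handles only the gzip (with its `x-gzip` alias) and deflate encodings.

```python
import gzip
import zlib
from urllib.robotparser import RobotFileParser


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Undo the Content-Encoding that a server applied to a response body."""
    encoding = content_encoding.strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # In HTTP, "deflate" normally means a zlib-wrapped stream.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not handle: return bytes untouched.
    return body


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = b"<html><body><a href='/next'>next</a></body></html>"
    robots = "User-agent: *\nDisallow: /private/\n"

    # A crawler would check robots.txt first, then decode whatever it fetched.
    if allowed_by_robots(robots, "ExampleBot", "http://example.com/index.html"):
        print(decode_body(gzip.compress(page), "gzip").decode("ascii"))
```

The fallback in the deflate branch is a defensive choice: a crawler that visits arbitrary sites will meet servers that label a raw deflate stream as `deflate`, so trying the zlib-wrapped form first and the raw form second avoids failing on either.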
Commercial Use Of Parser
The following are links to some of the web scraper APIs and applications that we have built for our clients. This is
not a complete list; these examples are here to show you some of the real-life uses of our parser.
Releases
V4.0 (Pro Edition) 6/1/2010
- Upgraded to .Net 4.0 framework
- Fixed assembly attributes to comply with .Net 4.0 security changes
- Bug fixes and optimization changes
V3.5 (Pro Edition) 2/28/2009
- Added ability to specify parser configuration per instance of object. This made multi-threaded
use of the library possible.
- Optimized code for .Net 3.0 and higher
- Enabled use of Proxy server settings
V3.2 Release (Pro and Lite Edition) 9/10/2007
- Added new filters in API
- Fixed bugs in the parser that had been fixed in its Java counterpart
V3.1 Release (Pro and Lite Edition) 9/15/2006
- All bug fixes from the Pro version have been rolled into the Lite version
- New filters have been added
- Ability to override configuration settings at run time
V3.0.1 Release (Pro Edition) - 8/22/2006
- Added new XOR filter as released in V1.6 of the Java library
- Added new NodeTreeWalker class as released in V1.6 of the Java library
- Added support for overriding some settings in the configuration file at run time.
V2.1.42 Release (Pro Edition) - 8/14/2006
- Fixed a bug where the request URL was constructed incorrectly when empty request attributes were supplied
- Added cookie container support
V2.1.41 Release (Pro Edition) - 7/25/2006
- Added capability to parse a "table" on a page and create a DataTable object from it.
V2.1.39 Release (Pro Edition) - 7/1/2006
- Professional Version released.
- Added capability to specify request parameters for POST or GET type of requests
- Full capability to parse PDF, MS Office documents
- Full capability to handle compressed response from server
- New APIs added that facilitate development of multi-threaded and scalable document crawler.
Community Edition
V1.8.0 Release - 8/21/2006
- Added 2 new filters as requested by a lot of users
V1.7.3.0 Release - 3/6/2006
- Tag name change for link tag to "ATag".
- Fixed a bug that occurred when the charset was switched to a value that was not understood by the framework.
Community Edition - Release
- Released the community edition of the parser. This is a free download for educational as well as commercial use
V1.6.13.0 Release - 2/10/2006
- Fixed a bug where a string could not be used as a source for parsing
V1.6.12.0 Release - 1/16/2006
- Added capability to ignore robots.txt settings
- Added capability to delay fetching of pages.
- Bug fixes
V1.6.8.0 Release - 1/1/2006
- Added new APIs to analyze a page.
- Performance enhancements and bug fixes.
V1.6.5.0 Release - 12/30/2005
- Added handling of deflate content-encoding to Http protocol engine.
- Minor bug fixes
V1.6.3.0 Release - 12/22/2005
- First public release of the library