|
MimeDetector
During the development of our HTML Parser for .NET we ran into
the problem of handling documents other than HTML files. For example, while crawling a page there may
be a link that points to a PDF or Word document, so we had to detect the MIME type of those
documents. There are two ways to do this detection. The first approach is to trust the file
extension and parse the document based on that. The second, more robust approach is to look at
the file signature and determine the type from it.
This problem is not limited to our parsing situation. Content management applications often
allow users to upload documents to the server, and you need to restrict the types of files a user can
upload. If you do not detect the file type by looking at the content, a user can fool your system
simply by changing the file extension, bypassing your restrictions.
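To make the difference between the two approaches concrete, here is a minimal sketch in C#. It is not the MimeDetector implementation: the class name SignatureSniffer, its methods, and the short signature table are hypothetical and exist only for illustration. The real library reads its signatures from an XML file rather than hard-coding them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; it is not part of the MimeDetector library.
public static class SignatureSniffer
{
    // A few well-known leading byte sequences ("magic numbers").
    private static readonly KeyValuePair<string, byte[]>[] Signatures =
    {
        new KeyValuePair<string, byte[]>("application/pdf", new byte[] { 0x25, 0x50, 0x44, 0x46 }), // "%PDF"
        new KeyValuePair<string, byte[]>("image/png", new byte[] { 0x89, 0x50, 0x4E, 0x47 }),       // 0x89 "PNG"
        new KeyValuePair<string, byte[]>("image/gif", new byte[] { 0x47, 0x49, 0x46, 0x38 }),       // "GIF8"
        new KeyValuePair<string, byte[]>("application/zip", new byte[] { 0x50, 0x4B, 0x03, 0x04 }), // "PK" 0x03 0x04
    };

    // Approach 1: trust the file extension.
    public static string GuessFromExtension(string fileName)
    {
        string extension = (Path.GetExtension(fileName) ?? "").ToLowerInvariant();
        switch (extension)
        {
            case ".pdf": return "application/pdf";
            case ".png": return "image/png";
            case ".gif": return "image/gif";
            case ".zip": return "application/zip";
            default: return null;
        }
    }

    // Approach 2: compare the leading bytes with known signatures.
    // Returns null when no signature matches.
    public static string DetectFromContent(byte[] data)
    {
        foreach (KeyValuePair<string, byte[]> entry in Signatures)
        {
            if (StartsWith(data, entry.Value))
            {
                return entry.Key;
            }
        }
        return null;
    }

    // Accept a file only when its extension and its content agree.
    public static bool IsConsistent(string fileName, byte[] data)
    {
        string claimed = GuessFromExtension(fileName);
        return claimed != null && claimed == DetectFromContent(data);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length)
        {
            return false;
        }
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, a GIF image renamed to "report.pdf" is reported as "application/pdf" by GuessFromExtension but as "image/gif" by DetectFromContent, so IsConsistent returns false and the upload can be rejected.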
We came across a set of excellent MIME reader utility classes in the open source Nutch crawler. These classes
are written in Java. We decided to convert them to C# and bring them to the .NET user community. This is our
first public release of the converted classes. We have wrapped them in a utility library which you can
use in any project.
How to use MimeDetector?
The XML file "mime-types.xml" contains information about file types and the signatures used
to identify each content type. You need this file to create an instance of the MimeTypes object. Once you have
created the MimeTypes object, call its GetMimeType method to get the MimeType of the
file content. If the MIME type cannot be determined, the method returns null. The following
code snippet demonstrates use of the library.
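// Load the signature definitions from the XML file.
MimeTypes g_MimeTypes = new MimeTypes("mime-types.xml");
sbyte[] fileData = null;
// strFile holds the path of the file to inspect.
using (System.IO.FileStream srcFile =
    new System.IO.FileStream(strFile, System.IO.FileMode.Open, System.IO.FileAccess.Read))
{
    byte[] data = new byte[srcFile.Length];
    int offset = 0;
    // Read may return fewer bytes than requested, so loop until the buffer is full.
    while (offset < data.Length)
    {
        int read = srcFile.Read(data, offset, data.Length - offset);
        if (read == 0) break;
        offset += read;
    }
    fileData = Winista.Mime.SupportUtil.ToSByteArray(data);
}
// oMimeType is null if the type could not be determined.
MimeType oMimeType = g_MimeTypes.GetMimeType(fileData);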
If you have questions or suggestions, please post them in our
HTML Parser forum.
|