perfectxml.com
 

MSXML SAX C++ Samples

MSXML Tips: September 2002

By: Darshan Singh (Managing Editor, perfectxml.com)

        

(Page 1 of 5)  Next Page >>   

Introduction

XML is all about data, and it is important to get to that data quickly and effectively. The data sits between pairs of angle brackets (as element content) or inside attribute values. Instead of having every developer write code to process XML documents as raw text, standard APIs have been defined for parsing and processing XML. Many freely available XML parsers (or processors) implement these standard APIs, allowing us to create, manipulate, and search XML documents. On Microsoft platforms, we can use MSXML (also known as Microsoft XML Core Services) to work with XML documents. MSXML is a COM-based XML processing component, freely available from the Microsoft Web site and also shipped with various products. More details on MSXML can be found at http://www.perfectxml.com/msxml.asp.

 

The W3C (www.w3.org) has defined an XML parsing API known as the DOM (Document Object Model). It is an abstract API that treats an XML document as an in-memory tree. With DOM, the entire XML document is loaded into memory before it can be processed. Once the document is loaded, the API allows various operations, such as searching, updating, and deleting nodes. This approach is a poor fit for huge XML documents, because loading the entire document into memory requires a lot of system resources. See the following Microsoft KB articles for more details:

 

-         PRB: Loading Large XML Files into the XML DOM Drains System Resources (Q266228)

-         System Does Not Respond When You Open Large XML Files in Internet Explorer (Q269488)

 

Members of the XML-DEV mailing list (www.xml.org/xml/xmldev.shtml) started a discussion about a new streaming XML processing API that would read an XML document character by character and generate events whenever something interesting happened, such as the start of the document, the start of an element, the end of an element, or the end of the document. This discussion was led by David Megginson and resulted in the formalization of a set of interfaces for a streaming, event-based, push-model XML processing API called SAX, the Simple API for XML. Today, many parsers (including MSXML) implement SAX, allowing us to parse an XML document as a stream instead of loading the entire document into memory. This approach works best for huge documents, or whenever there is no need to hold the entire XML document in memory. MSXML 3.0 was the first release to support SAX2, and subsequent MSXML releases have enhanced the SAX support.
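To make the push model concrete, here is a minimal sketch in portable C++. It is not MSXML and not a real XML parser: EventHandler, toyParse, and EventLog are hypothetical names invented for this illustration, and the scanner understands only plain start tags, end tags, and text. What it shows is the direction of control: the scanner walks the input and calls into the handler as each piece is recognized.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler interface, loosely modeled on the idea of a SAX
// content handler. It is NOT the real SAX or MSXML interface.
struct EventHandler {
    virtual ~EventHandler() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Toy scanner: recognizes only <name>, </name>, and plain text.
// No attributes, comments, entities, or well-formedness checks.
void toyParse(const std::string& xml, EventHandler& handler) {
    std::size_t pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            std::size_t close = xml.find('>', pos);
            if (close == std::string::npos) break;   // unterminated tag: stop
            std::string tag = xml.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag[0] == '/')
                handler.endElement(tag.substr(1));   // push "end" event
            else
                handler.startElement(tag);           // push "start" event
            pos = close + 1;
        } else {
            std::size_t next = xml.find('<', pos);
            if (next == std::string::npos) next = xml.size();
            handler.characters(xml.substr(pos, next - pos));
            pos = next;
        }
    }
}

// Records every event in arrival order so the sequence can be inspected.
struct EventLog : EventHandler {
    std::vector<std::string> events;
    void startElement(const std::string& n) override { events.push_back("start:" + n); }
    void endElement(const std::string& n) override { events.push_back("end:" + n); }
    void characters(const std::string& t) override { events.push_back("text:" + t); }
};
```

Passing an EventLog to toyParse together with the string `<a><b>hi</b></a>` records five events in document order. Nothing resembling a tree is ever built, which is why this style suits large inputs.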

 

See Choosing between SAX and DOM for details on the benefits and tradeoffs of using SAX or DOM to process XML documents.

 

Note: The .NET Framework does not implement SAX; instead, it offers a different (and better) streaming API based on the pull model (rather than the push model used by SAX). This streaming XML parsing API is exposed through the XmlReader class in the System.Xml namespace. See the KB article "INFO: Roadmap for Programming XML with the Pull-Model Parser in the .NET Framework (Q313816)" for more help on this.
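The difference between push and pull is mostly about who drives the loop. The following sketch, again in portable C++, illustrates the pull style only. ToyPullReader and firstText are hypothetical names; this is not the .NET XmlReader API, and the reader handles nothing beyond plain start tags, end tags, and text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical pull-style reader for illustration only. The caller asks
// for the next node by calling read(); nothing is pushed at the caller.
class ToyPullReader {
public:
    enum NodeType { None, StartElement, EndElement, Text };

    explicit ToyPullReader(const std::string& xml)
        : xml_(xml), pos_(0), type_(None) {}

    // Advances to the next node; returns false when the input is exhausted.
    bool read() {
        value_.clear();
        type_ = None;
        if (pos_ >= xml_.size()) return false;
        if (xml_[pos_] == '<') {
            std::size_t close = xml_.find('>', pos_);
            if (close == std::string::npos) { pos_ = xml_.size(); return false; }
            std::string tag = xml_.substr(pos_ + 1, close - pos_ - 1);
            if (!tag.empty() && tag[0] == '/') { type_ = EndElement; value_ = tag.substr(1); }
            else { type_ = StartElement; value_ = tag; }
            pos_ = close + 1;
        } else {
            std::size_t next = xml_.find('<', pos_);
            if (next == std::string::npos) next = xml_.size();
            type_ = Text;
            value_ = xml_.substr(pos_, next - pos_);
            pos_ = next;
        }
        return true;
    }

    NodeType type() const { return type_; }
    const std::string& value() const { return value_; }

private:
    std::string xml_;
    std::size_t pos_;
    NodeType type_;
    std::string value_;
};

// Returns the text immediately following the first <element> start tag,
// or an empty string. The caller owns the loop and can stop early.
std::string firstText(const std::string& xml, const std::string& element) {
    ToyPullReader reader(xml);
    while (reader.read()) {
        if (reader.type() == ToyPullReader::StartElement && reader.value() == element) {
            if (reader.read() && reader.type() == ToyPullReader::Text)
                return reader.value();
            return "";
        }
    }
    return "";
}
```

Because the caller controls the loop, firstText returns as soon as it has its answer and never looks at the rest of the input. A push-style handler has to signal the parser in some way to achieve the same early exit.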

 

In this month's MSXML tip, we'll look at four C++ sample applications that illustrate the use of the SAX2 implementation in MSXML 4.0 SP1.

 

-         The first application illustrates doing basic SAX processing to count occurrences of the specified element in the given XML document.

-         The second sample application shows XSD validation using SAX.

-         The third application shows locating an element and getting the element's content.

-         The fourth and final application illustrates creating XML documents using SAX.

 

The first three applications are MFC dialog-based applications, while the last one is a console application. All the applications use the #import directive and smart pointers to work with MSXML components.

 

More details on SAX can be found at www.perfectxml.com/msxmlSAX.asp. Also be sure to check out the home page for SAX at www.saxproject.org.

 

Application 1: Counting Given Element

As mentioned earlier, SAX is an event-based (or push) API. As the parser processes the XML document, it generates events, and it is the application writer's task to handle those events. To handle events, we need to implement certain interfaces predefined by SAX. The basic SAX parsing interface is ContentHandler. The parser uses this interface to send events such as startDocument, startElement, characters, endElement, and endDocument.

 

Figure 1.1 SAX Parsing Model

 

The first step is to implement the ContentHandler interface. We then create an instance of the XMLReader class, associate our ContentHandler implementation with it, and start parsing. As the XMLReader parses the XML document, it sends parsing events by calling the methods in our implementation.

 

In this application, we'll use MSXML 4.0 SP1. If you do not have it installed, see www.perfectxml.com/msxmlFiles.asp for details.

 

This sample application is an MFC dialog-based application. It accepts an XML file name and an element name, parses the XML file using SAX, and counts the number of occurrences of the specified element.

 

Figure 1.2 Sample Application 1: Locate and count elements

 

 

Here are the steps to create the above application.

 

1.   Start Visual C++ 6.0; create a new MFC AppWizard (exe) Dialog-based project and call it MSXML_SAX2_Example1

 

2.   Add the following lines to stdafx.h:

 

#import <msxml4.dll> raw_interfaces_only

using namespace MSXML2;

 

3.   Let's now implement the ISAXContentHandler interface. We'll first create a header file and a CPP file containing a blank implementation of the ISAXContentHandler interface (each method just returns S_OK). We can then derive from this class and implement only the methods a particular application requires. Create a new header file called SAXHandlersBase.h and write the following class declaration in it:

 

#if !defined(_PERFECTXML_MSXML4SAXHANDLERS_BASE_H_)

#define _PERFECTXML_MSXML4SAXHANDLERS_BASE_H_

 

#include "stdafx.h"

 

class ContentHandlerImplBase: public ISAXContentHandler

{

public:

      ContentHandlerImplBase();

      virtual ~ContentHandlerImplBase();

 

    long __stdcall QueryInterface(const struct _GUID &,void ** );

    unsigned long __stdcall AddRef(void);

    unsigned long __stdcall Release(void);

 

    virtual HRESULT STDMETHODCALLTYPE putDocumentLocator(

            /* [in] */ ISAXLocator __RPC_FAR *pLocator);

       

    virtual HRESULT STDMETHODCALLTYPE startDocument( void);

       

    virtual HRESULT STDMETHODCALLTYPE endDocument( void);

       

    virtual HRESULT STDMETHODCALLTYPE startPrefixMapping(

            /* [in] */ wchar_t __RPC_FAR *pwchPrefix,

            /* [in] */ int cchPrefix,

            /* [in] */ wchar_t __RPC_FAR *pwchUri,

            /* [in] */ int cchUri);

       

    virtual HRESULT STDMETHODCALLTYPE endPrefixMapping(

            /* [in] */ wchar_t __RPC_FAR *pwchPrefix,

            /* [in] */ int cchPrefix);

       

    virtual HRESULT STDMETHODCALLTYPE startElement(

            /* [in] */ wchar_t __RPC_FAR *pwchNamespaceUri,

            /* [in] */ int cchNamespaceUri,

            /* [in] */ wchar_t __RPC_FAR *pwchLocalName,

            /* [in] */ int cchLocalName,

            /* [in] */ wchar_t __RPC_FAR *pwchRawName,

            /* [in] */ int cchRawName,

            /* [in] */ ISAXAttributes __RPC_FAR *pAttributes);

       

    virtual HRESULT STDMETHODCALLTYPE endElement(

            /* [in] */ wchar_t __RPC_FAR *pwchNamespaceUri,

            /* [in] */ int cchNamespaceUri,

            /* [in] */ wchar_t __RPC_FAR *pwchLocalName,

            /* [in] */ int cchLocalName,

            /* [in] */ wchar_t __RPC_FAR *pwchRawName,

            /* [in] */ int cchRawName);

       

    virtual HRESULT STDMETHODCALLTYPE characters(

            /* [in] */ wchar_t __RPC_FAR *pwchChars,

            /* [in] */ int cchChars);

       

    virtual HRESULT STDMETHODCALLTYPE ignorableWhitespace(

            /* [in] */ wchar_t __RPC_FAR *pwchChars,

            /* [in] */ int cchChars);

       

    virtual HRESULT STDMETHODCALLTYPE processingInstruction(

            /* [in] */ wchar_t __RPC_FAR *pwchTarget,

            /* [in] */ int cchTarget,

            /* [in] */ wchar_t __RPC_FAR *pwchData,

            /* [in] */ int cchData);

       

    virtual HRESULT STDMETHODCALLTYPE skippedEntity(

            /* [in] */ wchar_t __RPC_FAR *pwchName,

            /* [in] */ int cchName);

     

private:

      ULONG m_refCnt;

};

 

#endif //   !defined(_PERFECTXML_MSXML4SAXHANDLERS_BASE_H_)

 

4.   The above class derives from the ISAXContentHandler interface and declares all the methods of the IUnknown and ISAXContentHandler interfaces. We'll give real implementations only to the IUnknown methods; all the ISAXContentHandler methods will just return S_OK. With this class in place, future handlers can simply derive from it without worrying about the IUnknown methods or about the ISAXContentHandler methods they are not interested in. Create a new .CPP file named SAXHandlersBase.cpp and write the following class implementation code in it:

 

#include "stdafx.h"

#include "SAXHandlersBase.h"

 

//    ---------------------------------------------------------------------------

ContentHandlerImplBase::ContentHandlerImplBase()

{

      m_refCnt=0;

}

 

//    ---------------------------------------------------------------------------

ContentHandlerImplBase::~ContentHandlerImplBase()

{

}

 

 

//    ---------------------------------------------------------------------------

long __stdcall ContentHandlerImplBase::QueryInterface(const struct _GUID &riid, void** ppvObject)

{

      *ppvObject = NULL;

      if (riid == IID_IUnknown ||riid == __uuidof(ISAXContentHandler))

      {

            *ppvObject = static_cast<ISAXContentHandler *>(this);

      }

     

      if (*ppvObject)

      {

            AddRef();

            return S_OK;

      }    

      else return E_NOINTERFACE;

}

 

//    ---------------------------------------------------------------------------

unsigned long __stdcall ContentHandlerImplBase::AddRef()

{

       return ++m_refCnt; // NOT thread-safe

}

 

//    ---------------------------------------------------------------------------

unsigned long __stdcall ContentHandlerImplBase::Release()

{

      --m_refCnt; // NOT thread-safe

   if (m_refCnt == 0) {

        delete this;

        return 0; // Can't return the member of a deleted object.

   }

   else return m_refCnt;

}

 

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::startDocument ( )

{

      return S_OK;

}

 

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::endElement (

    unsigned short * pwchNamespaceUri,

    int cchNamespaceUri,

    unsigned short * pwchLocalName,

    int cchLocalName,

    unsigned short * pwchQName,

    int cchQName )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::putDocumentLocator (struct ISAXLocator* pLocator)

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::endDocument ( )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::startPrefixMapping (

    unsigned short * pwchPrefix,

    int cchPrefix,

    unsigned short * pwchUri,

    int cchUri )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::endPrefixMapping (

    unsigned short * pwchPrefix,

    int cchPrefix )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::startElement (

    unsigned short * pwchNamespaceUri,

    int cchNamespaceUri,

    unsigned short * pwchLocalName,

    int cchLocalName,

    unsigned short * pwchQName,

    int cchQName,

    struct ISAXAttributes * pAttributes )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::characters (

    unsigned short * pwchChars,

    int cchChars )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::ignorableWhitespace (

    unsigned short * pwchChars,

    int cchChars )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::processingInstruction (

    unsigned short * pwchTarget,

    int cchTarget,

    unsigned short * pwchData,

    int cchData )

{

      return S_OK;

}

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImplBase::skippedEntity (

    unsigned short * pwchName,

    int cchName )

{

      return S_OK;

}

 

We now have a class (ContentHandlerImplBase) that we can derive from, implementing only the methods required for the SAX processing at hand. All other methods are inherited from the base class (and simply return S_OK), and the base class also takes care of component lifetime management and QueryInterface by implementing the IUnknown methods.
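The pattern used here (a base class that owns the reference count and supplies do-nothing defaults, plus a derived class that overrides a single event) can be sketched without any COM headers. The version below is a simplified illustration: IHandler, HandlerBase, and DocumentCounter are hypothetical names, a plain int return value of 0 stands in for S_OK, and std::atomic is used as one possible way to address the "NOT thread-safe" remarks on the counter, which the article's own listing does not do.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a COM-style interface. It is NOT the real
// ISAXContentHandler; a return value of 0 plays the role of S_OK.
struct IHandler {
    virtual ~IHandler() {}
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
    virtual int startDocument() = 0;
    virtual int endDocument() = 0;
};

// Base class: owns lifetime management and supplies no-op event handlers.
class HandlerBase : public IHandler {
public:
    HandlerBase() : refCount_(0) {}

    unsigned long AddRef() override { return ++refCount_; }

    unsigned long Release() override {
        // Decrement atomically and keep the result in a local, because
        // the member must not be touched after "delete this".
        unsigned long remaining = --refCount_;
        if (remaining == 0) delete this;
        return remaining;
    }

    int startDocument() override { return 0; }
    int endDocument() override { return 0; }

private:
    std::atomic<unsigned long> refCount_;
};

// Derived handler: overrides only the one event it cares about.
class DocumentCounter : public HandlerBase {
public:
    explicit DocumentCounter(bool* destroyedFlag)
        : documents(0), destroyed_(destroyedFlag) {}
    ~DocumentCounter() override { *destroyed_ = true; }

    int startDocument() override { ++documents; return 0; }

    int documents;       // number of startDocument events seen

private:
    bool* destroyed_;    // set by the destructor so callers can observe deletion
};
```

As in the article's class, the count starts at zero, so the object must be created with new and the first owner must call AddRef. The final Release deletes the object, and the pointer must not be used afterwards.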

 

5.   Let's create one more header file and CPP file, in which we declare and implement a class that derives from the above class and counts the occurrences of the specified element. Create a new header file named SAXHandlers.h and write the following class declaration in it:

 

#if !defined(_PERFECTXML_MSXML4SAXHANDLERS_H_)

#define _PERFECTXML_MSXML4SAXHANDLERS_H_

 

#include "stdafx.h"

#include "SAXHandlersBase.h"

class ContentHandlerImpl: public ContentHandlerImplBase

{

public:

      long m_lElemCount;

      TCHAR m_szElemToMatch[MAX_PATH];

     

      ContentHandlerImpl();

      virtual ~ContentHandlerImpl();

 

    virtual HRESULT STDMETHODCALLTYPE startDocument( void);

       

    virtual HRESULT STDMETHODCALLTYPE endElement(

            /* [in] */ wchar_t __RPC_FAR *pwchNamespaceUri,

            /* [in] */ int cchNamespaceUri,

            /* [in] */ wchar_t __RPC_FAR *pwchLocalName,

            /* [in] */ int cchLocalName,

            /* [in] */ wchar_t __RPC_FAR *pwchRawName,

            /* [in] */ int cchRawName);

};

 

#endif //   !defined(_PERFECTXML_MSXML4SAXHANDLERS_H_)

 

6.   The above class derives from the class created in steps 3 and 4 (ContentHandlerImplBase) and it defines two public member variables: an element count and the element name to search in the XML document. Note that in this class we do not implement all the ISAXContentHandler interface methods, but just the methods that we'll use to count the specified element. In startDocument we'll reset the count to 0 and in endElement, we'll match the element name with the current element name being parsed, and if they match, we increment the count.

 

7.   Create a new CPP file named SAXHandlers.cpp and write the following code in it:

 

#include "stdafx.h"

#include "SAXHandlers.h"

 

//    ---------------------------------------------------------------------------

ContentHandlerImpl::ContentHandlerImpl()

{

      m_lElemCount = 0;

      m_szElemToMatch[0]=0;

}

 

//    ---------------------------------------------------------------------------

ContentHandlerImpl::~ContentHandlerImpl()

{

}

 

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImpl::startDocument ( )

{

      m_lElemCount=0;

      return S_OK;

}

 

 

//    ---------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE  ContentHandlerImpl::endElement (

    unsigned short * pwchNamespaceUri,

    int cchNamespaceUri,

    unsigned short * pwchLocalName,

    int cchLocalName,

    unsigned short * pwchQName,

    int cchQName )

{

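      // pwchQName is not null-terminated, so cap the conversion at MAX_PATH

      // bytes; the zero-initialized buffer then always stays terminated.

      // Note: wcstombs into a TCHAR buffer assumes an ANSI/MBCS (non-UNICODE) build.

      TCHAR szCurElement[MAX_PATH+1] = {0};

      wcstombs(szCurElement, pwchQName, (cchQName < MAX_PATH) ? cchQName : MAX_PATH);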

     

      if(_tcscmp(szCurElement, m_szElemToMatch) == 0)

            m_lElemCount++;

      return S_OK;

}
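An important detail in endElement is that the name arrives as a pointer plus a length, not as a null-terminated string. As an alternative to converting the name to a narrow string first, the comparison can be done directly on the wide characters. The sketch below shows that idea in portable C++. nameEquals and ElementCounter are hypothetical names, and std::wstring replaces the fixed TCHAR buffer, which is an assumption of this example and not something the sample application does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true when the (pointer, length) name equals target.
// The buffer is never assumed to be null-terminated.
bool nameEquals(const wchar_t* name, int length, const std::wstring& target) {
    if (name == nullptr || length < 0) return false;
    const std::size_t n = static_cast<std::size_t>(length);
    // Compare lengths first so that "items" does not match "item".
    return target.size() == n && target.compare(0, n, name, n) == 0;
}

// Mirrors the counting logic: reset at the start of the document,
// increment each time a matching element ends.
struct ElementCounter {
    std::wstring elementToMatch;
    long count;

    explicit ElementCounter(const std::wstring& target)
        : elementToMatch(target), count(0) {}

    void startDocument() { count = 0; }

    void endElement(const wchar_t* qName, int cchQName) {
        if (nameEquals(qName, cchQName, elementToMatch)) ++count;
    }
};
```

Checking the length before comparing characters is what keeps a prefix such as "item" from matching "items". It also means the code never reads past the characters the parser actually handed over.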

 

 

8.   We have our ContentHandler implementation ready. At this time update the default dialog as per Figure 1.2 and write the following code under the "Do Counting" button click handler:

(Page 1 of 5)  Next Page >>   
