
Sunday, April 12, 2009

Reading XML Schemas

XML is everywhere these days.  We rely heavily on it to send and analyse information.  After dealing with XML in every aspect of development, I realised that the internet is loaded with information on how to read, write and manipulate XML documents in every possible language there is, yet there is very little on how to read XML schemas.  And to make things worse, I myself have a post up on how to read XML documents using DOM.

This motivated me to write this post and to give XML schemas the respect they deserve.  What are XML schemas?  They can be explained simply as a skeleton for a given XML file.  A well-formed XML file can have any structure it likes; an XML schema restricts that structure and lets us check that a document is valid.  So an XML schema is something like a parent which looks after its child (the XML document).

Let us look at an example:

<?xml version="1.0" encoding="utf-8" ?>
<blog>
<topic date="2003-08-22">My first XML document</topic>
<topic date="2006-10-18">My second XML document</topic>
</blog>

The schema for the above XML is:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="blog">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="topic">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="date" type="xs:date" use="required" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
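
To see the schema doing its job, here is a minimal sketch that validates a document against it using XmlReaderSettings and XmlSchemaSet.  The class and method names (BlogValidationSketch, Validate) are my own for illustration, and the schema text is embedded with element and attribute names in the same lowercase form as the sample document, since XML names are case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper class, not part of the original post.
public static class BlogValidationSketch
{
    public const string Xsd = @"<xs:schema attributeFormDefault=""unqualified"" elementFormDefault=""qualified"" xmlns:xs=""http://www.w3.org/2001/XMLSchema"">
  <xs:element name=""blog"">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs=""unbounded"" name=""topic"">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base=""xs:string"">
                <xs:attribute name=""date"" type=""xs:date"" use=""required"" />
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns one message per validation error; an empty list means the document is valid.
    public static List<string> Validate(string xml, string xsd)
    {
        List<string> problems = new List<string>();

        XmlSchemaSet schemas = new XmlSchemaSet();
        using (XmlReader xsdReader = XmlReader.Create(new StringReader(xsd)))
        {
            // null means: use the targetNamespace declared in the schema itself (none here).
            schemas.Add(null, xsdReader);
        }

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        // With a handler attached, errors are collected instead of thrown.
        settings.ValidationEventHandler += (sender, e) => problems.Add(e.Severity + ": " + e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // validation happens as the document is read
        }
        return problems;
    }

    public static void Main()
    {
        string valid = "<blog><topic date=\"2003-08-22\">My first XML document</topic></blog>";
        string invalid = "<blog><topic>Missing the required date</topic></blog>";

        Console.WriteLine("valid document problems: " + Validate(valid, Xsd).Count);
        foreach (string problem in Validate(invalid, Xsd))
            Console.WriteLine(problem);
    }
}
```

The second document is perfectly well-formed, but it is rejected because the required date attribute is missing.  That is the difference between well-formed and valid.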

Now that we are familiar with XML schemas, let us look at the actual reason for this post.  Given an XML schema, how do we read it?  How do we store the values within the schema, its relationships and the other mandatory information required by XML data files?

The way I addressed this problem is using two passes:


  1. Reading all the information given in XML schema
    This was necessary as there can be attribute and element references throughout the file.  Therefore we first need to store all the information about the XML schema and then determine the relationships between each of the data elements.

  2. Forming relationships between elements
    After we store all the information about the XML schema, we need to traverse it to form the relationships.  Relationships in this context represent the nested nature of the XML schema.
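
The two passes can be sketched roughly as follows.  This is only an illustration under my own assumptions, not the code from my project: the class TwoPassSketch, the method FindRelationships and the tiny embedded schema (where blog refers to a topic element declared further down) are all made up for the example, and only sequences directly under a global element are followed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical illustration of the two passes, not the original implementation.
public static class TwoPassSketch
{
    public const string Xsd = @"<xs:schema xmlns:xs=""http://www.w3.org/2001/XMLSchema"">
  <xs:element name=""blog"">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref=""topic"" maxOccurs=""unbounded"" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name=""topic"" type=""xs:string"" />
</xs:schema>";

    public static List<string> FindRelationships(XmlSchema schema)
    {
        // Pass 1: store every globally declared element by name.
        Dictionary<string, XmlSchemaElement> globals = new Dictionary<string, XmlSchemaElement>();
        foreach (XmlSchemaObject item in schema.Items)
        {
            XmlSchemaElement element = item as XmlSchemaElement;
            if (element != null)
                globals[element.Name] = element;
        }

        // Pass 2: walk each stored element and link it to its children.
        // References can now be resolved even if the target was declared later in the file.
        List<string> relationships = new List<string>();
        foreach (XmlSchemaElement parent in globals.Values)
        {
            XmlSchemaComplexType type = parent.SchemaType as XmlSchemaComplexType;
            XmlSchemaSequence sequence = type == null ? null : type.Particle as XmlSchemaSequence;
            if (sequence == null)
                continue;

            foreach (XmlSchemaObject child in sequence.Items)
            {
                XmlSchemaElement childElement = child as XmlSchemaElement;
                if (childElement == null)
                    continue;

                bool isReference = !childElement.RefName.IsEmpty;
                string childName = isReference ? childElement.RefName.Name : childElement.Name;
                // Only record a reference when pass 1 actually found its target.
                if (!isReference || globals.ContainsKey(childName))
                    relationships.Add(parent.Name + " -> " + childName);
            }
        }
        return relationships;
    }

    public static void Main()
    {
        XmlSchema schema = XmlSchema.Read(new StringReader(Xsd), null);
        foreach (string relationship in FindRelationships(schema))
            Console.WriteLine(relationship);
    }
}
```

If we tried to resolve the reference while still reading the blog element, topic would not have been seen yet.  Storing everything first removes that ordering problem.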

Let us look at how we start reading an XML schema.  For that we have to understand a little about the different aspects of an XML schema.  A good introduction is given by the W3Schools XML Schema tutorial.  Let me briefly go over the main types here as well, just to have a reference to them:


  1. Simple types
    1. Elements
    2. Attributes
    3. Restrictions

  2. Complex types
    1. Elements
    2. Empty
    3. Elements only
    4. Text only
    5. Mixed
    6. Indicators
    7. <any>
    8. <anyAttribute>
    9. Substitution

  3. Data types

I decided to represent each of the above types with data structures of my own, which I can understand, to make things simpler.  You don’t have to do this; you can simply rely on the .NET representation of these schema types.  Let us look at how I chose to represent each of these structures.  Let me introduce a class diagram which we will analyse a bit later.


[Image: class diagram of the IAttribute, IElementComplex and IRelationship interfaces]


The above diagram shows three interfaces which make things much simpler for us and also abstract us away from the .NET representation of the schema:


  1. IAttribute
    Any attribute-like schema object will inherit from this interface.  Note that I said “attribute-like”, as I also want simple types (both implicitly and explicitly declared) to be used as attributes.

  2. IElementComplex
    Represents any complex type object within the schema.  This includes both implicitly and explicitly declared types.

  3. IRelationship
    Merely a wrapper around the complex element type which makes it easier for us to form the relationships between them in our second pass.

Note that the above classes were developed a while ago, while I was working on my Part IV engineering project (in 2005).  Only now have I realised that there is little to no help on how to read an XML schema, so I decided to publish this post.  By no means is this “the only” or “the best” way, and I am sure there are better ways of doing this.

Let me briefly give you an example of what implicitly and explicitly declared schema elements are with the XML schema below:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema id="ImplicityExplicit" xmlns="http://tempuri.org/ImplicityExplicit.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:mstns="http://tempuri.org/ImplicityExplicit.xsd" elementFormDefault="qualified" targetNamespace="http://tempuri.org/ImplicityExplicit.xsd">
<!--Explicit complex type-->
<xs:complexType name="CTEntity" />

<!--Implicit complex type-->
<xs:element name="CTElemEntity">
<xs:complexType></xs:complexType>
</xs:element>

<!--Explicit simple type-->
<xs:simpleType name="STEntity">
<xs:restriction base="xs:string" />
</xs:simpleType>

<!--Implicit simple type-->
<xs:element name="STElemEntity">
<xs:simpleType>
<xs:restriction base="xs:string" />
</xs:simpleType>
</xs:element>

<!--Implicit shy simple type-->
<xs:element name="STElem" />

<!--An attribute-->
<xs:attribute name="attribute" />
</xs:schema>
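
To make the distinction concrete, here is a small sketch of how the same declarations look once .NET has read them.  The names DeclarationKindSketch and Describe are hypothetical, and the embedded schema is a cut-down copy of the one above without the target namespace.  Explicit (named) types show up as top-level items of their own, while implicit (anonymous) types hang off the element's SchemaType property.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper class, for illustration only.
public static class DeclarationKindSketch
{
    public const string Xsd = @"<xs:schema xmlns:xs=""http://www.w3.org/2001/XMLSchema"">
  <xs:complexType name=""CTEntity"" />
  <xs:element name=""CTElemEntity"">
    <xs:complexType></xs:complexType>
  </xs:element>
  <xs:simpleType name=""STEntity"">
    <xs:restriction base=""xs:string"" />
  </xs:simpleType>
  <xs:element name=""STElemEntity"">
    <xs:simpleType>
      <xs:restriction base=""xs:string"" />
    </xs:simpleType>
  </xs:element>
  <xs:element name=""STElem"" />
  <xs:attribute name=""attribute"" />
</xs:schema>";

    // Describes one top-level schema item in the terms used in this post.
    public static string Describe(XmlSchemaObject item)
    {
        XmlSchemaComplexType complexType = item as XmlSchemaComplexType;
        if (complexType != null)
            return "explicit complex type " + complexType.Name;

        XmlSchemaSimpleType simpleType = item as XmlSchemaSimpleType;
        if (simpleType != null)
            return "explicit simple type " + simpleType.Name;

        XmlSchemaElement element = item as XmlSchemaElement;
        if (element != null)
        {
            // SchemaType is only set when the type is declared inline (anonymously).
            if (element.SchemaType is XmlSchemaComplexType)
                return "element " + element.Name + " with an implicit complex type";
            if (element.SchemaType is XmlSchemaSimpleType)
                return "element " + element.Name + " with an implicit simple type";
            // SchemaTypeName is set when the element uses type="...".
            if (!element.SchemaTypeName.IsEmpty)
                return "element " + element.Name + " of named type " + element.SchemaTypeName.Name;
            return "element " + element.Name + " with no declared type";
        }

        XmlSchemaAttribute attribute = item as XmlSchemaAttribute;
        if (attribute != null)
            return "attribute " + attribute.Name;

        return item.GetType().Name;
    }

    public static List<string> DescribeAll(string xsd)
    {
        XmlSchema schema = XmlSchema.Read(new StringReader(xsd), null);
        List<string> descriptions = new List<string>();
        foreach (XmlSchemaObject item in schema.Items)
            descriptions.Add(Describe(item));
        return descriptions;
    }

    public static void Main()
    {
        foreach (string description in DescribeAll(Xsd))
            Console.WriteLine(description);
    }
}
```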

Now that we know the basic types of XML schema elements, let us look at how we can read/extract them.  For this we rely heavily on the .NET XML classes.  Reading the XML schema is quite simple; it is done using the following code:

using System;
using System.Xml;
using System.Xml.Schema;

namespace XmlLibrary
{
    public class SchemaReader
    {
        public String FileName { get; set; }
        public XmlSchema Schema { get; set; }

        private bool SchemaLoaded { get; set; }

        public SchemaReader(string fileName)
        {
            FileName = fileName;
            ReadSchema();
        }

        public void ExtractElements()
        {
            if (SchemaLoaded && Schema != null)
            {
                XmlSchemaObjectCollection itemsColl = Schema.Items;
                foreach (XmlSchemaObject item in itemsColl)
                {
                    // TODO: Implementation
                }
            }
        }

        // Reads the schema file named by FileName into the Schema property.
        private void ReadSchema()
        {
            if (String.IsNullOrEmpty(FileName))
            {
                SchemaLoaded = false;
                Schema = null;
            }
            else
            {
                try
                {
                    // Dispose the reader so the file handle is released after reading.
                    using (XmlTextReader reader = new XmlTextReader(FileName))
                    {
                        // No ValidationEventHandler is supplied, so schema errors
                        // surface as an XmlSchemaException.
                        Schema = XmlSchema.Read(reader, null);
                    }
                    SchemaLoaded = true;
                }
                catch (XmlException)
                {
                    // The file is not well-formed XML.
                    SchemaLoaded = false;
                    Schema = null;
                }
                catch (XmlSchemaException)
                {
                    // The file is XML but not a usable schema.
                    SchemaLoaded = false;
                    Schema = null;
                }
            }
        }
    }
}

The important method in the above code is “ReadSchema”.  Note that the constructor takes a string which points to the schema file on disk; ReadSchema then reads this file and populates the System.Xml.Schema.XmlSchema object.  From this point forward we will only be concentrating on the ExtractElements method.
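
If you would rather not tie the loading to a file on disk, one possible variation is a small “try” style helper that reads from any TextReader.  This is just a sketch under my own naming (SchemaLoadSketch and TryLoad are hypothetical); it relies on the same XmlSchema.Read call and catches the same two exception types as the class above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical variation on ReadSchema, for illustration only.
public static class SchemaLoadSketch
{
    // Returns true and sets schema when the source could be read as an XML schema;
    // returns false and sets schema to null otherwise.
    public static bool TryLoad(TextReader source, out XmlSchema schema)
    {
        schema = null;
        try
        {
            using (XmlReader reader = XmlReader.Create(source))
            {
                // A null handler means schema errors are thrown rather than reported.
                schema = XmlSchema.Read(reader, null);
            }
            return true;
        }
        catch (XmlException)
        {
            schema = null;   // not well-formed XML
            return false;
        }
        catch (XmlSchemaException)
        {
            schema = null;   // XML, but not a usable schema
            return false;
        }
    }

    public static void Main()
    {
        string xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                   + "<xs:element name=\"blog\" type=\"xs:string\" />"
                   + "</xs:schema>";

        XmlSchema schema;
        if (TryLoad(new StringReader(xsd), out schema))
            Console.WriteLine("Top-level items: " + schema.Items.Count);

        if (!TryLoad(new StringReader("<a><b></a>"), out schema))
            Console.WriteLine("Second input could not be read as a schema");
    }
}
```

Reading from a TextReader also makes the loader easy to exercise with an in-memory string, which is handy while experimenting.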


Let us examine the ExtractElements method closely.  Currently it is stubbed out.  If you remember, I mentioned that an XML schema has a nested/recursive nature to it, so we expect to end up calling a dedicated method recursively.  Let us examine how we do it.  After fleshing out the method a bit further we have the following:

public void ExtractElements()
{
    if (SchemaLoaded && Schema != null)
    {
        XmlSchemaObjectCollection itemsColl = Schema.Items;
        foreach (XmlSchemaObject item in itemsColl)
        {
            ManipulateSchemaObject(item);
        }
    }
}

private void ManipulateSchemaObject(XmlSchemaObject schemaObject)
{
    // IndentLevel and GetIndentationString() are members of the class
    // which are not shown in this post.
    IndentLevel++;

    #region XmlSchemaElement
    if (schemaObject is XmlSchemaElement)
    {
        XmlSchemaElement schemaElement = schemaObject as XmlSchemaElement;
        Console.WriteLine(String.Format("{0}XmlSchemaElement: {1} ({2})", GetIndentationString(), schemaElement, schemaElement.Name));

        // Elements can contain further schema objects, so recurse into them.
        ManipulateSchemaElement(schemaElement);
    }
    #endregion XmlSchemaElement
    #region XmlSchemaComplexType
    else if (schemaObject is XmlSchemaComplexType)
    {
        XmlSchemaComplexType explicitComplexType = schemaObject as XmlSchemaComplexType;
        Console.WriteLine(String.Format("{0}XmlSchemaComplexType: {1} ({2})", GetIndentationString(), explicitComplexType, explicitComplexType.Name));
    }
    #endregion XmlSchemaComplexType
    #region XmlSchemaSimpleType
    else if (schemaObject is XmlSchemaSimpleType)
    {
        XmlSchemaSimpleType explicitSimpleType = schemaObject as XmlSchemaSimpleType;
        Console.WriteLine(String.Format("{0}XmlSchemaSimpleType: {1} ({2})", GetIndentationString(), explicitSimpleType, explicitSimpleType.Name));
    }
    #endregion XmlSchemaSimpleType
    #region AttributeGroup
    else if (schemaObject is XmlSchemaAttributeGroup)
    {
        XmlSchemaAttributeGroup schemaAttGroup = schemaObject as XmlSchemaAttributeGroup;
        Console.WriteLine(String.Format("{0}XmlSchemaAttributeGroup: {1} ({2})", GetIndentationString(), schemaAttGroup, schemaAttGroup.Name));
    }
    #endregion AttributeGroup
    #region Attribute
    else if (schemaObject is XmlSchemaAttribute)
    {
        XmlSchemaAttribute schemaAtt = schemaObject as XmlSchemaAttribute;
        Console.WriteLine(String.Format("{0}XmlSchemaAttribute: {1} ({2})", GetIndentationString(), schemaAtt, schemaAtt.Name));
    }
    #endregion Attribute
    #region Group
    else if (schemaObject is XmlSchemaGroup)
    {
        XmlSchemaGroup schemaGroup = schemaObject as XmlSchemaGroup;
        Console.WriteLine(String.Format("{0}XmlSchemaGroup: {1} ({2})", GetIndentationString(), schemaGroup, schemaGroup.Name));
    }
    #endregion Group
    else
    {
        Console.WriteLine(GetIndentationString() + schemaObject + " is not handled yet");
    }

    // Step back out once this object and its children have been handled,
    // otherwise the indentation would only ever grow.
    IndentLevel--;
}

#region schema element
private void ManipulateSchemaElement(XmlSchemaElement schemaElement)
{
    // SchemaType is only set when the element declares its type inline.
    XmlSchemaType schemaElementSchemaType = schemaElement.SchemaType;

    #region schema type defined
    if (schemaElementSchemaType != null)
    {
        #region Complex type
        if (schemaElementSchemaType is XmlSchemaComplexType)
        {
            ManipulateSchemaElementComplexType(schemaElement);
        }
        #endregion Complex type
        #region Simple type
        else if (schemaElementSchemaType is XmlSchemaSimpleType)
        {
            ManipulateSchemaElementSimpleType(schemaElement);
        }
        #endregion Simple type
    }
    #endregion schema type defined
    #region Element with no schematype, thus simple
    else
    {
        if (schemaElement.RefName.IsEmpty) // not a reference
        {
            Console.WriteLine(String.Format("{0}Implicitly implicit simple type: {1}", GetIndentationString(), schemaElement.Name));
        }
    }
    #endregion
}
#endregion schema element

#region simple type
private void ManipulateSchemaElementSimpleType(XmlSchemaElement schemaElement)
{
    XmlSchemaSimpleType simpleElement = schemaElement.SchemaType as XmlSchemaSimpleType;
    Console.WriteLine(GetIndentationString() + String.Format("Simple type encountered: {0}", schemaElement.Name));
}
#endregion simple type

#region complex type
private void ManipulateSchemaElementComplexType(XmlSchemaElement schemaElement)
{
    XmlSchemaComplexType complexElement = schemaElement.SchemaType as XmlSchemaComplexType;

    // Attributes declared directly on the complex type.
    XmlSchemaObjectCollection attColl = complexElement.Attributes;
    foreach (XmlSchemaObject attCollObj in attColl)
    {
        ManipulateSchemaObject(attCollObj);
    }

    if (complexElement.Particle != null)
    {
        // Child elements live under a particle (sequence, choice, ...).
        XmlSchemaParticle complexElementParicle = complexElement.Particle;
        ParticleHandlingForElement(complexElementParicle);
    }
    else if (complexElement.ContentModel != null)
    {
        XmlSchemaContentModel complexElementContentModel = complexElement.ContentModel;
        if (complexElementContentModel is XmlSchemaSimpleContent)
        {
            XmlSchemaSimpleContent contentModelSimpleType = complexElementContentModel as XmlSchemaSimpleContent;
            XmlSchemaContent schemaContent = contentModelSimpleType.Content;
            if (schemaContent is XmlSchemaSimpleContentExtension)
            {
                // Text content extended with attributes.
                XmlSchemaSimpleContentExtension xmlSchemaExtension = schemaContent as XmlSchemaSimpleContentExtension;
                XmlSchemaObjectCollection objColl = xmlSchemaExtension.Attributes;
                foreach (XmlSchemaObject item in objColl)
                {
                    ManipulateSchemaObject(item);
                }
            }
        }
        else
        {
            Console.WriteLine(GetIndentationString() + complexElementContentModel.ToString());
            // Handle restrictions and extensions here (Complex or simple type)
        }
    }
}
#endregion complex type

#region particle handling for element
private void ParticleHandlingForElement(XmlSchemaParticle complexElementParicle)
{
    if (complexElementParicle is XmlSchemaSequence)
    {
        XmlSchemaSequence particleAsSeq = complexElementParicle as XmlSchemaSequence;
        ManipulateSchemaSequence(particleAsSeq);
    }
    else if (complexElementParicle is XmlSchemaChoice)
    {
        XmlSchemaChoice complexChoice = complexElementParicle as XmlSchemaChoice;
        ManipulateSchemaChoice(complexChoice);
    }
}
#endregion particle handling for element

#region choice
private void ManipulateSchemaChoice(XmlSchemaChoice groupChoice)
{
    XmlSchemaObjectCollection objColl = groupChoice.Items;
    foreach (XmlSchemaObject objCollObj in objColl)
    {
        XmlSchemaObject schemaObj = objCollObj;
        if (schemaObj.GetType().Equals(typeof(XmlSchemaSequence)))
        {
            XmlSchemaSequence choiceSeq = schemaObj as XmlSchemaSequence;
            ManipulateSchemaSequence(choiceSeq);
        }
        else if (schemaObj.GetType().Equals(typeof(XmlSchemaChoice)))
        {
            XmlSchemaChoice choiceChoice = schemaObj as XmlSchemaChoice;
            ManipulateSchemaChoice(choiceChoice);
        }
        else
        {
            ManipulateSchemaObject(schemaObj);
        }
    }
}
#endregion choice

#region Sequence
private void ManipulateSchemaSequence(XmlSchemaSequence choiceSeq)
{
    XmlSchemaObjectCollection objColl = choiceSeq.Items;
    foreach (XmlSchemaObject objCollObj in objColl)
    {
        ManipulateSchemaObject(objCollObj);
    }
}
#endregion Sequence

There are quite a lot of methods in the above snippet, but they are quite straightforward, and most of them end up calling ManipulateSchemaObject again, which is what makes the traversal recursive.  Let us look at the methods which handle each different aspect of the XML schema.

  1. ManipulateSchemaObject(XmlSchemaObject schemaObject)
    Handles the various schema objects.  This is the recursive method which sits at the centre of our XML schema reader.  It acts like a dispatcher which directs the flow depending on the type of the XML schema object.
  2. ManipulateSchemaElement(XmlSchemaElement schemaElement)
    Handles various schema elements, Complex and Simple.
  3. ManipulateSchemaElementSimpleType(XmlSchemaElement schemaElement)
    Handles the simple type.
  4. ManipulateSchemaElementComplexType(XmlSchemaElement schemaElement)
    Handles the complex type.
  5. ParticleHandlingForElement(XmlSchemaParticle complexElementParicle)
    Handles particles, which includes sequences and choice elements.
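
The particle handling above only deals with sequences and choices.  As a possible alternative (a sketch of my own, not what the code above does), note that XmlSchemaSequence, XmlSchemaChoice and XmlSchemaAll all derive from XmlSchemaGroupBase, which exposes the Items collection.  A single recursive method can therefore walk any of them.  The names ParticleWalkSketch, CollectElementNames and ParticleOf, as well as the sample schema, are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical alternative to the per-type particle methods, for illustration only.
public static class ParticleWalkSketch
{
    public const string Xsd = @"<xs:schema xmlns:xs=""http://www.w3.org/2001/XMLSchema"">
  <xs:element name=""entry"">
    <xs:complexType>
      <xs:sequence>
        <xs:element name=""title"" type=""xs:string"" />
        <xs:choice>
          <xs:element name=""text"" type=""xs:string"" />
          <xs:element name=""link"" type=""xs:anyURI"" />
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name=""author"">
    <xs:complexType>
      <xs:all>
        <xs:element name=""name"" type=""xs:string"" />
        <xs:element name=""email"" type=""xs:string"" />
      </xs:all>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns the names of all elements found under the given particle, in document order.
    public static List<string> CollectElementNames(XmlSchemaParticle particle)
    {
        List<string> names = new List<string>();
        Collect(particle, names);
        return names;
    }

    private static void Collect(XmlSchemaParticle particle, List<string> names)
    {
        // Sequence, choice and all share this base class, so one branch covers all three.
        XmlSchemaGroupBase group = particle as XmlSchemaGroupBase;
        if (group != null)
        {
            foreach (XmlSchemaObject child in group.Items)
                Collect(child as XmlSchemaParticle, names);
            return;
        }

        XmlSchemaElement element = particle as XmlSchemaElement;
        if (element != null)
            names.Add(element.RefName.IsEmpty ? element.Name : element.RefName.Name);
        // Other particles (xs:any, group references) are ignored in this sketch.
    }

    // Finds the particle of the named top-level element, or null if there is none.
    public static XmlSchemaParticle ParticleOf(XmlSchema schema, string elementName)
    {
        foreach (XmlSchemaObject item in schema.Items)
        {
            XmlSchemaElement element = item as XmlSchemaElement;
            if (element != null && element.Name == elementName)
            {
                XmlSchemaComplexType type = element.SchemaType as XmlSchemaComplexType;
                return type == null ? null : type.Particle;
            }
        }
        return null;
    }

    public static void Main()
    {
        XmlSchema schema = XmlSchema.Read(new StringReader(Xsd), null);
        Console.WriteLine(String.Join(", ", CollectElementNames(ParticleOf(schema, "entry")).ToArray()));
        Console.WriteLine(String.Join(", ", CollectElementNames(ParticleOf(schema, "author")).ToArray()));
    }
}
```

The trade-off is that the separate methods leave room to treat a choice differently from a sequence later on, whereas the single walk treats every group the same way.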

We have looked at how to cleanly handle the XML schema and recursively traverse its structure.  I’ll get back to this topic in a later post and show you how I formed relationships between my custom schema-based data structures using custom business objects.
