LINQ to XML and reading large XML files

LINQ to XML makes it relatively easy to read and query XML files. For example, consider the following XML file:

<?xml version="1.0" encoding="utf-8" ?>
    <users>
        <user name="User1" groupid="4" />
        <user name="User2" groupid="1" />
        <user name="User3" groupid="3" />
        <user name="User4" groupid="1" />
        <user name="User5" groupid="1" />
        <user name="User6" groupid="2" />
        <user name="User7" groupid="1" />
    </users>

Suppose you would like to find all records with groupid > 2. You could be tempted to issue the following query:

XElement doc = XElement.Load("users.xml");
    // The explicit (int?) cast yields null when the attribute is missing,
    // and null > 2 evaluates to false, so no separate null check is needed.
    var users = from u in doc.Elements("user")
                where (int?)u.Attribute("groupid") > 2
                select u;
    Console.WriteLine("{0} users match query", users.Count());

There is a flaw in this approach. XElement.Load reads the whole XML file into memory, so for a large file the query may take a long time to run and may even fail with an OutOfMemoryException. For really large XML files we need to stream through the document instead of loading all of its contents at once. XmlReader is a good alternative: it is a forward-only reader that keeps only the current node in memory, which can greatly reduce memory use.

We start by defining a User class which will be used to represent a single record:

public class User 
    {
        public string Name { get; set; }
        public int GroupId { get; set; }
    }

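Next we add a Users extension method to XmlReader. Extension methods must be declared in a static class; the class name used below is arbitrary:

public static class XmlReaderExtensions
    {
        // Streams User records one at a time; only the current node is held in memory.
        public static IEnumerable<User> Users(this XmlReader source)
        {
            while (source.Read())
            {
                if (source.NodeType == XmlNodeType.Element &&
                    source.Name == "user")
                {
                    // A missing or non-numeric groupid leaves groupId at 0.
                    int groupId;
                    int.TryParse(source.GetAttribute("groupid"), out groupId);
                    yield return new User
                    {
                        GroupId = groupId,
                        Name = source.GetAttribute("name")
                    };
                }
            }
        }
    }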

And finally we can execute the query:

using (XmlReader reader = XmlReader.Create("users.xml"))
    {
        var users = from u in reader.Users()
                    where u.GroupId > 2
                    select u;
        Console.WriteLine("{0} users match query", users.Count());
    }

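Conclusion: the second approach holds only one record in memory at a time, so it uses much less memory than the first and typically runs faster. The difference becomes noticeable on large XML files. Keep in mind that XmlReader is forward-only, so the sequence returned by Users() can be enumerated only once per reader. If you have to deal with large XML files, be cautious about loading them whole with LINQ to XML.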
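
Being cautious does not have to mean giving up LINQ to XML altogether. A possible middle ground, sketched below, is to let XmlReader walk the file and hand each matching element to XNode.ReadFrom, which materializes just that element as an XElement. The class name XmlStreamingSketch and the helper StreamElements are hypothetical names chosen for this illustration and are not part of the original code. The sketch also assumes that the matching elements are small and are not nested inside each other.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper class, introduced only for this illustration.
public static class XmlStreamingSketch
{
    // Yields every element named elementName as a standalone XElement.
    // Only one such element is materialized at a time.
    public static IEnumerable<XElement> StreamElements(string path, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node that follows it, so Read() is not called here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    // Any other node (root element, whitespace, end tags): move on.
                    reader.Read();
                }
            }
        }
    }
}
```

With this helper the first query can stay almost unchanged: replace doc.Elements("user") with XmlStreamingSketch.StreamElements("users.xml", "user"). The loop deliberately avoids calling Read() right after ReadFrom, because ReadFrom already advances the reader and an extra Read() could skip an element that immediately follows the previous one.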
