search
Japanese Chinese Nederlands Espanol Italiano Deutsch Francais Twitter Rss Feeds
MicrosoftArticlesForumsFAQs
C# .NET
VB.NET
Visual Studio .NET
ADO.NET
Xml / Xslt
VB 6.0
.NET CF
GDI+
LINQ
Deployment
Security
FoxPro
Silverlight / WPF
Entity Framework
RIA Services

Web ProgrammingArticlesForumsFAQs
JavaScript
ASP
ASP.NET
Web Services

Non-MicrosoftArticlesForumsFAQs
NHibernate
Perl
PHP
Ruby
Java
Linux / Unix
Apple
Open Source

DatabasesArticlesForumsFAQs
SQL Server
Access
Oracle
MySQL
Other Databases

OfficeArticlesForumsFAQs
Excel
Word
Powerpoint
Outlook
Publisher
Money

Operating SystemsArticlesForumsFAQs
Windows 7
Windows Server
Windows Vista
Windows XP
Windows Update
MAC
Linux / UNIX

Server PlatformsArticlesForumsFAQs
BizTalk
Site Server
Exhange Server
IIS

Graphic DesignArticlesForumsFAQs
Macromedia Flash
Adobe PhotoShop
Expression Blend
Expression Design
Expression Web

OtherArticlesForumsFAQs
Subversion / CVS
Ask Dr. Dotnetsky
Active Directory
Networking
Uninstall Virus
Job Openings
Product Reviews
Search Engines
Resumes

 

Comparing Datasets using Linq


By Yuri Kasan
Printer Friendly Version
View My Articles
39 Views
    

How to compare two datasets using Linq to DataSet


One of my recent projects included simple and I believe, commonly done tasks.  We were implementing project utilizing in Microsoft Visual Studio Professional edition, with C# 3.0 and the task was to compare two snapshots taken from the same table with particular time interval. The purpose of this task was to determine if any changes where done on the server side within given time interval.

The first way that came to my mind was to use the DataSet.Merge() method.  Ideally, this is how it should work:

At some point of the application initialization, we “grab” the first snapshot of data and load it into the original DataSet (let’s call it dsOriginal), then on the Timer event, when the given interval has passed, we take second snapshot (let’s call it dsChanged). The rest should be a fairly routine procedure:

1.       Create an empty DataSet (we will call it dsDifferences).
DataSet dsDifferences = new Dataset();

2.       Merge our dsOriginal with dsChanged
dsOriginal.Merge(dsChanged);

3.       The last step is to get the modified or added rows using the DataSet.GetChanges() method
dsDifferences = dsOriginal.GetChanges();

 

This is a very “stylish” and simple solution, but not so easy as in concept.
We were unpleasantly surprised when we found out that merge datasets do not change RowSatus property of the rows in the tables of the dataset.  Just to make sure that we did nothing wrong, I looked at MSDN library again and the MSDN materials confirmed that the Merge method combines rows (equivalent  to SQL Union statement) when no Primary Key present, or updates rows in target DataSet using source Dataset when Primary Key was defined. To resolve this issue we contacted the Microsoft Support team and after a few days of email exchange, we got the following message: I think the root of the issue is that Merge does not change row state.  If you create a dataset, then manually change a row value, then row state is tweaked and GetChanges() behaves as expected.”

But Merge does a “dumb” merge and does not set other flags from that we are seeing, but I am trying to find documentation to support that.  If that is the case, the only option is to do a manual compare.  I will keep you posted on what I find…

As you can see from the message that Microsoft suggestion was to go manually trough both DataSets comparing values row by row and manually update settings in the resulting Dataset using Row.SetAdded() or Row.SetModified()  methods according to the situation – a solution not as elegant. The fact that the method provided above is available for developers in Framework 2.0 and later for extra flexibility,  it is hard to imagine when these methods can be used in practical terms. After a day or two of additional research and conversations with Microsoft , we received another email from the support team, where we got the clue that the issue cannot be resolved: “I have raised the issue with appropriate personnel and I am waiting to hear back from them.  I will let you know something as soon as I find out more.

That’s when we turned our heads to LINQ.  I did not have extensive experience with this new feature, but in a few hours found it very easy to learn and apply. There are a few ways using LINQ to achieve the same result. For our purpose, we selected involvement of Union method. Therefore, here is the “moment of Zen”:

  1. During Initialization of our application, we still create an initial snapshot of the data
    in the sample below we will call it dsOriginal.
  2. On the Elapsed event of the timer we  create another snapshot of current data state
    we will call it dsChanged.
  3. To use LINQ we need to get the tables from both datasets into an enumerable row collection. This can be done using the AsEnumerable() method (see code sample below) and we are ready to use LINQ magic.

Below is the code, which completes the task:

 

var orig = dsOriginal.Tables[0].AsEnumerable();

var updated = dsChanged.Tables[0].AsEnumerable();

//First, getting new records if any           

var newRec =

            from u in dsChanged

            where !(from o in orig

                     select o.Field<decimal>("PRIMARY_KEY_FIELD"))

                 .Contains(u.Field<decimal>(" PRIMARY_KEY_FIELD"))

                        select new

                        {

                           prim_key = u.Field<decimal>("PRIMARY_KEY_FIELD"),

                           field1 = u.Field<decimal>("FIELD1"),

                           field2=u.Field<decimal>("FIELD2"),

                           field3 = u.Field<decimal>("FIELD3"),

                           field4 = u.Field<decimal>("FIELD4"),

                           rec_type="A"//Added

                        };

 

//Secondly, getting updated records

 var updRec = from u in updated

              join o in orig

              on u.Field<decimal>("PRIMARY_KEY_FIELD")

              equals o.Field<decimal>("PRIMARY_KEY_FIELD")
             where (u.Field<decimal>("FIELD1") !=
                  o.Field<decimal>("FIELD1")) ||
                  (u.Field<decimal>("FIELD2") !=
                  o.Field<decimal>("FIELD2"))

                         select new

                         {

                           prim_key = u.Field<decimal>("PRIMARY_KEY_FIELD"),

                           field1 = u.Field<decimal>("FIELD1"),

                           field2=u.Field<decimal>("FIELD2"),

                           field3 = u.Field<decimal>("FIELD3"),

                           field4 = u.Field<decimal>("FIELD4"),

                           rec_type = "M"//Mofified

                         };

var Union = newRec.Union(updRec);

 

 

The code snippet is simple and self-explanatory. In the first selected statement we are getting new records by using Contains() method  preceding  it with the “!” operator which gives us only new(Added) rows. In the second statement we get the rows with updated values using comparison in the Where clause, and finally, we compare the results using Union() method.  

 

You probably noticed the variable rec_type used in both queries. When you need to know which record was modified and which is new (added), the Union operator comes in handy because it allows you to create a custom flag on DataRow with this information, but if you only need to get differences without knowing if the row added or modified, you can get all the data in “one shot” and the code snippet below demonstrates how:

 

 

 

 

 

 

var AddedAndModif = from u in updated

                        where !(from o in orig

                                select o.Field<decimal>("PRIMARY_KEY"))

                               .Contains(u.Field<decimal>("PRIMARY_KEY"))

                        || !(from o in orig

                                select o.Field<decimal>("FIELD1"))

                               .Contains(u.Field<decimal>(“FIELD1"))

                        select new

                        {

                           prim_key = u.Field<decimal>("PRIMARY_KEY"),

                           field1 = u.Field<decimal>("FIELD1"),

                           field2=u.Field<decimal>("FIELD2"),

                           field3 = u.Field<decimal>("FIELD3"),

                           field4 = u.Field<decimal>("FIELD4"),

                        };

 

 

As you can see from this example, the Contains() method used twice solves the task and provides you with collection of both, modified and newly added rows.

 

 

In our opinion, among all other options, this solution appears to be fastest.  It is easy to read and modify as needed and showed about 20% better performance when comparing it to the same result achieved by the method suggested by MS Support professionals  provided at the top of this article (comparing row by row and manually changing RowSate using  Row.SetAdded() or Row.SetModified() methods).

 

Since the situation provided in this discussion is common, we decided to share our solution, thinking that it might make another developers life easier.

Aaron Katz - Certified PMP,
Yuri Kasan - MCAD


Biography - Yuri Kasan
Yuri Kasan, is the lead developer of the corporate intranet at Mellon Investor Services, LLC. Yuri, who is an Microsoft Certified Developer specializes in the implementation of a wide variety of applications using Microsoft technologies. His expertise includes assisting his users in maximizing their investment in Microsoft desktop, server and development technologies. You can reach Yuri via email at ykasan@yahoo.com

button
Article Discussion: Comparing Datasets using Linq
Yuri Kasan posted at Monday, October 13, 2008 8:22 AM
Original Article
 

Expanding Code to support deleting and diffgram generation
Romain Lagrange replied to Yuri Kasan at Tuesday, March 03, 2009 12:32 PM
Thanks for this usefull post.

Here is my two cents to add some additional functionalities :
- Finding deleted rows (not tested but quite easy)
- Multi Table support
- Generate a diffgram.


static public class PublicationFactory
{
    /// <summary>
    /// Create a diffgram from two datasets
    /// diffgram contains the list of row to create, update and delete to pass from old to new step
    /// old and new datasets must have the same schema
    /// the old dataset give the list of table to compute
    /// Is supposed that all data have a ToString() method wich render a representative value of the data
    /// Is supposed that a column named "PK" got a unique string identifier
    /// </summary>
    /// <param name="oldDataSet">Old step</param>
    /// <param name="newDataSet">Current step</param>
    /// <returns>diffgram as dataset</returns>
    static public DataSet DeltaMaker(DataSet oldDataSet, DataSet newDataSet)
    {
        DataSet datasetdiff;
        DataRow tempRow;
        DataTable tempTable;
        string tempTableName;
        datasetdiff = new DataSet();

        // concat all data of each field, in order to compare content
        TTSDataHelper.BeginSerializeHelper(oldDataSet);
        TTSDataHelper.BeginSerializeHelper(newDataSet);

        for (int i = 0; i < oldDataSet.Tables.Count; i++)
        {
            tempTableName = oldDataSet.Tables[i].TableName;
            IEnumerable<DataRow> orig = oldDataSet.Tables[tempTableName].AsEnumerable();
            IEnumerable<DataRow> updated = newDataSet.Tables[tempTableName].AsEnumerable();

            IEnumerable<DataRow> Added = // in updated, wich don't exist in origin
                from u in updated
                where !(from o in orig
                        select o.Field<string>("PK"))
                    .Contains(u.Field<string>("PK"))
                select u;
            IEnumerable<DataRow> Modif = //  in updated, data different from origin
                from u in updated
                where !(from o in orig
                        select o.Field<string>("SERIALIZED"))
                    .Contains(u.Field<string>("SERIALIZED"))
                    && !(from a in Added select a.Field<string>("PK")) // ignore just added
                         .Contains(u.Field<string>("PK"))
                select u;
            IEnumerable<DataRow> Deleted = // in origin, non existing in updated
                from o in orig
                where !(from u in updated
                        select u.Field<string>("PK"))
                    .Contains(o.Field<string>("PK"))
                select o;

            datasetdiff.Tables.Add(oldDataSet.Tables[tempTableName].Clone());
            tempTable = datasetdiff.Tables[tempTableName];
            tempTable.BeginLoadData();

            // Create, accept then delete
            foreach (DataRow row in Deleted)
            {
                tempRow = tempTable.Rows.Add(row.ItemArray);
                tempRow.AcceptChanges();
                tempRow.Delete();
            }

            // Create added
            foreach (DataRow row in Added)
            {
                tempRow = tempTable.Rows.Add(row.ItemArray);
                tempRow.AcceptChanges();
                tempRow.SetAdded();
            }
            // Create modified
            foreach (DataRow row in Modif)
            {
                tempRow = tempTable.Rows.Add(row.ItemArray);
                tempRow.AcceptChanges();
                tempRow.SetModified();
            }
            tempTable.EndLoadData();

            tempTable.Columns.Remove("SERIALIZED");
        }
        
        //remove data comparison helper
        TTSDataHelper.EndSerializeHelper(oldDataSet);
        TTSDataHelper.EndSerializeHelper(newDataSet);

        return datasetdiff;
    }
}
internal static class TTSDataHelper
{
    public static char separator = '§';

    public static void BeginSerializeHelper(DataSet dataset)
    {
        string serialized;
        foreach (DataTable table in dataset.Tables)
        {
            table.Columns.Add("SERIALIZED", typeof(string));
            foreach (DataRow row in table.Rows)
            {
                serialized = string.Empty;
                for (int i = 0; i < table.Columns.Count - 2; i++)
                {
                    serialized += row[i].ToString() + separator;
                }
                row["SERIALIZED"] = serialized;
            }
        }
    }

    public static void EndSerializeHelper(DataSet dataSet)
    {
        foreach (DataTable table in dataSet.Tables)
        {
            table.Columns.Remove("SERIALIZED");
        }
    }
}
 

  Better way
Tetsuo Shima replied to Romain Lagrange at Friday, October 30, 2009 4:08 AM
Hi there,

thx for that great method. I searched a long time for that, because the DataSet.Merge function does not fullfill my requirements either. There is no chance for me to retrieve a DataSet.GetChanges() because I loaded fresh DBF-files and try to merge them.

So I try this method. Nice, but for my requirements to slow. I getted timeouts at the LINQ-queries. So I pimp the LINQ queries and it looks now like this:

static public class PublicationFactory
{
/// <summary>
/// Create a diffgram from two datasets
/// diffgram contains the list of row to create, update and delete to pass from old to new step
/// old and new datasets must have the same schema
/// the old dataset give the list of table to compute
/// Is supposed that all data have a ToString() method wich render a representative value of the data
/// Is supposed that a column named "PK" got a unique string identifier
/// </summary>
/// <param name="oldDataSet">Old step</param>
/// <param name="newDataSet">Current step</param>
/// <returns>diffgram as dataset</returns>
static public DataSet DeltaMaker(DataSet oldDataSet, DataSet newDataSet)
{
DataSet datasetdiff;
DataRow tempRow;
DataTable tempTable;
string tempTableName;
datasetdiff = new DataSet();

// concat all data of each field, in order to compare content
TTSDataHelper.BeginSerializeHelper(oldDataSet);
TTSDataHelper.BeginSerializeHelper(newDataSet);

for (int i = 0; i < oldDataSet.Tables.Count; i++)
{
tempTableName = oldDataSet.Tables[i].TableName;
IEnumerable<DataRow> orig = oldDataSet.Tables[tempTableName].AsEnumerable();
IEnumerable<DataRow> updated = newDataSet.Tables[tempTableName].AsEnumerable();

List<decimal> listOrig =
(from o in orig
select o.Field<decimal>("recordNumber")).ToList<decimal>();
IEnumerable<DataRow> Added = // in updated, wich don't exist in origin
from u in updated
where !(listOrig.Contains(u.Field<decimal>("recordNumber")))
select u;
List<string> listOrigSerialized =
(from o in orig
select o.Field<string>("SERIALIZED")).ToList<string>();
List<decimal> listAdded =
(from a in Added
select a.Field<decimal>("recordNumber")).ToList<decimal>();
IEnumerable<DataRow> Modif = // in updated, data different from origin
from u in updated
where !(listOrigSerialized.Contains(u.Field<string>("SERIALIZED")))
&& !(listAdded.Contains(u.Field<decimal>("recordNumber"))) // ignore just added
select u;
List<decimal> listUpdated =
(from u in updated
select u.Field<decimal>("recordNumber")).ToList<decimal>();
IEnumerable<DataRow> Deleted = // in origin, non existing in updated
from o in orig
where !(listUpdated.Contains(o.Field<decimal>("recordNumber")))
select o;

datasetdiff.Tables.Add(oldDataSet.Tables[tempTableName].Clone());
tempTable = datasetdiff.Tables[tempTableName];
tempTable.BeginLoadData();

// Create, accept then delete
foreach (DataRow row in Deleted)
{
tempRow = tempTable.Rows.Add(row.ItemArray);
tempRow.AcceptChanges();
tempRow.Delete();
}

// Create added
foreach (DataRow row in Added)
{
tempRow = tempTable.Rows.Add(row.ItemArray);
tempRow.AcceptChanges();
tempRow.SetAdded();
}
// Create modified
foreach (DataRow row in Modif)
{
tempRow = tempTable.Rows.Add(row.ItemArray);
tempRow.AcceptChanges();
tempRow.SetModified();
}
tempTable.EndLoadData();

tempTable.Columns.Remove("SERIALIZED");
}

//remove data comparison helper
TTSDataHelper.EndSerializeHelper(oldDataSet);
TTSDataHelper.EndSerializeHelper(newDataSet);

return datasetdiff;
}
}
internal static class TTSDataHelper
{
public static char separator = '§';

public static void BeginSerializeHelper(DataSet dataset)
{
string serialized;
foreach (DataTable table in dataset.Tables)
{
table.Columns.Add("SERIALIZED", typeof(string));
foreach (DataRow row in table.Rows)
{
serialized = string.Empty;
for (int i = 0; i < table.Columns.Count - 2; i++)
{
serialized += row[i].ToString() + separator;
}
row["SERIALIZED"] = serialized;
}
}
}

public static void EndSerializeHelper(DataSet dataSet)
{
foreach (DataTable table in dataSet.Tables)
{
table.Columns.Remove("SERIALIZED");
}
}
}

Please mention that on this example it is not PK of type int. I used recordNumber of type decimal as primary key. Just change all decimals with your type of primary key and the column name of it.
If you like you can delete the tables, which retrieve no changes. Put following code before you return the datasetdiff:

            for (int i = datasetdiff.Tables.Count-1; i >= 0; i--)
            {
                if (datasetdiff.Tables[i].Rows.Count < 1)
                {
                    datasetdiff.Tables.Remove(datasetdiff.Tables[i]);
                }
            }

Thats all. Thx.