September 17, 2010 by Alistair Deneys
It’s inevitable: whenever you talk to techies new to Sitecore, at some point someone is going to ask to look inside the database at the database schema. I always cringe when they ask, because I know where they’re headed. The thing with Sitecore is that you have an abstracted data model you can think in terms of. You don’t (and probably shouldn’t) think about the data in terms of what is in the database. Don’t get me wrong, it’s good to understand the database, but don’t limit your thinking to it.
This abstracted data model in Sitecore is the content tree. Breaking it down into its constituent components, you have the item, which contains a collection of fields. Realistically, that’s as deep as you need to go into Sitecore’s data model. It’s more productive to think in these terms than in terms of the database tables.
In fact, the database schema in Sitecore is modelled on the content tree, not the other way round. I guess this is a throwback to the old days when Sitecore stored its content not in a database but in an XML file. And you can see evidence of this in the database schema.
One thing that sets Sitecore apart from many other CMSs is that it has a really good API. Anything Sitecore itself does in the CMS you can do through the API, including creating and updating items in the content tree. And you really should use the API whenever you need to interact with the data in Sitecore.
There are many benefits to using the API as opposed to accessing the database directly using SQL. To name just a few:
Database Independence
The Sitecore data model provides a pluggable interface that allows storage and retrieval of content from any data store, including different RDBMSs. Out of the box, Sitecore contains data providers for Microsoft SQL Server, Oracle and MySQL. If you write any code directly against the database, you limit which databases you can attach your site to. Better to use the API and keep the ability to swap to a different RDBMS.
Caching
The data providers contain internal caching to speed up your queries and prevent unnecessary calls to the database server. If you connect directly to the database, you’ll have to implement your own caching.
More Efficient Coding
Thinking about the content at a higher level, without getting bogged down in the fine detail of its implementation, lets you think more clearly about what you’re doing. The code is also a lot simpler. Here is a quick example of how to create a new item in Sitecore and populate its fields.
var db = Sitecore.Configuration.Factory.GetDatabase("master");
var target = db.GetItem("/sitecore/content/home");
var template = db.Templates["My Template"];
var item = target.Add("New Item", template);
item.Editing.BeginEdit();
item["title"] = "The Title";
item["text"] = "The Text";
item.Editing.EndEdit();
Now compare that to the SQL you’d need to write for SQL Server.
declare @parentid uniqueidentifier, @id uniqueidentifier

select @parentid = ID from Items
where Name = 'Home' and ParentID =
    (select ID from Items where Name = 'content' and ParentID = '11111111-1111-1111-1111-111111111111')

set @id = NEWID()

insert into items values (@id, 'New Item', '76036F5E-CBCE-46D1-AF0A-4143F9B557AA', '00000000-0000-0000-0000-000000000000', @parentid, GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, '75577384-3C97-45DA-A847-81B00500E250', 'The Title', GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, 'A60ACD61-A6DB-4182-8329-C957982CEC74', 'The Text', GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, '5DD74568-4D4B-44C1-B513-0AF5F4CDA34F', 'sitecore\admin', GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, 'BADD9CF9-53E0-4D0C-BCC0-2D784C282F6A', 'sitecore\admin', GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, '25BED78C-4957-4165-998A-CA1B52F67497', '20100907T080000', GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, 'D9CF14B1-FA16-4BA6-9288-E8A174D4D522', '20100907T080000', GETDATE(), GETDATE())
insert into versionedfields values (NEWID(), @id, 'en', 1, '52807595-0F8F-4B20-8D2A-CB71D28C6103', 'sitecore\admin', GETDATE(), GETDATE())
The C# code is much more readable. And in fact, the SQL above is somewhat simplified, because I cheated and used the Sitecore UI to grab those field GUIDs rather than using more queries to find them.
But as with anything in development, the approach you take…depends. I’ve just given all these reasons why you should use the API. Under what condition could you possibly still consider a direct SQL approach? Data migration.
Data migration is a somewhat unique activity on a project. It’s normally done once at the start of the project and isn’t ever used again for the normal life of the solution. That means you’re not likely to have to come back and update or maintain the code, so all the best practices can be thrown out the window in favour of a quick and dirty solution that just gets the job done.
So why do people want to use a direct SQL approach? Speed. It makes sense that accessing the database directly would be quicker than going through another layer of indirection: the API. But what performance gain can we actually expect from a direct SQL approach? I’ve done some testing to find out.
I’ve had to make some assumptions to frame the testing approach I used. I assumed the data for the migration would be in a file. I could iterate over each record in the file and either call the appropriate APIs to create the item, or generate TSQL on the fly. The data being imported includes links to reference items, which must be looked up and, if they don’t exist, created. This makes the test a more realistic example of what a real migration would have to do.
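To make that setup concrete, here’s a rough sketch of the API side of the import loop. ReadRecords, the record properties, the paths and the field names are hypothetical stand-ins for whatever your source file and templates actually contain; only the Sitecore calls mirror the earlier example.

```csharp
// Hypothetical sketch: ReadRecords, record.Name, record.Title and
// record.Category are placeholders for your own import file format.
var db = Sitecore.Configuration.Factory.GetDatabase("master");
var target = db.GetItem("/sitecore/content/home");
var template = db.Templates["My Template"];
var refFolder = db.GetItem("/sitecore/content/references");

foreach (var record in ReadRecords("import.csv"))
{
    // Look up the reference item; create it if it doesn't exist yet.
    var reference = db.GetItem(refFolder.Paths.FullPath + "/" + record.Category)
        ?? refFolder.Add(record.Category, template);

    var item = target.Add(record.Name, template);
    item.Editing.BeginEdit();
    item["title"] = record.Title;
    item["category"] = reference.ID.ToString();
    item.Editing.EndEdit();
}
```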
In addition to pitting direct SQL against the API, I also split testing of the SQL approach in two. Since I’m generating TSQL on the fly, I have the opportunity to run it in batches rather than item by item. Batching saves network trips to the database server, but in a real-world scenario you might prefer not to batch, so that a single failing record doesn’t fail the whole batch.
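For the batched variant, the generated statements can simply be accumulated and executed in groups with plain ADO.NET. In this sketch, BuildInsertSql is a hypothetical helper that returns the INSERT statements for one record, and the batch size of 50 is an arbitrary choice, not anything Sitecore prescribes:

```csharp
// Accumulate generated TSQL and execute it in batches of 50 statements.
const int BatchSize = 50;
var batch = new System.Text.StringBuilder();
var count = 0;

using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
{
    conn.Open();
    foreach (var record in records)
    {
        batch.AppendLine(BuildInsertSql(record)); // hypothetical helper
        if (++count % BatchSize == 0)
        {
            using (var cmd = new System.Data.SqlClient.SqlCommand(batch.ToString(), conn))
                cmd.ExecuteNonQuery();
            batch.Length = 0; // reset for the next batch
        }
    }
    // Flush whatever is left over after the last full batch.
    if (batch.Length > 0)
        using (var cmd = new System.Data.SqlClient.SqlCommand(batch.ToString(), conn))
            cmd.ExecuteNonQuery();
}
```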
For each of the approaches tested (SQL single, SQL batch, API) I took the average time to create a number of items and fill in their fields, ranging from 25 items up to 2000. And here are the results.
Wow! I wasn’t expecting that. My prediction before I started this testing was that direct SQL would always perform better; I was just unsure how big the gain would be. As you can see from the above results, that’s not the case. The API actually performed better than the direct SQL approach for larger numbers of items. The only thing I can attribute this to is the caching the data provider does. The reference items especially benefit: using the API, even though I ask it to find the reference item for every imported item, it’s already cached in the data provider, so there’s no round trip to the database server, which saves time.
So this raises the question: could I optimise the TSQL I’m generating to perform faster than the API? Perhaps. One area where the API performs better is the reference items. The data provider cache is shared across all the import operations, whereas with SQL any variables I create only last for that query (single or batch). This means over the entire import operation I spend more time looking up reference items than if I use the API.
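If I did want to close that gap, the obvious move is to mimic the data provider by caching reference lookups in the import tool itself. A rough sketch, where LookupOrCreateReference is a hypothetical helper that issues the actual SQL:

```csharp
// Cache reference item IDs for the lifetime of the import run, so each
// reference is resolved against the database at most once.
private static readonly Dictionary<string, Guid> ReferenceIds =
    new Dictionary<string, Guid>();

private static Guid GetReferenceId(string name)
{
    Guid id;
    if (!ReferenceIds.TryGetValue(name, out id))
    {
        id = LookupOrCreateReference(name); // hypothetical helper issuing SQL
        ReferenceIds[name] = id;
    }
    return id;
}
```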
But I think if it comes down to hand-optimising your SQL to make it faster, you’re not going to get a big enough performance increase to justify not using the API. You may spend a considerable amount of time optimising your SQL only to find you’re at best marginally faster than the API. Don’t waste your time! And how many hundreds of hours do you think Sitecore has sunk into developing and testing the API and the TSQL it generates? Do you really think you’ll find something they missed in the few hours your project has allocated to this activity?
So which approach should you use? Yes, the direct SQL approach runs faster for small numbers of items, but how much additional time are you spending writing that TSQL rather than using the very nice API which is available to you? In my opinion, you should always use the API for your data tasks. But if you do happen to use SQL, don’t forget to clear your caches so you’ll see the data.
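For example, the sledgehammer is a single call; this assumes the Sitecore 6-era caching API, and on a busy site you may prefer clearing only the caches of the database you touched:

```csharp
// Blow away all Sitecore caches so items inserted via SQL become visible.
Sitecore.Caching.CacheManager.ClearAllCaches();

// Or, more selectively, clear just the modified database's caches.
var db = Sitecore.Configuration.Factory.GetDatabase("master");
db.Caches.DataCache.Clear();
db.Caches.ItemCache.Clear();
db.Caches.ItemPathsCache.Clear();
```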