Note: This is a repost; the original was published on SharePoint Blogs on Apr 6, 2007.
A few months ago, when we experimented with the content deployment feature of MOSS 2007, we found that article pages containing accented letters in their metadata (such as Title or Publishing Content) were sometimes deployed to the target server with incorrect content. The text on the page appeared to be wrongly encoded. A Google search turned up several forum and blog posts complaining about similar issues.
Most of the users complained that the non-breaking space (&nbsp;) is converted to the character "Â" on the target system. Since our language (Hungarian) has several accented letters that are also converted to strange two-character sequences, this problem is even more frustrating for us.
We decided to investigate the source of the problem and now I would like to share the results with you.
Content deployment is one of the several under-documented features of MOSS. We found that when a content deployment job runs on the source server, the content is exported to the temporary folder you specified on the Central Administration site (Operations / Content Deployment Settings / Temporary Files / Path). The format of the export package seems to be identical to that of the package you get when you call the Run() method of an SPExport object to create an export file. You may want to read Ton Stegeman’s great article on how to do this (Exporting and importing SharePoint 2007 content using the object model).
Since the files created during content deployment are truly temporary, if you have a small site with little content it is not easy to catch the package after the export has finished but before the import completes. To tell the truth, it seems to us that sometimes no files are created at all in this process. Maybe MOSS handles smaller exports in memory, or maybe the file is there but is deleted before we can catch it.
We found that it is much easier to capture the output files with an export than to hunt for the temporary files during content deployment. Because the export and content deployment processes are so similar, we chose the export to generate packages for our investigation.
The export package is basically a .CAB file that contains a relatively large Manifest.xml, some smaller .XML files, and several .DAT files. The .DAT files hold the content itself (be it an .ASPX page, a document, or an image), while the manifest contains the information required to reproduce the site content from the .DAT files, including metadata such as content types and item properties.
We checked the content of the manifest and found that there are two parallel sections for each publishing page.
One of these sections describes the list item object:
<SPObject Id="2c1ad535-3bff-43ce-8b0c-69eef3425fe5" ObjectType="SPListItem" ParentId="fc0d7053-b70e-4603-8fd3-39e5d40533f9" ParentWebId="87493b75-a13c-47cf-9d01-1f52112209b5" ParentWebUrl="/" Url="/Pages/default.aspx">
<ListItem FileUrl="Pages/default.aspx" DocType="File" ParentFolderId="44ca4ae0-0dd3-4705-a460-b4d98b5ea5b4" Id="2c1ad535-3bff-43ce-8b0c-69eef3425fe5" ParentWebId="87493b75-a13c-47cf-9d01-1f52112209b5" ParentListId="fc0d7053-b70e-4603-8fd3-39e5d40533f9" Name="default.aspx" DirName="Pages" IntId="1" DocId="a51a0a90-647b-4d90-be78-42c3fe94caff" Version="5.0" ContentTypeId="0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF390064DEA0F50FC8C147B0B6EA0636C4A7D40015B6CA5925E96F4596C8AE31EF05B195" Author="1073741823" ModifiedBy="1073741823" TimeLastModified="2007-04-04T15:10:15" TimeCreated="2007-03-28T11:52:04" ModerationStatus="Approved">
<Field Name="_ModerationComments" FieldId="34ad21eb-75bd-4544-8c73-0e08330291fe" />
<Field Name="Modified_x0020_By" FieldId="822c78e3-1ea9-4943-b449-57863ad33ca9" />
<Field Name="Created_x0020_By" FieldId="4dd7e525-8d6b-4cb4-9d3e-44ee25f973eb" />
<Field Name="File_x0020_Type" Value="aspx" FieldId="39360f11-34cf-4356-9945-25c44e68dade" />
<Field Name="HTML_x0020_File_x0020_Type" FieldId="0c5e0085-eb30-494b-9cdd-ece1d3c649a2" />
<Field Name="_SourceUrl" FieldId="c63a459d-54ba-4ab7-933a-dcf1c6fadec2" />
<Field Name="_SharedFileIndex" FieldId="034998e9-bf1c-4288-bbbd-00eacfc64410" />
<Field Name="Title" Value="This is the title" FieldId="fa564e0f-0c70-4ab9-b863-0177e6ddd247" />
The other describes the file object:
<SPObject Id="a51a0a90-647b-4d90-be78-42c3fe94caff" ObjectType="SPFile" ParentId="44ca4ae0-0dd3-4705-a460-b4d98b5ea5b4" ParentWebId="87493b75-a13c-47cf-9d01-1f52112209b5" ParentWebUrl="/" Url="/Pages/default.aspx">
<File Url="Pages/default.aspx" Id="a51a0a90-647b-4d90-be78-42c3fe94caff" ParentWebId="87493b75-a13c-47cf-9d01-1f52112209b5" ParentWebUrl="/" Name="default.aspx" ListItemIntId="1" ListId="fc0d7053-b70e-4603-8fd3-39e5d40533f9" ParentId="44ca4ae0-0dd3-4705-a460-b4d98b5ea5b4" TimeCreated="2007-03-28T11:52:41" TimeLastModified="2007-04-04T15:08:46" Version="5.0" SetupPath="SiteTemplates\BLANKINTERNET\default.aspx" SetupPathVersion="3" SetupPathUser="1073741823" FileValue="0000000E.dat" ModifiedBy="1073741823">
<Property Name="vti_cachedtitle" Type="String" Access="ReadOnly" Value="This is the title" />
<Property Name="ContentTypeId" Type="String" Access="ReadWrite" Value="0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF390064DEA0F50FC8C147B0B6EA0636C4A7D40015B6CA5925E96F4596C8AE31EF05B195" />
<Property Name="vti_cachedneedsrewrite" Type="Boolean" Access="ReadOnly" Value="false" />
<Property Name="vti_parserversion" Type="String" Access="ReadOnly" Value="220.127.116.1118" />
<Property Name="vti_charset" Type="String" Access="ReadOnly" Value="utf-8" />
<Property Name="vti_title" Type="String" Access="ReadOnly" Value="This is the title" />
Notice that these sections contain redundant information: the same property value (in this case the Title) is exported both as the value of a list item field and as the value of a file property. Since the values are identical in this case, this causes no problem.
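To make this redundancy easier to check across a whole manifest, the comparison can be scripted. The following is a minimal Python sketch, not part of our original investigation. It assumes a simplified, well-formed fragment: the `SPObjects` wrapper element, the shortened attribute lists, and the function name `find_title_mismatches` are illustrative assumptions. A real Manifest.xml may also declare an XML namespace, in which case the tag names would have to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment modelled on the two sections shown above.
SAMPLE_MANIFEST = """<SPObjects>
  <SPObject ObjectType="SPListItem" Url="/Pages/default.aspx">
    <ListItem FileUrl="Pages/default.aspx" Name="default.aspx">
      <Field Name="Title" Value="This is the title" />
    </ListItem>
  </SPObject>
  <SPObject ObjectType="SPFile" Url="/Pages/default.aspx">
    <File Url="Pages/default.aspx" Name="default.aspx">
      <Property Name="vti_title" Type="String" Access="ReadOnly" Value="This is the title" />
    </File>
  </SPObject>
</SPObjects>"""


def find_title_mismatches(manifest_xml):
    """Return {url: (title_field, vti_title)} for pages where the two values differ."""
    root = ET.fromstring(manifest_xml)
    fields = {}      # Url -> Title list item field value
    properties = {}  # Url -> vti_title file property value

    for obj in root.iter("SPObject"):
        url = obj.get("Url")
        if obj.get("ObjectType") == "SPListItem":
            for field in obj.iter("Field"):
                if field.get("Name") == "Title":
                    fields[url] = field.get("Value")
        elif obj.get("ObjectType") == "SPFile":
            for prop in obj.iter("Property"):
                if prop.get("Name") == "vti_title":
                    properties[url] = prop.get("Value")

    # Only compare pages that have both a list item section and a file section.
    return {
        url: (fields[url], properties[url])
        for url in fields
        if url in properties and fields[url] != properties[url]
    }


if __name__ == "__main__":
    print(find_title_mismatches(SAMPLE_MANIFEST))
```

For the sample fragment, where both values are identical, the function returns an empty dictionary; any page whose two title values differ would be listed with both values.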
But what happens if we use special characters, such as accented letters, in the field value? Let’s test the title with the phrase we use to check applications for Hungarian language compatibility. The phrase is "Árvíztűrő tükörfúrógép", which could be translated to English as "flood-resistant mirror drill". Its words contain all the accented letters that exist in our language.
Setting this phrase as the title of an article page and exporting the content of the Pages folder, we found in Manifest.xml that although the Title field of the list item contains the correct value ("Árvíztűrő tükörfúrógép"), the value of the vti_title property is transformed into "ÃrvÃztÅ±rÅ‘ tÃ¼kÃ¶rfÃºrÃ³gÃ©p". When we restored the package to another SPS web application, the title of the page displayed the correct value, until we manually removed the XML node containing the Title field value from Manifest.xml. After that, the title of the page displayed the converted value. So it seems that normally the Title field value takes precedence over the redundant vti_title property, but if the field value is missing (or the precedence is somehow reversed), the incorrect value may be displayed.
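The converted vti_title value has a recognizable pattern: it is what you get when the UTF-8 bytes of the title are read back one byte at a time in a single-byte Windows code page. This is only an observation about the pattern, not a confirmed explanation of what MOSS does internally. A minimal Python sketch under that assumption follows. The function names are hypothetical, and bytes that Windows-1252 leaves undefined (such as 0x81) are assumed to map to the control character with the same code.

```python
def decode_single_byte(data: bytes) -> str:
    """Decode each byte as Windows-1252, falling back to the same code point
    for the few byte values that Windows-1252 leaves undefined."""
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(value))
    return "".join(chars)


def garble(text: str) -> str:
    """Encode text as UTF-8, then misread the bytes as single-byte characters."""
    return decode_single_byte(text.encode("utf-8"))


def repair(garbled: str) -> str:
    """Reverse garble(): recover the original bytes and decode them as UTF-8.
    Only meant for input that was produced by the misreading above."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            raw.append(ord(ch))  # undefined in Windows-1252, keep the code point
    return raw.decode("utf-8")


if __name__ == "__main__":
    title = "Árvíztűrő tükörfúrógép"
    garbled = garble(title)
    print(garbled)
    print(repair(garbled) == title)
```

Each accented letter takes two bytes in UTF-8, which is why every one of them turns into a two-character sequence; some of the resulting characters are invisible, so the garbled text can look shorter than it really is.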
Using the export method, we were able to reproduce the non-breaking space conversion problem too. If we set the HTML source of the page content to "non&nbsp;breaking&nbsp;space", the Manifest.xml in the export package contains "nonÂ breakingÂ space" for the PublishingPageContent file property value and "non breaking space" for the PublishingPageContent list item field value. In the list item field value, the spaces are not standard spaces (character code 32) but non-breaking spaces (character code 160).
We have not yet found out why the normal behavior, in which the correct values take precedence over the incorrect ones, sometimes breaks down during content deployment. Instead, we concentrated on why and how the incorrect values are produced in Manifest.xml. The next part of this article will focus on this topic.
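Because a non-breaking space looks exactly like an ordinary space on screen, it helps to inspect the character codes directly. The short Python sketch below is an illustration, not taken from the export itself. The helper name `space_codes` is hypothetical, and the last step assumes, without confirmation, that the "Â" appears because the two UTF-8 bytes of the non-breaking space are read as two single-byte characters.

```python
def space_codes(text):
    """Return the character codes of all whitespace characters in text."""
    return [ord(ch) for ch in text if ch.isspace()]


plain = "non breaking space"                  # ordinary spaces
field_value = "non\u00a0breaking\u00a0space"  # non-breaking spaces

print(space_codes(plain))        # ordinary space is code 32
print(space_codes(field_value))  # non-breaking space is code 160

# In UTF-8 the non-breaking space is stored as the two bytes C2 A0.
print("\u00a0".encode("utf-8"))

# Reading those bytes as Windows-1252 gives "Â" followed by a non-breaking space.
property_value = field_value.encode("utf-8").decode("cp1252")
print(property_value)
```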