For enterprises implementing a content or document management system, converting legacy Word documents into a more open and structured format like XML can be a difficult task.
In a Webcast hosted Thursday by Tokyo-based JustSystems, an information management software provider, the company advised enterprises to start potential Word-to-XML transitions by carefully weighing whether the conversion is necessary and by dividing legacy content into categories.
“One of the most important things to do is involve the team, making sure you have your subject matter experts in place, which includes technical writers and developers that understand structured authoring,” Jeff Deskins, principal consultant at JustSystems, said. “Things are going to change, as moving from unstructured content to structured content provides you with new processes, new technologies and new policies that need to be in place.”
Deskins said that good planning will help control costs and, ultimately, reduce potential team frustrations. He said the earlier that enterprises prepare to manage the changes, the better the project tends to go.
“Companies need to look at how much legacy data they’re interested in converting, do they want to do everything or just the information that is frequently used,” Deskins said. “You want to look at how much content you have in comparison to the product lifecycle support that is needed for that information. So, maybe if you have something that has a short shelf life, you will be OK keeping that in its old format.”
According to Rizwan Virk, co-founder and chief technologist of XML Publishing at CambridgeDocs, once companies have decided to convert via automated, topic-based authoring, they need to spend the majority of their focus on categorizing their documents.
“By dividing your content into categories, you can figure out which parts can be easily converted in an automated fashion and which parts cannot be,” Virk said.
Depending on the volume of the content, enterprises may choose to look at all the content or a random sampling, Virk said. In an ideal situation, a company would only have a few thousand pages it wanted to convert and could easily read through them to categorize them.
“On the other hand, I remember we had one pharmaceutical client that had 40,000 different Word documents that it needed to convert to XML,” Virk said. “So we just basically had to pick and choose samples from each content category.”
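The sampling approach Virk describes, picking a handful of documents from each content category rather than reviewing all of them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the (path, category) input shape and the sample size of five are hypothetical and were not part of the Webcast.

```python
import random
from collections import defaultdict


def sample_per_category(docs, per_category=5, seed=0):
    """Pick up to `per_category` random documents from each category.

    `docs` is an iterable of (path, category) pairs. The categories are
    assumed to have been assigned already, for example by a subject
    matter expert; this sketch does not attempt to infer them.
    """
    by_category = defaultdict(list)
    for path, category in docs:
        by_category[category].append(path)

    rng = random.Random(seed)  # fixed seed so the review set is repeatable
    return {
        category: rng.sample(paths, min(per_category, len(paths)))
        for category, paths in sorted(by_category.items())
    }
```

Sampling within each category, rather than across the whole collection, keeps small categories from being missed when one category dominates the archive.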
Consistency in formatting and structure is also important in the categorization process and, according to Virk, could be another determining factor in deciding whether conversion is a viable option.
Virk said that authors using Word documents don’t always apply the styles and formatting that they should. And when doing an automated conversion, consistency of style and formatting across multiple documents is key to success.
“If you have a thousand pages and they’re all formatted differently, then automation might not be such a great idea,” Virk said. “On the other hand, if you have 10,000 pages and they’re all formatted fairly similarly, then you can do automation with minimal cleanup afterwards.”
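One rough way to put a number on that consistency is to measure how many paragraphs in each document use a style from an agreed list, then route the well-behaved documents to automation and the rest to manual review. The Python sketch below assumes the paragraph style names have already been extracted from each Word file by some other tool; the function names, the approved-style list and the 0.9 threshold are hypothetical and are not figures Virk gave.

```python
def style_consistency(paragraph_styles, approved_styles):
    """Return the fraction of paragraphs whose style is on the approved list."""
    if not paragraph_styles:
        return 0.0
    matching = sum(1 for style in paragraph_styles if style in approved_styles)
    return matching / len(paragraph_styles)


def split_by_consistency(corpus, approved_styles, threshold=0.9):
    """Split documents into automation candidates and manual-review cases.

    `corpus` maps a document name to the list of style names used by its
    paragraphs. The 0.9 threshold is an arbitrary starting point to tune
    against sampled documents, not a recommended value.
    """
    automate, manual = [], []
    for name, styles in corpus.items():
        score = style_consistency(styles, approved_styles)
        (automate if score >= threshold else manual).append(name)
    return automate, manual
```

A document where every paragraph uses a style such as "Heading 1" or "Normal" would score 1.0, while one where half the paragraphs use ad hoc styles would score 0.5 and land in the manual pile.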
The conversion itself, which some believe to be the heart of the matter, is actually fairly simple if planning and categorization have been properly addressed, Virk said.