How do you rate the information presented on your website? Is it truly for the visitor? Do your pages provide up-to-date information that’s relevant to the visitor, regardless of where the information originated? It should!
All too often, the Web is seen as a static publishing platform, a place for whatever marketing messages managers want to communicate. Content and design are often based on the publisher’s perspective, rather than focused on end-users’ needs. With good information architecture, your website can present content in a structured manner and provide a decent, even delightful, user experience. By marking up information semantically, you’ll be able to integrate different information sources to create a website that is always up-to-date and relevant.
Information architecture is about how information is organized and made accessible so as to be useful. Information is only useful when someone needs it; it’s valuable at the point of need. So how do you design your own sources of information, and use the data that others share, to achieve a goal? As modern and wondrous as the Web is, it’s a bit behind other information management systems. In my opinion, this is because the Web (and especially intranets) has, until recently, been regarded as a showcase rather than a natural part of business. Websites have often behaved like brochures, and have not even been as functional as a mail-order catalog or a self-checkout in a supermarket. Now though, C-level executives realize the Web can add significant value. For a website to make a difference to business, to customer behavior, and to the bottom line, it requires good information architecture.
Glossary – metadata, or, descriptive metadata
Metadata is data about data: information describing other information. It may be, for example, categorization or keywords that describe the content. The word meta comes from Greek and means after or beyond, and is often used in conversation to describe something that is self-referential. Metadata is usually a well-defined label at a higher abstraction level than the data it describes.
Content choreography is about how to reuse content and control the flow of information based on data and metadata. If information is not free, in every sense of the word, it is not easy to reuse. Technical aspects may play into how free and reusable information is. However, I would bet that the biggest problem is governance: the lack of knowledge about your information, the failure to set proper requirements for metadata, and the lack of guidance on how content should be used by others later, in other contexts. We often fail to think about the long-term life cycle of content. There’s great pressure to create and publish content, but long-term content strategy is vital.
The mobile context is a common challenge these days. How do we provide the right content for the user’s device without unnecessary duplication? There are technical barriers when using HTML to mark up and describe web content. Mobile websites and mobile apps require different approaches. Therefore, it is not quite as simple as letting the web content management system act, in an identical way, as the source of all the content for a mobile app. The content often needs to be a little different on a smaller screen to be of real use, and design conventions are not identical. For example, blue underlined text is not as obviously a link in an app as it is on the Web.
Glossary – tags, keywords, labels
One or more individual words that describe or highlight content.
As you will have noticed, many websites supplement the usual menu-navigation with other means to find similar content. It is particularly common for websites built using the WordPress CMS (Content Management System) to have categorizations and tags, which not only partly describe content but also provide a link to a list of similar content within the same site. Tags provide a complementary navigation system and surface thematically similar content. But just as importantly, such labels also create categorization and structure.
Glossary – taxonomy
A classification for systematically grouping things according to similarity or origin. It is what Carl Linnaeus did in the 18th century to describe a plant’s place in nature based on its properties.
When shopping online, you often see that the menu seems to be a mix of a regular static menu and something that is driven by tags. You can often find a product in multiple listings based on manufacturer, type of product, color, size, and more. Tags can also be hidden from visitors to a website and have more of an operational internal use. I have been using hidden tags to suggest a priority level for the annual update of texts, such as ‘priority2’ to indicate that something is of secondary priority. It has worked as a sort of internal memorandum to those entrusted to review and maintain internal information. The need for a taxonomy may not be obvious, but at least tagging is self-explanatory, since it is common on the public Web. The main point of tags is to use content dynamically and to make it easier to find and reuse later on. Precise and shared understanding of what each tag means makes tags easier to select and reuse. If the exact meaning of your tags is not explicit or commonly understood, you should publish a taxonomy. It is never too late to start tagging existing content (assuming you have the rights and ability) as even archived material can be tagged without altering the original content. In some cases, you can automate the tagging based on other information that already exists – the presence of keywords in the body text, for example.
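The automated tagging mentioned above can be sketched very simply: scan the body text for known terms and suggest the matching tags. The keyword-to-tag mapping below is invented purely for illustration.

```python
# Minimal sketch of keyword-based auto-tagging (illustrative only):
# if a known term appears in the body text, suggest its tag.
KEYWORD_TAGS = {
    "salary": "payroll",
    "invoice": "finance",
    "vacation": "leave",
}

def suggest_tags(body_text):
    text = body_text.lower()
    return sorted({tag for word, tag in KEYWORD_TAGS.items() if word in text})

print(suggest_tags("Your salary is paid on the 25th; vacation requests close Friday."))
```

A real system would use a richer mapping, perhaps with stemming or a taxonomy lookup, but the principle of deriving tags from existing information is the same.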
Information that cannot be reused will likely be copied to, or recreated in, a new system to be made available. Then you have at least two versions that might end up being referenced, and that ideally need to be updated when changes occur. Content choreography tries to address this problem, to make sure that valuable information is agile, versatile, and useful in all the necessary contexts and information systems. Sometimes it is easier to pinpoint the challenges if you look at examples where something has gone wrong. That is exactly what we will do now.
Examples of poor content choreography
The classic example, I think, is an information system designed by an unrepresentative minority of users, or worse, by system engineers who will never use the system. Let us say that the HR department needs a new HR system. The requirements are listed, and several systems are reviewed. The winner is a system that has a feature the supplier named ‘self-service’. According to the supplier, it is a convenient entrance into the HR system through which all employees can report their worked hours, apply for leave, choose benefits, and more. The problem here is that the system is primarily designed for people who are experienced in HR matters and terminology. It has been designed to appeal to stakeholders and budget holders from within HR. It has not been designed for the workforce of field workers, store workers, factory floor workers, mobile-only salespeople, and (presumably) digitally savvy knowledge workers.
This focus on budget holders rather than end-users is often unavoidable, especially with enterprise software, and results in frustrated employees who waste time on an arcane system they would have avoided if they had the choice. Instead of all this, the HR department should have developed the system based on requirements from user research, and also defined how the system had to interact with other information systems. Also, considering the (usual) clunky interface, it would have been worth developing specialized interfaces in the system, allowing people to perform specific tasks via the intranet or an app without having to worry about ‘the HR system’ itself.
I faced a poor and irritating interface when I tried to edit the dates for my leave of absence. There was no ability to edit existing entries, so I was forced to delete the existing entry and start the absence request process all over again. The morning after, an angry HR advisor called me to ask why I had deleted the work schedule she had created for me. I could not see the work schedule while I was attempting to edit my original request – the system did not do anything to help. My choice would have been to initiate the request via an online form on the intranet – and amend it similarly. What happens once the form is completed should not be the individual’s concern; the system and the workflows should take care of the data, approvals, and impact.
The intranet offers HR-related material that I don’t have permission to view, and links to the HR system that I can’t access without having previously logged on.
Instead of an activity-driven intranet with an underlying supporting information model, a specialized system is offered for each administrative task. You have a massive HR system, a clunky old room booking system, a claim expenses system, and a third-party benefits system. Every system requires a separate log-on, often using a different username.
Of course, this type of problem is not limited to the places I’ve worked in; it’s fairly common across organizations of every kind. The evolving digital workplace offers a multitude of systems for each task. This in itself is a bit of a problem.
For bureaucrats such as myself, it is not unusual to have several document management systems for different projects. Whenever anyone needs a document, they can never be certain where to start looking. Enterprise-wide search can help, but results can be overwhelming. Further, it’s highly likely that one or more of the older document systems will be phased out one day, or that multiple document systems will be consolidated into a new one. This all creates work and taxonomy conflicts.
When systems are rolled-out without concern for integration of workflows, people have to switch between several different systems to complete relatively simple tasks. Even the media report on administrative burdens, exclaiming that doctors don’t have time to see their patients as they are contending with poorly designed IT.
Without integration, each separate system remains ignorant of previous steps the user has taken, encumbering the employee with the need to re-enter information time and time again. Different systems often follow different input standards. For example, one system might cope well with spaces or dashes within a social security number, another will accept only spaces, another provides room for only numerals. This is just one of many things the user has to remember, when alternating between systems. The cognitive load, the expertise needed just to type into a form, is excessive. Nor can we expect to log in once, as the systems do not share permissions or user credentials.
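The inconsistent input formats described above are exactly the kind of problem a shared normalization step can remove. Here is a hedged sketch, using a hypothetical identifier format, of turning whatever the user types into one canonical form before it reaches any downstream system.

```python
import re

# Hedged sketch: normalize a (hypothetical) ID entered with spaces or
# dashes into digits only, so every downstream system receives one
# canonical form instead of remembering each system's input quirks.
def normalize_id(raw):
    digits = re.sub(r"[\s-]", "", raw)
    if not digits.isdigit():
        raise ValueError("ID may contain only digits, spaces, and dashes")
    return digits

print(normalize_id("19800101 1234"))   # -> 198001011234
print(normalize_id("19800101-1234"))   # -> 198001011234
```

Pushing this tolerance to the edge of each system, instead of onto the user, is what removes the cognitive load the text describes.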
The icing on the cake is that often the systems have different rules for password complexity too, so people not only have to have different passwords for different systems, but passwords that are constructed in idiosyncratic ways. Undoubtedly this poor user experience causes stress and even poor security behavior, as when people write down their passwords in a list on their desktop. The solution is often simple and obvious – we should focus on the user’s experience first. Again, we need to think long-term when investigating user experience. We need to consider all the contexts that the system will be needed within, now and in the future. More than this, we have to consider the life cycle of the information that the system processes. Frankly, the information will probably out-live the system, and so portability is crucial. Structured content, and import / export capabilities are a ‘must have’. All this common sense is not yet common practice…
Good content choreography can be seen when:
- All content is described using well defined metadata.
- The system adapts to the user’s process and needs – not the other way around.
- It feels like you only have a single system.
- It is never necessary to enter data more than once.
- The information is relevant based on the recipient’s past activity, preferences, location, and other personal factors. The right information is available at the right time to satisfy a particular need.
- Information follows a given format; dates, for example, would preferably follow the international standard, ISO 8601.
- Related information is suggested, or easy to find based on context.
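The point about standard formats is easiest to see with dates: store them in ISO 8601 and localize only when displaying. A minimal sketch:

```python
from datetime import date

# Store dates in the international standard ISO 8601; the stored form is
# unambiguous and even sorts correctly as plain text.
payday = date(2016, 1, 25)
stored = payday.isoformat()
print(stored)                        # 2016-01-25

# Any system can re-parse the stored form and format it for its audience.
parsed = date.fromisoformat(stored)
print(parsed.strftime("%d %B %Y"))
```

The same principle applies to any reference data: one canonical stored form, many presentations.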
Now we will explore how to control our most precious information.
Master Data Management prevents unnecessary duplication
Glossary – Master Data Management (MDM)
The systematic work to keep track of an organization’s reference information. Consider a large directory of ever-changing supplier details, customers’ order histories, or the company’s financial records.
Public websites often display information that does not originate from their own content management systems. The information could be customer data, product information, calendared events, etc. This information may be collected from internal enterprise systems, like a customer relationship management system or an accounting system. For products, real-time data (including supply levels) can be fetched from a supplier’s system. For an intranet, it is common to have information about stocks, the consumer price index (adjusting for inflation in society), payroll dates, and the like. Such data is called reference data, and Master Data Management (MDM, or an MDM system) is responsible for handling it in accordance with applicable standards, regulations, and internal policies.
When you create or access a new source of reference data, ask yourself if it is acceptable to have to copy content manually into your website, or if the data source can be integrated with your existing systems.
All information has a life cycle, a fact that seems hard to remember when you want to make a quick fix; you are unlikely to foresee the systemic problems and extra work created by tactical fixes and undocumented changes.
There are advantages to manually copying information from the source into your website’s CMS. You get to control just how and when the content gets published, and you can lay out the text, images, and other media exactly how you fancy using common publishing software and well-known code like HTML and CSS5. The downside is of course that you have made a copy, and that the original information will no doubt change or require revising, making the copy on your website out-of-date until you edit it in line with the fresh information. I think we have all stumbled across hysterically outdated information on the Web. As a new subscriber to Macworld magazine, I wondered when the first issue would be delivered to me; I googled it, chose the first search result, and got the publication plan for the editions two years prior. I did not find a link to more up-to-date information.
Information becomes outdated and misleading relatively quickly. Moreover, a responsibility vacuum often occurs between the original owner of the information and the person who published it on the website.
The advantage of full integration between the website and data sources is that you can design the information to always be up-to-date, without ongoing effort. Something akin to ‘create once, publish everywhere’ (COPE)6, defined by National Public Radio (NPR). The downside is that it takes more effort in the short-term. In some cases it is prohibitively expensive, but over the long-term it can sometimes be the only sensible choice.
The online movie database, IMDb, is an example of thoughtful integration. For me, as a Swede, the middle of Ed O’Neill’s biography shows the Swedish title for the series, ‘Våra värsta år’ (Lit: ‘Our Worst Years’), instead of the English title ‘Married with Children’. In other words, IMDb has created a link between its textual content (often only in English) and its master data. Embedded in their content, they bridge a small cultural barrier by assuming that different nationalities prefer localized titles.
The intranet I most frequently used was the one at Region Västra Götaland. It serves 50,000 employees with information about the multifaceted organization and tries to support everyday work. Before its deployment, when the Accounting and Human Resources departments were still decentralized, there were lots of uncoordinated local intranet pages. These local intranet sites rarely provided unique, local level information, but rather, duplicated corporate material. Payroll dates, for example; there were many and various pages stating what day of the month your salary would be paid on.
How many of these pages were updated with new dates each year, do you think? Not many, unfortunately! Intranet editors are unlikely to feel enthusiastic about manually updating multiple pages with information that’s already up-to-date in another system. When intranets fill up with duplicated and conflicting information that is clearly out of date, trust in the intranets diminishes. Without being able to trust published information, employees work harder to get and validate information from accountable people. This can be great for those who like to rely on their network of colleagues, but is inefficient and downright wasteful when considered across the whole organization.
Have you ever come across an intranet that was so well managed it always had the up-to-date information you needed? Most intranets are not getting the care they deserve. The equivalent for a public website might be to advertise a product that is no longer available anywhere in the supply-chain. If you order something and get the message that it is no longer available, your trust in the company probably diminishes.
If your website exists to make money, it is crucial to keep customers focused on buying, and make it easy for them to pay, considering how easy it is to seek out a competitor. Providing information on stock and expected delivery times encourages the customer to feel confident in their choice. You need to manage your customers’ expectations.
To return to the intranet example about payroll dates; we would preferably have gathered this reference data from a single data source, and enabled any publisher to display it without manual duplication. We would probably do this with a so-called widget, a small box that pulls data directly from elsewhere around the intranet, the Internet, or from integrated systems, and displays it on the page. The news on the intranet’s home page might be displayed by a widget, as might the social stream, the ‘latest discussions’ list, and the up-coming events calendar.
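In essence, such a widget is just a function that reads from the single source and renders the result. A hedged sketch, with the JSON feed format entirely invented for illustration:

```python
import json

# Sketch of a 'widget': one function reads payroll dates from a single
# (hypothetical) JSON feed and renders them as HTML, so no page keeps
# its own copy of the dates.
def render_payroll_widget(feed_json):
    data = json.loads(feed_json)
    items = "".join("<li>{}</li>".format(d) for d in data["payroll_dates"])
    return "<ul class='payroll-dates'>{}</ul>".format(items)

feed = '{"payroll_dates": ["2016-01-25", "2016-02-25"]}'
print(render_payroll_widget(feed))
```

When the source changes, every page embedding the widget is up-to-date at once, which is the whole point.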
Web and intranet editors probably have better things to do than to keep track of the timeliness of all the copies of information they’ve published over the years. If you are talking to an IT consultant, or your own IT department, they will certainly offer many ideas for robust master data management. They will surely mention terms like Enterprise Service Bus (ESB) to shuffle the information around between all the involved systems. If the organization does not have other reasons to use an ESB, then it is probably smarter to use APIs7 and open data. First, we need to address the concept of metadata. Without metadata, information is less useful, usable, and valuable.
The importance of marking up information with metadata
Metadata is used in almost every conceivable context. It can categorize a document’s content to let people know if something is worth reading, and it can also be the basis for website navigation or using keywords to make a web page easy to find when searching.
Some people seem to associate the term metadata with keywords, as in the words you use in queries on search engines. It is not necessarily wrong, but metadata is really all the information that summarizes, describes, or categorizes the main information. Metadata can classify the substance of a text, but can just as easily be the geographical coordinates of where a photo was taken. Metadata is some kind of descriptive labeling attached to the pertaining information. Look at, for example, ordinary price tags. They usually show the currency, a figure for the price, and the product name or description.
Metadata also tends to act as the table of contents for information. Without metadata, and the effective use of it, we cannot make the most of an information system’s potential. With good metadata, it is easy to find our way even within enormous amounts of information. If you do not curate the information with well thought-out metadata, you face the risk that the information will not be used, or reused, or contribute to anything of value. Making use of information beyond its original purpose wrings more value from it, which makes practical sense when you consider the costs involved in its creation. Failing to reuse existing information is primarily down to how difficult it can be to find.
Metadata can store synonyms for the readable content, or concepts more abstract than those mentioned in the user-facing content, to help computers understand the meaning of the content and provide users with navigation routes. Metadata can assure you that you have found the item you were looking for. The labels, the author’s credibility, the date, and the origin all contribute to your confidence in the main content.
If you were to create a new entry in an MDM (Master Data Management) system, how would you label the payroll dates? Assume integration with the payroll system cannot be achieved. An MDM system should be a role model for information management, its use of descriptive metadata and its flexibility in integration with other systems.
Metadata suggestions for you to reflect on:
- Title: Payroll dates
- Information type: Master data
- Information series: Common reference info
- Update: Ongoing
- Metadata manager: John Doe
- Target audience: All employees
- Validity: 2015-01-01 onward
- Keywords: salary, wage, payroll, pay day, payday, 2016, dates
Is anything missing? Is it possible to misinterpret the described content? It is a good idea to involve some colleagues and talk about the terminology. People tend to see things differently and associate things with different words. This is where keywords are useful. Other, authorized systems should now be able to subscribe to and fetch the actual payroll dates along with the metadata.
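The record above can be thought of as structured data, with the keywords acting as search hooks. A sketch, where all field names are assumptions for illustration:

```python
# Illustrative sketch: the payroll-dates entry as a structured record,
# with a simple keyword search over the metadata. The field names are
# assumptions, not a real MDM schema.
record = {
    "title": "Payroll dates",
    "information_type": "Master data",
    "target_audience": "All employees",
    "keywords": ["salary", "wage", "payroll", "pay day", "payday", "2016", "dates"],
}

def matches(rec, query):
    q = query.lower()
    return q in rec["title"].lower() or any(q in k for k in rec["keywords"])

print(matches(record, "payday"))   # True
print(matches(record, "pension"))  # False
```

Notice how the synonyms your colleagues suggest translate directly into extra entries in the keywords list, making the record findable under more terms.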
At the moment of content creation, it is not easy to foresee the many ways people will try to find it. Considering the overwhelming amount of information choices, we must make use of structured metadata so that the metadata itself can function as exclusion filters. Also, unstructured metadata, such as a couple of descriptive keywords, is great for those using a search function.
Are you going to let users add metadata to information they themselves did not create? To add keywords, related items, or suggest other titles, for instance? It may be a good idea as long as the contributions are separated from the creator’s content.
Advantages of allowing anyone to contribute metadata:
- The content creator may not use the same terminology as those searching or using it. Interview any expert on what they do and you will hear unfamiliar words. A greater variety of (relevant) synonyms makes the information easier to find with a search or via keyword-based navigation.
- Those who contribute to content (even in small ways) are more likely to use and value it.
- People’s skills may surface. The organization may discover hidden talents and find that some employees are more versatile than previously thought.
When content is easier to find, there’s less chance that competing near-copies will be produced. When content is well used, there’s more reason to keep it up-to-date.
One of the goals when adding collaborative elements to information systems is decentralizing the care or curation of information. When more people are able to collaborate, there’s the potential that those who truly need and use the information will organize it to suit them.
If you are lucky, there is already an established metadata specification for you to embrace. One that tells you exactly which fields must exist, which ones are required and which are optional. Think about it! A standard that makes your metadata compatible with other data sources. Did I hear an amen?
Metadata specification makes your data more standardized and interchangeable
You encounter metadata standards more frequently than you are aware of; a standard isn’t always a formal document explaining everything. Think of all the structured data you see every day that give you details or short facts. You often come across labels, perhaps with a colon and then the content itself, like the contents page of a book, or ingredient lists for meals or recipes.
Glossary – metadata specification or structural metadata
For metadata to be compatible with other data sources, you need to use a metadata specification, a standard. One such standard is Dublin Core8. It specifies the metadata structure and which data to enter.
Long before the invention of printing, written work needed organizing. The library of Alexandria was one of the first places to centralize knowledge. Egyptian royalty wanted to own copies of all written works that crossed their borders. All knowledge was considered important, and the librarians had a mandate to secure and copy scrolls and books from around the world on almost every topic.
They needed a system to organize their large and growing collection; they needed a way to describe any scroll, and logically arrange these on shelves. New works, perhaps academic texts, needed to refer to individual scrolls. This is the subject of standardized metadata; simple, clear descriptors to aid locating and understanding specific matters. Look at any book – the back cover and inside cover describe the contents of the book. The contents page lists the chapters, and the index (if available) at the back records the location of important terms or topics. Often you will find:
- The title of the book.
- Whoever wrote it and who contributed.
- Which edition you hold in your hand.
- When it was created.
- ISBN9 identification so that you can identify the book and order another copy.
- What entity published the book.
- Who holds the copyright.
This is the standardized metadata about any book, which makes it easy to identify a specific book with certainty. The advantage of a metadata standard is that everyone can understand how to use the metadata descriptors and exactly what is being described.
The simplified version of the Dublin Core standard contains fifteen elements to describe a work, namely: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights.
Have you ever read the underlying HTML code of a web page? Sometimes you will see metadata, as below, where Dublin Core (DC) is used to both classify and describe the page:
<meta name="DC.Publisher" content="Marcus Österberg" />
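A page will typically carry several such elements, not just one. A small sketch that generates DC meta tags from a dictionary of element/value pairs (the example values are invented for illustration):

```python
# Sketch: emit Dublin Core <meta> tags from a dict of element/value pairs.
# The example values below are invented for illustration.
def dc_meta_tags(elements):
    return "\n".join(
        '<meta name="DC.{}" content="{}" />'.format(name, value)
        for name, value in elements.items()
    )

print(dc_meta_tags({
    "Title": "Content choreography",
    "Creator": "Marcus Österberg",
    "Language": "en",
}))
```

Generating the tags from structured data, rather than typing them by hand, keeps the page’s metadata consistent with the source of record.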
The point of embracing a metadata standard is that the information becomes compatible, and comparable, with other information that follows the same standard. Beyond this lies the challenge of selecting the ‘right’ standard. You will want to adopt the same standard that other systems use, if you mean to compare your information with theirs in the future. It may seem simple in theory, but in practice there are local variations in how to describe things even when using the same standard. If the world chooses to follow a specific standard, it is good for your own sake to use the same one.
People entering information into your system must be aware of the appropriate standard, and be offered some support so that their contributions can be consistent and compliant.
When talking standards with IT vendors, they all tend to claim that they comply with all of them. It is important to try to see through the sales pitch and figure out if they mean that the system follows a de facto-standard, its own system standard, which ties you to their product, or an open standard.
Choose standards with care by trying to figure out which one is the most established for your situation. For the Web, there are thankfully great open standards for metadata in the form of common metadata tags in HTML that work with the Dublin Core standard, and even microdata, which we will talk about later. Now, a look at two different choices with regard to how much freedom to give users when entering keywords. First, the orderly way with a controlled vocabulary; then, a folksonomy filled with whatever users choose to write.
Glossary – controlled vocabulary
List of carefully selected words regarding a given topic. Often used as metadata to categorize other information. A word’s synonyms are frequently included, and sometimes the words are arranged in a tree structure with internal relations. People often refer to a controlled vocabulary when talking about code systems, classification, and terminology.
Glossary – ontology
A body of knowledge about a particular field. Lays out relationships between multiple vocabularies and taxonomies.
A controlled vocabulary is a pre-defined list of carefully selected words approved for use within an industry or organization. Words might be predetermined if it is imperative to be compatible with a certain metadata standard. Controlled vocabularies are used to classify information in a common and consistent way that stands the test of time and bridges the boundaries of organizations. These vocabularies are developed and maintained to encourage the use of a common language where precision in the word’s meaning is of paramount importance to avoid misunderstandings and ambiguities. In other words, we need cooperation and broad support for a vocabulary to be useful. Few words have perfectly set meanings; context is everything. Take the word cancer as an example. If you find it as a categorical keyword on a website, is it obvious what is referred to? What can seem clear, owing to assumptions and our personal knowledge, can actually be quite ambiguous.
Off the cuff, I imagine that the word cancer can be:
- The name of a group of diseases.
- A star constellation.
- A zodiac sign, which suggests approximate time of year of birth according to the pseudoscience of astrology.
- A Japanese Transformer character.
Perhaps you can offer even more suggestions. It is probably more common to mean the disease rather than the character in the Transformers – but, would your search engine figure that out?
If you get a list of all the information marked with the keyword ‘cancer’ – how do you know that everything relates to the same topic? You would not! You would be more confident if you knew that the contributors and information managers actively work with a vocabulary and strive to reduce ambiguity.
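The disambiguation work a vocabulary does can be sketched as a mapping from free-text terms to preferred, unambiguous concepts. The terms and concept labels below are invented for illustration:

```python
# Sketch of a tiny controlled vocabulary: several free-text terms map to
# one preferred concept, so synonyms all land on the same unambiguous
# label. Terms and concept labels are invented for illustration.
VOCABULARY = {
    "cancer": "Disease:Cancer",
    "tumour": "Disease:Cancer",
    "neoplasm": "Disease:Cancer",
    "crab": "Constellation:Cancer",
}

def preferred_concept(term):
    return VOCABULARY.get(term.lower(), "UNCLASSIFIED")

print(preferred_concept("Tumour"))   # Disease:Cancer
print(preferred_concept("gangnam"))  # UNCLASSIFIED
```

The UNCLASSIFIED fallback is where a folksonomy, discussed next, can pick up the terms the controlled list does not yet know.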
YouTube nicely separates the controlled vocabulary and the contributors’ own additional keywords when tagging video content. When entering a tag (keyword) that happens to be in YouTube’s vocabulary, you are notified in the user interface. Parentheses surround the type of data YouTube believes a keyword belongs to, such as ‘(City / Town / Village)’ for cities. It is a flexible solution that makes sure the user is aware that they just entered information that can be structured and made unambiguous. This is great if the user meant that exact thing, but they have the choice to disregard the suggestion and add their plain keyword.
An example of a vocabulary is ICD–1010, which is an international method of classifying diseases and health problems. The collaboration forum for this vocabulary is the World Health Organization (WHO), which reports to the UN. ICD–10 contains over ten thousand ways to describe medical diagnoses at several levels of detail. Since the terms and their meanings have been translated into many languages, it overcomes local, idiosyncratic names and lexicons that could place barriers between healthcare professionals across the world. To build on the healthcare example, there is a plethora of other specialist vocabularies; for instance, MeSH11, which categorizes healthcare information, and HL712, which describes related health information, such as laboratory results.
As you can imagine, the field of healthcare needs several controlled vocabularies since no vocabulary is all-inclusive, and each has different strengths and limitations.
Glossary – folksonomy
A vocabulary without centralized control or standardization. Sometimes called democratic, social, or collaborative tagging.
A folksonomy is a list of words and/or concepts that are not standardized or centrally controlled by any stakeholder or organization. A folksonomy is characterized by the freedom contributors have to add new words as each individual chooses. A folksonomy can stand in marked contrast to a controlled vocabulary. A folksonomy is a highly social and democratic approach to words which may be used as keywords. New words, slang words, ordinary expressions are gathered in a creative chaos that is updated in real-time, as contributors add more tags to content. A controlled vocabulary, on the other hand, might be slow to accept new words, and might only be updated each year and published as a new version.
New terms are created more quickly than ever in this Internet age. Few people outside of South Korea would have been familiar with the word ‘Gangnam’ in 2011, but the following year, ‘Gangnam Style’ went viral around the world as people re-interpreted Psy’s K-pop dance music video. All the many content creators and curators at the time could not wait for the term ‘Gangnam Style’ to be added to a controlled vocabulary after some ruling body’s committee meeting; the benefit of, and the need for, a folksonomy is clear.
Is your information management work enhanced by the limitations imposed by a controlled vocabulary? Or are there frustrating occasions when your metadata simply cannot cope with new concepts? Would a middle ground with both a folksonomy and a controlled vocabulary work better? When designing a system, you must carefully consider the future ramifications of your decisions. What seems ideal in theory may be unworkable in practice. Think about the benefits and drawbacks of creative chaos and rigid structure; what balance will be right for your work, and the work of others?
Seth Earley, one of the authorities in the field of information architecture, suggested in a master class I attended that the need for a controlled vocabulary is clear if the content:
- Will be widely used and re-used in other contexts.
- Is already included in a controlled process.
- Answers questions anyone may have, such as documented approved methods, guidelines, or comparison information.
- Has a significant cost to produce.
Examples of typical systems that probably require a controlled vocabulary would include records management, document management, and digital asset management.
A system that would fare well with a folksonomy would have the following characteristics:
- Included in a creative or ad-hoc process.
- Tends to solve problems in a practical way.
- Has a high degree of cooperation between individuals.
- The information arises whether it is wanted or not, as in a discussion forum or micro-blogs such as Twitter.
Blog platforms work well with a folksonomy, as do instant messaging systems, chat rooms, wikis, and other collaborative and communication platforms.
People commonly have an innate sense of which of these variants they should devote themselves to, a vocabulary or a folksonomy. One of my metadata-interested colleagues, who works with document management in healthcare, asked me a fascinating question:
“By the way, who decided what hashtags are allowed for use on Twitter?”
– A colleague
My colleague clearly found more value in controlled vocabularies!
On microblogs such as Twitter, each of us can make up our own keywords to use as hashtags. The benefit of an unguided medium such as Twitter is that we can have a hashtag for the evening and talk about just about anything, without the prior approval of any third party.
You might wonder where on the control scale a web content management system sits. Are you supposed to go for a folksonomy or a vocabulary? I think most people need a combination of both. The most important thing is to support contributors in which wordings to use, regardless of whether a word is in a vocabulary or already in your folksonomy. That promotes the reuse of words within the folksonomy, helps with spelling, and reduces the number of near-duplicate keywords.
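Such support can be sketched in a few lines: before a new tag is stored, suggest close matches from the words already in use, whether they come from the vocabulary or the folksonomy. The keyword set below is invented for illustration:

```python
import difflib

# Words already in use, from the vocabulary and the folksonomy combined.
existing_keywords = {"intranet", "internet", "information-architecture", "api"}

def suggest_keywords(candidate, cutoff=0.6):
    """Return existing keywords similar to the candidate, best match first."""
    return difflib.get_close_matches(candidate.lower(), existing_keywords,
                                     n=3, cutoff=cutoff)

# Typing "intranets" suggests reusing "intranet" instead of storing a variant.
suggestions = suggest_keywords("intranets")
```

This kind of gentle nudge promotes reuse and catches misspellings without forbidding genuinely new words.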
Hashtags on Twitter tend to be used for two purposes:
- Indexing, allowing topics to be contextually entitled and collated, and for people outside of the conversation to find them, e.g. #job #jobhunt #intranet #FF
- Commentary, as when people end a tweet with something like #justsayin #stayinyourlane #notmybestmoment or #FridayFeeling for example.
Like a folksonomy, many commentary hashtags are unique (e.g. #VeganKyloRen) but while they provide context, they’re no use for indexing.
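The indexing purpose can be illustrated with a minimal sketch that extracts hashtags from tweets and collates them into an index. The tweets are invented examples:

```python
import re
from collections import defaultdict

def hashtags(text):
    """Extract hashtags (without the #) from a tweet, lower-cased."""
    return [t.lower() for t in re.findall(r"#(\w+)", text)]

tweets = [
    "New #intranet search is live! #job",
    "Looking for an #intranet manager role #jobhunt",
    "Ordered decaf by mistake #notmybestmoment",
]

# Map each hashtag to the tweets that use it.
index = defaultdict(list)
for i, tweet in enumerate(tweets):
    for tag in hashtags(tweet):
        index[tag].append(i)
```

An indexing tag such as #intranet collates several tweets, while a one-off commentary tag such as #notmybestmoment indexes only its own.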
Architecture using APIs and open data
Glossary – API (Application Programming Interface)
An access point to transfer information from one system to another. Sometimes an API offers some of the system’s functions to other systems. APIs provide the rich data and information needed for mobile apps, websites, and such like that use dynamic content or require interactivity.
Glossary – public API
An API that encourages third parties to use it. A public API should be documented, list support information, and have some form of commitment from the issuer that they intend to keep it alive for external users.
Glossary – PSI Act (Public Sector Information)
Law in most countries in the EU, based on an EU directive that regulates the way in which governments must share their collected information.
Glossary – open data
Philosophy that strives to share information publicly. At the very least to offer data free for reuse, preferably in a structured way for others to design services around it.
Valuable information may remain blocked within a system that cannot communicate with other systems. To realize that value, and perhaps create something new, integration and information exchange between systems is needed.
APIs, like the lubricant in a gearbox, make the online world work so well, enabling Internet services to interoperate and provide rich experiences across devices.
When planning to use APIs, you will want to consider how to integrate with very old systems, or at least systems that have had their architecture and vocabularies established, and also how your objectives might influence your choice of new systems – built to order or purchased off the shelf.
Public APIs, open data and the PSI Act
Failing to consider open data, or at least making APIs publicly available, when building an information system should be deemed a dereliction of duty. For a digitally transformed company, openness with data should be a natural part of everyday business. Most tax-funded organizations have many data sources that would make a greater difference to society just by being open, and some would most certainly also generate new revenue. Once public sector data is open, citizens themselves could improve the content, so-called crowdsourcing, adding new information or time-sensitive local exceptions observed when actually using the data in the physical world. It is on the verge of unethical for a tax-funded organization not to share its data sources if open data could be of value to society. The challenge is deciding what is interesting enough to open up, and how. Perhaps an ‘open by default’ policy is needed.
Developers can avoid costly and unnecessary data collection by using someone else’s data sources. Commonly used data would include exchange rates, weather forecasts, public transport timetables, and product reviews. It is natural to seek open data and existing data sources when building new digital services. Governments should offer data through APIs and enable others to solve public information problems instead of building a half-assed service themselves.
A large part of any organization’s operations is managing information. Allowing multiple actors to collaborate on the same information makes the data source more credible and well-known, to the credit of an owner who could have chosen to keep everything internal. Companies can make money on this, and tax-financed organizations can save money: no more excessive double-recording of information; collect it once and for all. For instance, imagine how many businesses on the Web would make use of governmental accessibility information. Suddenly, everyone who would like to guide a person to a physical location would have information on the location’s accessibility for every specific variation of needs: for the blind, the language skills of the staff, whether the staff are LGBT certified, etc. Governments have much useful data in all possible areas – but unfortunately not always in a great system.
Other examples often held up to sell the sharing of information involve getting help from people by staging some form of competition. The example I have heard retold atmospherically on many occasions is about the mining company Goldcorp, which announced a competition to find gold that the company itself had difficulty finding. Goldcorp shared its geological data, and the competition aimed to raise the profitability of an existing mine. In the pot were 575,000 Canadian dollars in prizes for those who participated in this digital gold rush. The initiative was made possible by Goldcorp CEO Rob McEwen, who had reportedly been inspired by the culture of cooperation in the early days of the Internet in the 1990s.
Those who won this competition were not geologists; they performed the job without needing to travel, and they solved the problem from, wait for it, Australia. The result of awarding 575,000 Canadian dollars was that production at the mine started about two years earlier, and the value of gold mined from these findings could be reckoned in multiple billions of Canadian dollars.
The example of Goldcorp is of course unusually spectacular in its success but it is an example that others can help you with your problems if they have access to your data. In some cases, they do it for free if they and your business share a common goal your data can help realize.
The fashionable word of 2006, ‘mashup’, means mixing multiple data sources to solve a problem or create a service. Combine two or more data sources and the whole is often greater than the sum of its parts. Nowadays, people do not talk much about mashups, since mixing sources has become a natural part of developing new services. But the need for access to data sources is increasing. Most common is probably combining our own data source with someone else’s.
If you take pictures with the Instagram app on your phone, you can choose a named geographic location to associate the image with. Instagram uses the API of the geographical data service Foursquare and thereby avoids all the complexity associated with that type of data. These two organizations have signed a cooperation agreement, even though Foursquare offers its API to anyone.
The next part deals with transparency, which is important when building services and is not only of interest to the public sector. Everyone can exercise their rights to receive data from the government.
Background to the European Union’s PSI Act
The act aims to improve the ability of citizens to play a part and participate in governmental affairs. A supposed positive side effect is that innovation and economic growth will occur when releasing information that the government has collected.
The PSI Act represents a significant improvement for many European countries that did not already have substantial governmental transparency. In some countries though, such as Sweden, many failed to see the potential in government operations and leaned back, resting assured that they had already met the legal requirements decades ago. The main difference is that Swedes have long been able to request paper copies of most of the government’s administrative information. Since the PSI Act initially did not expressly require digital copies of information, very little happened there, while several other countries took the digital leap: from not needing to disclose any documents at all to offering them all digitally.
At the very least, data sources of public interest should be made open, which may seem obvious, since their curation is funded by the taxpayers. Despite this, exemptions can be applied for, a couple of years at a time, if you as a governmental organization provide a service that is of public interest. Many developers would probably say that it is all in the state’s self-interest, and several national organizations have stated such missions: to offer national services of public interest. For instance, governmental land and geological surveyors, company registries, and statistical organizations. These organizations also happen to own some of the most interesting data sources, ones that undoubtedly hold a general interest.
A friend I am not going to name worked at Sweden’s national weather forecasting institute during the transition from safeguarding their data to sharing it freely. This friend happens to be very versatile in most aspects of technology and got upset one day. Some of his colleagues had put a lot of effort into what nevertheless turned out to be a poor visualization of the data which the institute, back then, did not share with third parties. He mumbled something about high school students being able to do a better job in less time if the institute had just released its data sources.
Opening your data sources will most certainly initiate a discussion on what your core operations really should be. Of course, the government should not stop providing society with information services just because it has opened its data sources. But spending time on vanity-work with information is certainly not a great idea any more, as my friend noted, since many other actors are better at it.
Some take issue with the PSI Act – cumbersome access to data
When I speak to entrepreneurs who have tried to use this law for their business ideas, they get a resigned expression on their faces. They often talk in terms of “the government is a paper-API”, and that “we need more lawyers compared to developers in our company”.
The PSI Act needed provisions that at least regulated the means of disclosure of information. Paper should only be accepted if the information exists in no form other than paper within the organization. Developers usually make do with structured text files, database exports, Excel, and almost anything exported digitally from existing systems.
At the time of writing, there is ongoing work in the EU, made visible by EU Commissioner Neelie Kroes, that aims to clarify the requirement to disclose information in a structured digital format. How it will turn out remains to be seen, but my guess, and hope, is that each amendment of existing legislation will demand more structured data.
What then is open data?
Open data is digital information with no restrictions on reuse, unlike PSI data, which permits limitations on reuse. Open data should therefore be free from copyright, patents, and other obstacles. When the government is the publisher of such information, it is usually called open public data. The Open Government Working Group has attempted to standardize what is to be considered open data.
These requirements for open data, derived from Wikipedia, hold true at least in my opinion:
- Complete. Information that does not contain personal data or depends on confidentiality is made available as widely as possible. This is particularly aimed at databases with materials that could be processed and improved.
- Primary. Information shall be provided, as far as possible, as an original. Image and video materials will be provided in the highest possible resolution to allow for further processing.
- Timely. Information should be made available as quickly as possible so that the value of it is not lost. There should be mechanisms to receive information about updates automatically.
- Accessible. Information made available to as many users as possible for as many purposes as possible.
- Machine processable. The information is structured in a way that allows for machine processing and interconnection with other registers.
- Non-discriminatory. The information is available to all without requiring payment, or restrictions in the form of licensing and registration procedures.
- In an open format. The format the information is provided in adopts an open standard, or that the documentation for the format is freely available and free from copyright licensing terms.
- License-free. The data itself should be free of any limitations or costs of use. For instance, if the data is released under Creative Commons CC0 or in the public domain, it is considered license-free.
Some points are more open to interpretation than others, perhaps principally point seven, which can be satisfied by anything from a simple text file to an advanced API for distribution and synchronization of information. If you combine open data with an API, which many believe is an obvious combination, others can build services that depend on both the gathered information and the API. It is not really a requirement of open data that it be offered through a public API, but if you want to encourage its use, it is worth checking what developers want. The exchange of data needs to be reliable and flexible, and give developers the confidence to use it. Developers may not always prioritize making money on the things they build, but they quickly learn to avoid the frustration that comes with seemingly unnecessary obstacles, terms, and other hassles.
The benefits of an API for a startup business or when building anew
It is perhaps not obvious at first glance, but the business model for APIs is that not everyone should have to reinvent the wheel. Today’s information systems are so complex to develop that any help from others is gratefully accepted. For a startup, it is essential to avoid doing the things you are not good at. That is a prerequisite for not failing, and in many cases you will probably depend on public APIs for services someone else can provide better than you can. The list of information and services a business depends on to be effective can be very long. Currency exchange rates for international companies, geographical data as in the Instagram / Foursquare example, the current weather for a location, or support functions to reduce spam comments on your blog are all examples of things someone else probably does better than you, and perhaps they already offer an API.
A list of arguments is in order: why a startup, or a new web service, can benefit from making its own public API a natural part of its business.
1. It is normal business
In today’s connected society, you do not know beforehand where your information will finally end up, so it is difficult to do without APIs. For your own business needs, it should not come as a surprise that APIs help when your web, intranet, mobile app and all other systems need access to the same data sources. When discussing a partnership with another company, you are at least partly ready for integration across organizational boundaries.
Is there a problem many face that you can solve with the help of a computer? With an API, you can offer your services to the world directly from your parent’s basement, or your own garage 🙂
APIs are today’s information desks, automated secretaries, and staff all rolled into one. Note that an API is not automatically private just because it is not public. When you have an API for your mobile app, others may also have access to it and use it; a public API is a better start to a relationship with other developers than having them use your hidden API.
2. Builds relationships around your services
Think of it as an ecosystem where your services are the centerpiece from day one. At first there may be just a few who are interested in your services, but with every new user you have someone whose success partly relates to yours. That makes for natural communication and exchanges of offers in the future. External developers contribute new perspectives, innovation, and expertise towards something all participants benefit from.
What if someone who uses your API makes it big time? It will partly spill over to your services and it could lead to unexpected business opportunities.
3. Release the data and contribute to transparency
Many probably wish that their employer’s data sources were more accessible to employees. I am definitely one of those with a need for more transparency into what data my employer has. Think about the term data discovery for a moment. How easy is it to explore the organization’s digital information resources? Too often, internal inventiveness around existing data sources is thwarted because creative people do not have access to the data they actually need to do their jobs better or more efficiently.
Even with an API, the information remains in a silo. The difference is that you know which silos there are, which ones you can make use of, and you can use the content as necessary. Someone has put effort into gathering the information, so it is a good idea to take advantage of it and encourage reuse.
4. Investors take this almost for granted
How big do you think Twitter would have been if they had never offered users’ data to third parties? Because they did, a broad range of applications emerged for all possible platforms, which contributed to Twitter’s popularity. Other developers gained a lot of experience that Twitter could use themselves and capitalize on later.
A big part of succeeding online is to enlist the help of others and make the most of things as quickly as is humanly possible. An API is often a part of that strategy.
5. Makes for good mashups
Google Maps is a gigantic example of a widely used mashup. The free use of Google Maps on other company websites or apps helped Google to establish themselves in the map business, but if the users’ services become popular, they will have to start paying to use them.
The possibilities of mashups are bigger than you might think with all of the niche interests that thrive on the Web, along with all the services that help with features like video and more that are difficult to create yourself.
6. Self-sustaining content marketing
Spirits brand Absolut did an interesting marketing stunt in 2013 when they hosted an innovation competition centered on their new API. The API contains many ingredients for cocktails where Absolut’s own vodka happens to be an ingredient.
If the API becomes popular or the drink recipes turn up on someone else’s website, Absolut gains extremely valuable credibility from a source that is seemingly independent. In addition to all the text in the recipes, professionally taken photographs and video clips are offered free of charge via the API. Users and developers did not have to think twice about the quality of the content.
Design a public API with the developers’ experience in mind
It is a good idea to start all IT projects by planning what data to handle. Then you have a plan for the construction of an API for your own needs. It is advisable to use this API yourself – eat your own dog food – in the same way as if you were an external party. If you do not dare to use your own API, one may ask why someone else should be more reckless than you are.
Just as we have for a long time distinguished between content and design in web development by using CSS and HTML, in the same way we should distinguish between data and presentation by using an API to feed the data a webpage needs. If you are starting a new web project, begin by looking at what data you already have, what data you need, how you will collect the needed data and which parts are meaningful to offer to third parties.
When releasing a public API, you really should commit to some basic things. Even though you may only indirectly make money on the API, the relationship between you as the publisher of an API and your users is similar to a business relationship, and you should regard it as such.
Friendly terms and a free license
If you want someone to use your API, it requires good communication and mutual interest between all involved. Avoid being too bureaucratic or legal in the conditions of use. Instead, try to be encouraging and inspiring concerning what they can build using the API.
I think it is worth emphasizing that you should avoid burdensome terms and offer as free a license as possible. This is especially true for the public sector, but the same reasoning is useful in the corporate context. Be mindful of anything in the terms that is unnecessarily harsh, or that perhaps indicates that transparency is given reluctantly.
I have encountered an example of reluctance, though not with open data, in the form of a former monopolist in the event tickets market. My company had Northern Europe’s largest festival website, and I asked them for permission to use their APIs to channel my visitors to the correct part of their website: for my visitors to buy tickets more easily, for me to avoid manual linking, and for the other company to make money. It was a bit confusing when they wanted money from me for handing prospective customers over to them. There I was, with hundreds of thousands of visitors reading about the concerts and festivals, unable to easily guide them to where they could buy the tickets. It would have been more reasonable for them to ask how much I wanted to be paid for handing over customers, or to reply that they could not help me for some technical reason. Need I mention that nowadays there are many companies in this market space? There are, of course, a number of APIs to use. When formulating your own terms of use, make sure they answer questions such as:
- Is there any pay model? How many requests to the service are free of charge and when exactly does the service start to cost money?
- The basic license for information purposes. If there are limitations in the general license, which license applies when, and what are these limitations?
- If the information is not completely free – how can it be temporarily stored? It is an advantage if the terms, in plain language, can tell how long a re-user, for performance reasons or otherwise, can keep data downloaded in their own system.
- Are there one or more usage quotas? Usually there are a limited number of requests allowed to be made to an API, and this is very important to find out early on for developers so that they can conserve resources.
Don’t surprise developers with unforeseen breaking changes
It is easy to think that you thought of everything, but almost every successful technical project requires future adjustments. When it comes to APIs, it is important to plan for this right from the start by designing for future additions. This means that you will version-manage APIs regardless of whether you envisage future versions.
Many developers put the version number in the addresses used for sending requests to the API, such as /v1/ or /version-1/. This makes it easy to see which API version the code is using. In addition, you do not have to worry about clashes between different versions, as each always has a unique version number. Having the version number in the address is standard practice among major APIs, perhaps because the alternative, putting the version number in an HTTP header, is more obscure; many developers do not even know about it.
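The version-in-the-address practice can be sketched without any web framework; the handler names, response shapes, and paths below are illustrative assumptions:

```python
# One handler per API version, so /v1/ and /v2/ can coexist while
# older clients migrate at their own pace.
def handle_v1(resource):
    return {"version": 1, "resource": resource}

def handle_v2(resource):
    return {"version": 2, "resource": resource, "links": []}

ROUTES = {"v1": handle_v1, "v2": handle_v2}

def dispatch(path):
    """Dispatch e.g. '/v1/buses' to the handler for that API version."""
    _, version, resource = path.split("/", 2)
    handler = ROUTES.get(version)
    if handler is None:
        return {"error": f"unknown API version: {version}"}
    return handler(resource)
```

Because the version is explicit in every address, a reader of the client code can see at a glance which contract it relies on.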
The documentation must also be versioned so documentation is still there for those who are not using the latest version of the API. It is advisable to provide information on what differentiates the various versions from a technical perspective, but also the direction in which the API is developing conceptually.
In practice, it must be possible to use an older version of an API for a transitional period if others depend on it. Good practice is to contact them with information about what the changes are and how long they will be able to stay on the older version. You may not always be able to turn off access to older versions, but it is good to be open about your plans for continued support of versions other than the most recent. An absolute minimum is to give users of an API at least three months to manage their migration to a newer version. Expect to upset some users if you set too tight a deadline.
Because of future needs for change, and to get to know the API users, it may be a good idea to require, or offer, registration. At least for those who are continuous users or use the API for business purposes. The benefit to them is to be able to receive information on improvements, advance warning of new versions, and also a way to get in contact for support questions.
Since a user’s API usage may vary heavily from day to day, it is a nice gesture to offer a soft quota with prior notice before the hard quota causes a lockout: for example, sending out an e-mail when 75% of the quota is used up. Letting your API users choose when this warning occurs would be a great plus. Should you buy an off-the-shelf system to offer APIs on a larger scale, definitely ask about customized quotas.
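A soft quota of this kind might be sketched as follows; the 75% threshold and the notify hook are illustrative assumptions:

```python
def check_quota(used, hard_limit, warn_at=0.75, notify=print):
    """Return True if the request may proceed; warn once past the soft quota."""
    if used >= hard_limit:
        return False  # hard quota reached: lock out until the quota resets
    if used >= warn_at * hard_limit:
        # soft quota reached: warn (e.g. by e-mail) but still serve the request
        notify(f"Soft quota reached: {used}/{hard_limit} requests used")
    return True

check_quota(80, 100)    # warns at 75% but still allows the request
check_quota(100, 100)   # hard quota: request refused
```

In a real system, `notify` would send the e-mail and `warn_at` would be a per-user setting, matching the customization suggested above.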
Try as far as you can to put yourself in the user’s shoes and it will probably turn out fine. Eat your own dog food, that is.
Provide data in the expected format and in suitable bundles
There are two opposing approaches to data exchange that you need to reflect upon: providing data refined enough for direct use in other applications, or providing data in its original format. An example of the refined version is an API that responds with true or false depending on whether a certain bus is on schedule. The more original format would be to expose all data on all buses, like a database copy.
Processed data is handy, of course, but it simultaneously limits what we can do with the information. Getting a database copy is great for those who need to do just about anything other than the most obvious, but the data set’s timeliness quickly fades. If you use the API yourself, it is a bit easier to know what the first version should look like. But if others are to gain access, you should talk to them about what information to deliver for them to have a great experience. The risk is that you, with the best of intentions, compile data in a way that makes it impossible for others to take advantage of your API. Examples I have seen include services converting information into readable formats such as Word documents, which really only gives developers more work converting everything back to plain text.
It is a good idea to offer the original format in bite-sized packages, with ways to keep track of data that is updated regularly. To continue with the bus example, the packaging would allow retrieval of all the information about a certain bus line’s planned schedule, complemented by a simple API service that specifies how the line’s current situation relates to the schedule.
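The bus example could be sketched like this, with invented timetable data: a raw, original-format dump per line alongside a refined convenience call:

```python
# Invented timetable data standing in for the "database copy".
TIMETABLE = {
    "line-4": {"stops": ["Central", "Harbour"], "scheduled": "08:00", "actual": "08:00"},
    "line-7": {"stops": ["Airport", "Mall"], "scheduled": "08:10", "actual": "08:23"},
}

def dump_line(line_id):
    """Original format: the full record for one bus line, a bite-sized package."""
    return TIMETABLE[line_id]

def on_schedule(line_id):
    """Refined format: just true or false for the most obvious question."""
    record = TIMETABLE[line_id]
    return record["scheduled"] == record["actual"]
```

Offering both lets the casual consumer ask the obvious question while the ambitious re-user gets the raw material for everything else.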
Which format is best? Well, that depends on whom you ask. If you ask a non-developer, they probably think about how to access the information on a familiar type of device. The answer is certainly something that is familiar to most of us, possibly something from the Microsoft Office suite, Adobe Photoshop, PDF and the like. If you ask a developer on the other hand, they often think of the versatility of the format and you will get answers that are abbreviations, like JSONP or XML, or that they want everything you have in whatever the native format is.
The point is you have to know who your users are and whether you already know what they need. Otherwise, it is time to get hold of some representative users.
As a rule of thumb, think about which format makes it quick and easy to explore what the data source contains. Complement it with the formats used in other applications. It is common practice to offer several formats, so it is up to users to choose for themselves. For example, it is easier for most people to get an Excel file when they need to look through the content on a single occasion, while you would rather have a CSV file, i.e. a comma-delimited text file, or similar if the content is to be processed in an application.
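The CSV variant costs almost nothing to produce, which is part of its appeal. A sketch, with made-up column names, of serializing the bus data above as CSV:

```python
import csv
import io

# Tabular content serialized as CSV: trivial to produce and trivial for
# other applications to process. Column names are invented for illustration.
rows = [
    {"line": "5", "stop": "Central Station", "departure": "08:00"},
    {"line": "5", "stop": "University", "departure": "08:12"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["line", "stop", "departure"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The same rows could be offered as an Excel file for one-off reading, but CSV is the version a developer will feed into another program.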
Also keep in mind what situation your users are in – do they need to download the content as a file, or is it to be used with a mobile app? It can be both, but by your choice of path, you control which one users get. Offering the very popular JSONP format is a good start for attracting web developers, who will recognize it and be able to use your API directly from their own web pages without necessarily having to incur the inconvenience of making a local copy first.
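JSONP is nothing more exotic than a JSON response wrapped in a JavaScript function call chosen by the caller, so browsers can load it cross-domain via a script tag. A minimal server-side sketch (the function and field names are my own; in production the callback name must be validated to prevent script injection):

```python
import json

def jsonp_wrap(callback, payload):
    """Wrap a JSON payload in a caller-supplied callback function name.

    This is what turns a plain JSON response into JSONP, loadable
    cross-domain by a browser through a <script> tag.
    NOTE: a real implementation must whitelist the callback name.
    """
    return f"{callback}({json.dumps(payload)});"

print(jsonp_wrap("handleData", {"line": "5", "onSchedule": True}))
```

The consuming page simply defines a `handleData` function and includes the API URL as a script source.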
The type of data can sometimes dictate the format. If it is financial information, a spreadsheet supplemented with a CSV file is most logical. If it is news or other chronological information, RSS or Atom is probably most suitable. If it is geographic information or map data, perhaps GeoRSS, shapefiles or KML files are what you are looking for.
Many times, the address of the API request states which format the response is in. The reasoning is the same as with expressing the version number in addresses – it is easier to understand your code if it is clearly indicated what the response contains. An example is an API request to find out who won the Nobel Peace Prize in 2015, with JSON specified as the format directly in the address.
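Such an address might look like the one built below. The host and parameter names mimic the public Nobel Prize API, but treat them as illustrative; the point is that the `.json` suffix in the path announces the response format:

```python
from urllib.parse import urlencode

# Illustrative only: modeled on the public Nobel Prize API, but the exact
# host, path and parameters are not guaranteed to match its current version.
base = "http://api.nobelprize.org/v1/prize.json"
query = urlencode({"year": 2015, "category": "peace"})
url = f"{base}?{query}"
print(url)
# http://api.nobelprize.org/v1/prize.json?year=2015&category=peace
```

Anyone reading code that requests this URL can tell at a glance that a JSON document about the 2015 peace prize is expected back.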
Error handling and dimensioning of the service
A good public API needs to be predictable. You need to know in advance how an error will appear, technically. Some APIs I have seen even let you instruct the API to produce an erroneous response, so you can test-drive your application for this eventuality. An API is, as you have probably figured out by now, infrastructure that others use, and its reliability should be taken just as seriously as that of the other major web services you offer.
It can have enormous consequences for a company's reputation if the API does not work. An example is Facebook's bug at the beginning of 2013, which affected almost all the sites that used Facebook's Like button. Large parts of the Web were not accessible at all. Well-known sites such as CNN, Huffington Post, ESPN, The Washington Post and many more went offline because they had no fault tolerance for Facebook's failure.
For those who use an API to stand a chance of carrying out good error handling themselves, the API publisher needs to offer proper error reporting. This starts with obvious things, like using the correct HTTP status codes. You have probably seen 404 error pages when surfing the web; those kinds of error messages are helpful, developer-friendly ways to find out what happened.
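A predictable error is a correct status code paired with a machine-readable body. The shape below is a sketch of my own; the field names are not any official standard, though many real APIs use something similar:

```python
import json

def error_response(status, code, message):
    """Build a predictable API error: an HTTP status code plus a
    machine-readable JSON body. Field names are illustrative only."""
    body = json.dumps({"error": {"code": code, "message": message}})
    return status, body

status, body = error_response(404, "UNKNOWN_LINE", "No bus line '99' exists.")
print(status, body)
```

A client can then branch on the status code and show the human-readable message, instead of guessing from an HTML error page.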
The biggest failure is probably when the entire API goes down because of an overload. That can be caused by many things, most of which are familiar to all developers. An interesting anecdote concerns the company that manages the mass-transit system in Stockholm, Sweden. When they re-launched their website, they thought they were the target of a massive denial-of-service attack. It turned out that the earlier website had an API, internal as they thought, for accessing traffic information. The API was not documented or advertised as a service for others to use. But that did not stop many popular services and mobile apps from integrating directly with the API, which had now disappeared.
The reason that the new website went down was probably due to all the erroneous requests to the website where the API once resided, which can be more resource-consuming than when everything worked as intended. More on performance planning later in this book. The solution was to roll back to the old website, work together with all those who needed a public API and then publish the new website again.
An ingenious solution many probably overlook is to design the API so you can prioritize traffic in difficult situations. If you are in the sticky situation of having to throttle usage, perhaps the API should only serve the API issuer and paying customers.
Other problems that can arise are that you have not optimized the API to make frugal use of related resources, such as databases, or that capacity simply runs out because of overwhelming popularity beyond your imagination.
Nowadays, it is so cheap to rent space at a major hosting provider that you should abandon the tiny hosts, at least for APIs, and keep a healthy margin on the resources the API depends on. The same may well apply to services hosted by the in-house IT department, even in a larger organization – a hosting provider may offer a comparable service.
Best practice is to have a subdomain like api.mywebsite.com or data.mywebsite.com, which allows you to put the API somewhere that prioritizes performance, and scalability, without automatically affecting your website’s costs or settings.
Provide code samples and showcase success stories
Those who want to use your API are not necessarily experienced system developers that have plenty of time on their hands. Therefore, what others can reuse should be included in the API documentation. For example, the API documentation should give tips on how to get started, code samples, or even more or less complete sample applications to download and use as a template. Remember, the goal of having a public API is for others to use it; otherwise, it is smarter to keep it entirely private.
Be open to linking to resources that may be useful, give tips on tools to crunch the data, and encourage those who use the API. Even if they do not pay for their use, their activity at least justifies why the API was made public.
An easy approach some have embraced is to have a designated contact person who supports API users, mainly through a developer blog for news and a wiki for documentation. There you can publish information, respond to comments and enhance the documentation. This is where you build relationships with API users – a task that requires technical skill, social tact and a touch of marketing thinking.
Promote via data markets and API directories
To reach out with your API, make use of various API directories and get listed. Internationally, programmableweb.com is by far the largest, with tens of thousands of listed APIs, but do not miss national or local directories.
Some directories require open license terms for inclusion while others are more of a data market where you can make money by selling your data through the service. On these services, you can see what competition there is in the field your API covers, which can be really good inspiration for what to offer in the next version or if collaboration with another organization is meaningful.
What is the quality of data needed?
Glossary – URI (Uniform Resource Identifier)
An address to an Internet resource, which looks like an ordinary URL. In the context of linked data, URIs are intended to give an address, via the Internet, to a thing, which may be a way of naming something in the physical world. You can view a URI as a name for something – a name that happens to be an address to the description of the thing.
Glossary – linked data
Data that is compatible with other data and usually contains relational links, URIs that is, to these resources. Readable, processable and understandable to machines through an information model that is self-descriptive.
Linked data is refined data, so well standardized that it can be combined with other data sources across the Web. The challenge linked data tries to solve is bringing order, structure and context to the increasing amount of information we are trying to make use of – popularly known as big data. Linked data is a way to know whether the information we stumble upon is related to something else, and whether we can do further research. Just as the Web is a network of documents, linked data enhances the Web with an ever-growing network of data. Data combined into facts and knowledge is the natural continuation of the Web.
The Web’s creator, Sir Tim Berners-Lee, has set up four principles for how to link data, namely that:
- Use URIs to name and identify things. The ‘thing’ may be a document on the Web, a dataset or a physical location such as a tram stop. This provides unique naming and a common way to refer to things, whether they are online or in the physical world.
- Use the HTTP web protocol so that URIs can be looked up, read by people and processed by machines. In other words, there should be a page on the Web where you can read information about a URI, whether you are a human or a machine, entering via a desktop computer or another type of device.
- Provide useful information when the URI is looked up, by use of standardized formats such as RDF, XML and SPARQL. Information such as the status of the thing, who is responsible for it, metadata such as keywords, etc. If a machine makes the request, the answer comes in a language machines can process, probably RDF. If it is a human using a web browser, you can expect readable information meant to address humans.
- Refer to other resources using their URIs, thereby referencing related URIs on the Web. A thing's relatedness to other information sources or datasets is of high importance. Among other things, a URI can declare that an older version of itself is found at a referenced URI. All known and useful references are given. The list can be quite long, since both an organization's internal and external URIs can be plentiful.
The BBC built their iPlayer Radio service with linked data as a data source. It enabled a lovely Wikipedia touch with cross-linking within the service: information related to what is currently playing is automatically pulled from external linked data sources. This integration of external sources is less tightly controlled than traditional use of external APIs; linked data uses many techniques that allow a loosely coupled combination of several data sources' content. The vision is that the Web will become like one big database accessible to all.
For it to be worthwhile to contribute with your own linked data, it should at least be able to relate to some other information silo. Or perhaps you are the natural issuer of URIs for something unique – for instance, a municipality, naming your properties, such as schools.
We can benefit from the principles of linked data without intending to release the information outside an organization. For example, introducing enterprise-wide naming of important things is something enterprise architects can only dream about. Just as social security numbers are understandable points of reference for interoperability between systems, URIs can offer the same standardization benefits – instead of having things named differently in every database.
Microdata – semantically defined content
Glossary – semantic web
Refers to websites where the content and type of information is understandable and processable by machines. This enables the Web to become more of an enormous database. A network of data and not just a network of documents.
Mention the word semantic and your colleagues instantly get a glassy look in their eyes – believe me, I have tried many times. It is not as complicated or boring as the word seems to suggest. The semantic web, or Web 3.0 as some call it, is the generation of the Web we see emerging today: a more intelligent and relevant web. It weighs in your location in the relevance model for which gas station might suit you – the ones that are close and have many positive reviews rank higher than the completely unknown ones without public contact information, located pretty damn far away from your current position.
For this to work, information needs to be clearly defined since no one has access to all information in a structured format. The complete picture is spread across many services that need to interoperate. Among other things, search engines are at the mercy of how well information is described on websites (and other sources of data) they themselves have no control over. Here microdata introduces itself as a savior for how your website can be a part of the semantic web, how your data can be self-explanatory to machines.
When using a search engine in recent years, it is likely that you have already taken advantage of semantic features. Search engines and other technical systems try to understand the structure of the information contained in the content. This is easier for humans than for machines; we can understand what a text is about just by reading it. The possibilities for search engines to improve their understanding of unstructured information have their limits, not to mention the difficulty of understanding the content of a video or other media types.
At the most basic level, probably everyone has realized at this point the need to distinguish between headers and other content in text. It is nowadays, fortunately, not that common to see sub-headers which are just bolded text or images for headers instead of text. Just as headers make themselves noticeable for us visually when we skim through a text, they are also there to give structure to a document, a structure that improves its readability to machines. The very same thing that makes a machine understand that a particular text is a header is what lets the blind skim a text, by listening to the headers before choosing to load any portion of it to be read aloud.
Another context where headers are used is to measure the relevance of a text. If a search term is contained in a header on the page, it is probably more relevant than another page in which the word is only found in the unstructured body. This is used in most of today’s search engines.
To identify the types of information your website contains, and mark them up properly, can no longer be considered optional work. List things like contacts, calendar events, geographical locations, etc., and find a suitable standard that describes the information. We will go through some of these standards shortly.
So, what is the problem?
Reflect upon how many varieties of dates you have encountered. Most likely, you can find a version in your calendar, another in your e-mail and a plethora on the products at the supermarket. I get grouchy to say the least whenever the date is given in the style of 06/11/10 since I do not know which standard it follows.
To figure out what 29 stands for on the August page of your calendar is easy for us humans. If we click on a calendar on a website and are presented with the same information, most of us will probably understand it in that context too.
However, it is not obvious to a machine to understand the context. Add to this that there are many different national standards and industry standards that specify how dates should be formatted. A week number in a date context is another problem; only a couple of European countries and a few in Asia have a grasp of the concept that weeks may have numbers.
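The ambiguity is easy to demonstrate in code. The same string, read under three plausible national conventions, yields three different dates (the labels are mine; the format codes follow Python's `datetime` documentation):

```python
from datetime import datetime

# One ambiguous date string, three plausible readings.
raw = "06/11/10"
interpretations = {
    "US (MM/DD/YY)": datetime.strptime(raw, "%m/%d/%y").date().isoformat(),
    "UK (DD/MM/YY)": datetime.strptime(raw, "%d/%m/%y").date().isoformat(),
    "YY/MM/DD": datetime.strptime(raw, "%y/%m/%d").date().isoformat(),
}
for convention, date in interpretations.items():
    print(convention, "->", date)
# Publishing ISO 8601 dates ("2010-11-06") removes the ambiguity entirely.
```

Machines need the convention stated explicitly; an unambiguous interchange format such as ISO 8601 sidesteps the problem.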
This problem is of course not limited to data describing dates or times. Among many examples worth mentioning are distance, geographic location and units for measuring weight. In fact, the problem exists in all information.
The potential of semantic information
One of several dreams that remain to be realized on the Internet is that the Web should act as a giant structured database so that we can get precise answers to almost any question. Right now, the Web is a half-structured database where it is difficult to know what is what, at least for machines.
Those who are interested in search engine optimization have probably encountered SEO best practices with structured data by now, or read tips about working with something called RDFa. Structured data consists of enriched snippets of information that are self-descriptive to a machine. The goal is to be more precise about the nature of your content, so that Google will reuse the data, which gives your website a competitive edge.
Besides the possibility of gaining more space for your website on the search engine’s results-page, it also demonstrates to the user that there is more related or ancillary information on the website. So it is not just for search engines’ sakes that you mark up your content. Examples of other uses are what you, as a visitor on a website, can do with the information. Such as being able to click on a phone number on a website and make a phone call, import contacts to your address book and add an event to your calendar directly from a website – since your browser understood the content.
That users themselves can take advantage of semantically marked up information has not exactly taken most users by storm. Now there is a chance that this will change as the Web is increasingly used by other types of devices where the lack of a keyboard and mouse can be alleviated with these opportunities, and information can be used in a more intuitive way adapted for each type of device.
To regard the Web as a distributed database, or a global document management system, is perhaps not so strange. What semantic technology adds is to put the document’s type on each document, or subset within a page. If a web page contains calendar information and a geographic location, it helps other systems to give the user a choice of filtering within a larger amount of information. The content can declare what it is.
Microdata standards such as Schema.org and Microformats
Microformats and Schema.org are the two most common ways to extend the semantics of web pages today. Both consist of a number of specifications for how HTML code should look to stand out from the body text and other HTML elements. Microformats were released early and remain a standard under continuous development, with the idea of offering simple ways to extend the HTML standard with semantic meaning – for marking up contact information, among other things.
It involves adjusting the HTML code to follow certain patterns (see the examples below) where the markup both conveys information in a machine-readable format and, at the same time, produces a reader-friendly presentation on the page. A related technology is RDFa (Resource Description Framework in Attributes), another method of enriching markup with the type of its content.
Examples of code in Microformats to describe a person’s contact information.
<ul class="vcard">
<li class="fn">Jane Doe</li><li class="org">Acme Inc</li>
<li class="tel">555-12 34 56</li>
<li><a class="url" href="http://example.com/">http://example.com/</a></li>
</ul>
Geographic location in Northern Sweden marked up with Schema.org
<div itemscope itemtype="http://schema.org/Place">
  <span itemprop="name">Bräntberget Ski lifts, Umeå</span>
  <div itemprop="geo" itemscope itemtype="http://schema.org/GeoCoordinates">
    <meta itemprop="latitude" content="63.841066" />
    <meta itemprop="longitude" content="20.311139" />
  </div>
</div>
Schema.org is an industry standard from Google, Yahoo and Bing that began in the summer of 2011. It is a joint effort to remedy the fact that Microformats development has slowed down significantly since 2005. Schema.org is your primary choice if you have not yet begun using microdata for your information.
There are many circumstances where microdata can be used to enrich information. Here is a short list to exemplify the scope:
- Contact information and authorship.
- Geographical locations.
- People and organizations.
- Health data and medical procedures.
- Products, offers, reviews.
- Books, movies, recipes, paintings.
Most of these can be combined to give a geographic position in a company’s contact information, for instance. The full list of entities is quite long13.
What this structured data is then used for varies from service to service. How it looks on Google is something we all notice quickly, and it has rapidly been included in the best practices of search engine optimization – not only to lure visitors from the search engine but also to increase the page's value in the search engine's algorithm. Better structure is a qualitative measure many of us can improve. There is nothing to prevent other actors from taking advantage of the same microdata in their own services – all this microdata is just as available to anyone else as it is to Google, such as an organization's own enterprise search engine.
Digital Asset Management (and Adaptive Content)
Glossary – Digital Asset Management
The structured work to collect, describe, keep and use digital resources in a usable archive. Sometimes known as an image bank, but usually contains more than just pictures. Often called DAM (Digital Asset Management) or MAM (Media Asset Management).
Glossary – Adaptive Content Management
In essence the same thing as DAM, but with a focus on multi-channel challenges, such as serving different versions of material, depending on the type of unit or device it is to be consumed on, including mobile phones, desktop computers, wearables, televisions in shop windows and so on.
The benefits of Digital Asset Management (hereafter called DAM, and including Adaptive Content) are primarily two-fold. First, you get a central location for storing and managing media files for repeated use – a place to look into, or to integrate other systems with. Second, the DAM system is often responsible for distributing material to other information systems. In an enterprise scenario, it is easier to adopt a single system that is skilled at optimizing images for the Web, or at streaming video suited to the receiver's available bandwidth, than to have every web system do this. In large organizations, it is common to have many systems that are accessible through the Web, but many of them have great problems living up to the Web's fast-changing needs.
DAM described on Wikipedia14:
“Digital Asset Management (DAM) consists of management tasks and decisions surrounding the ingestion, annotation, cataloging, storage, retrieval and distribution of digital assets. Digital photographs, animations, videos and music exemplify the target areas of media asset management (a subcategory of DAM).”
A DAM system may not be a perfect fit for you if you run a personal WordPress website. Nevertheless, to regard all media files, such as images, audio, and others, as digital resources may be worthwhile anyway. The day may come when we ask why we did not have more foresight regarding file structure.
The similarities with document management for large organizations are striking. It is about preserving the original files, standardizing metadata that is attached to each file, controlling who has access to what files and offering accessible copies via websites. Exemplified by a photograph, the original is a digital raw file taken directly from a camera; the metadata describes the photo’s properties such as the aperture and its shutter-speed, access rights, license for the image and so on. The viewable copy via the Web is a JPG image optimized for viewing on the Web.
Examples of files that can be part of a DAM system:
- Illustrations, information graphics and images.
- Brochures, graphic productions and originals of other printed matter.
- Videos and movies.
- Sound clips, podcasts and audio effects.
- 3D-printer files and other digital drawings.
Usually you leave out the most obvious office documents, like word processing and spreadsheets. Exactly where to draw the line can be quite difficult and I have seen examples of DAM systems that also stored Excel files. Perhaps because the DAM systems did a better job than the document management systems.
The most basic editing capabilities are often included in a DAM system, for example, to crop an image. This edited version of the material is stored as a copy of the original so it is possible to monitor the use of an image.
Examples of factors that suggest the need for a DAM system – instead of using the upload directory on your website – can be:
- Access control for serving logged-in users a certain image while others see a very low-resolution version, thus encouraging registration on the website.
- Special access to give freelance photographers and print shops access to project files, so you do not have to send physical stuff such as USB sticks or DVDs.
- Management of multiple channels where a DAM system makes it easier to have an overview of communication across the ever-growing number of devices and channels.
- Personalization such as where visitors get content which is popular in their geographic vicinity, or videos are automatically subtitled in a language they understand.
- Device customization by sending materials in high DPI to devices that support this, or to send the best suited format for the device.
- Connectivity customization to send, for example, streamed video tailored not only to the recipient’s device but also in the best possible resolution for the bandwidth available.
- Context factors where, if a user’s device is low on battery, you limit the amount of network traffic, or to adapt the content’s contrast to the lighting conditions around the user.
- Target audience customization so that information sent is comprehensible to the recipient based on their level of knowledge; this is solved by relating different editions of content classified by target audience.
- Legal requirements such as when using images from an external stock photo provider and the need to keep track of which possibly time-limited publication licenses these images have.
- Findability since it is easier finding content if it is stored in a single location, probably with more consistent metadata than if managed in lots of systems.
- Support marketing by carrying out A/B testing and ensuring that the system, after a completed A/B test, sends the best-performing material to end-users.
- Analysis of content usage is something more people should pay attention to. Do you know how many times the files in your upload directory were downloaded? Perhaps it is your most popular content. Is there deep linking to it from other websites, to content you cannot find in your analytics tool? If it is all there in one place, it is easier to track.
Holy shit, you might think. All of that, yet nothing that concerns you with your little website. The snag is just that the need for this type of orderly handling is sneaking up on you over time. It’s never an attractive idea to stop producing content and reflect on inconvenient shortcomings needing massive work to overcome.
Imagine that you have a super profile photo of your CEO, such a photo may end up in/on these and other places:
- Your own website for contact info.
- On your intranet.
- In the human resources’ system and the CEO’s corporate ID card.
- On the mobile website, in the mobile app and tablet app.
- In printed matter, such as the company’s annual report.
- In e-mail newsletters.
- On internal and external blogs.
- On Micro-blogs, like a Twitter profile picture.
- In social media on official accounts.
Losing the photo's original makes life tricky in just a few of the above scenarios, when all you really needed was to crop the photo for some new hyped online service. All official material in use needs to be in a common location, searchable by anyone who needs to communicate. Having a DAM system available to everyone in the organization, and not under the total control of an old-fashioned, print-oriented communication department, is therefore a goal to aim for.
The challenge with a DAM system is that it never gets better than its content, or its structure. Who should be allowed to contribute content – anyone, or only a qualified media editor? Personally, I would probably push you toward a mix of professional editors and just about anybody, perhaps in two different zones. One is the ‘demilitarized zone’, where the organization keeps enough control to dare to invite media and partners – probably press photos and other resources for public relations. Then there is the place where various levels of chaos thrive. Like Dante's circles of hell, you might plan for distinct levels of structure in user-generated content, managed through thresholds set by the organization; those who contribute content can choose the level of structure, and thereby findability, for their content. If the user intends or allows others to reuse the content, the DAM system guides the user to enter the right amount of metadata and to set permissions correctly – an encouraging interface, perhaps using game mechanics, walking the user through the process of giving enough information that someone else stands a chance of finding it later on. Another option is to require a certain minimum number of keywords and categories before the user can add content to the DAM.
Keywords may need to be given a graphical representation rather than assuming that users understand the need to type a comma between each keyword. A keyword suggestion service is a good idea to create some kind of uniformity in keywords.
The information source that provides keyword suggestions may be a centralized metadata service that maintains the vocabulary in use and curates your own folksonomy. Such a metadata service would of course also be used by all other information systems. It is important that the metadata service does not only suggest words, but also attributes unique identifiers to every word. Words may appear in several places in a vocabulary with a tree structure, or across a number of similar glossaries, and you need to keep them apart.
A media item labeled with a keyword needs to keep track of both the word as plain text and the unique ID of the keyword as a reference to the metadata service. It then becomes easier, in the future, to move a referenced word from the folksonomy into a more orderly, customized vocabulary, since you do not have to change the metadata on all content previously tagged with that word. It also helps if you decide to go through the folksonomy in search of synonyms you want to associate with each other.
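The idea can be sketched in a few lines. The IDs, terms and field names below are invented for illustration; the point is that tagged content references the keyword by ID, so the term can be renamed or promoted centrally without touching the content:

```python
# Hypothetical central metadata service: keywords with stable IDs.
vocabulary = {
    "kw-0042": {"term": "ski lift", "source": "folksonomy"},
}

# A media item stores both the human-readable label and the stable ID.
tagged_media = {
    "file": "brantberget.jpg",
    "keywords": [{"id": "kw-0042", "label": "ski lift"}],
}

def promote_keyword(vocab, keyword_id, new_term):
    """Rename a folksonomy term and move it into the curated vocabulary.

    Tagged content is untouched: it references the keyword by ID."""
    vocab[keyword_id]["term"] = new_term
    vocab[keyword_id]["source"] = "curated"

promote_keyword(vocabulary, "kw-0042", "chairlift")
print(vocabulary["kw-0042"])
```

A display layer would look the current term up by ID, so every previously tagged item follows the rename automatically.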
Also keep in mind that there is other metadata that needs to be entered for the information to be useful. In all honesty, the magical day when we have time to fix it all afterwards is unlikely to come. Video and audio clips may need time codes for when various sections occur inside the clip – their equivalent of the obvious headers in text; the chapters help us skim through the content. On YouTube, for example, you can right-click on the video and get a link to the exact position in the clip, and when commenting on a video you can write a time reference in the minutes:seconds format, which creates a link to that position in the video. This is the equivalent of a chapter in an audiobook or movie. If this information is entered in the central DAM system, it can offer an automatic table of contents for podcasts, for instance.
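Turning a human-readable chapter mark into a machine-usable deep link is a tiny transformation. A sketch (the `?t=` query parameter is the convention YouTube-style links use for a seconds offset; the helper name is my own):

```python
def timecode_to_seconds(timecode):
    """Convert a 'minutes:seconds' chapter mark to a seconds offset,
    suitable for building a deep link such as ...?t=330."""
    minutes, seconds = timecode.split(":")
    return int(minutes) * 60 + int(seconds)

print(timecode_to_seconds("5:30"))  # 330
```

Stored centrally in the DAM system, such offsets are all that is needed to generate a table of contents for a podcast or video.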
Those who frequently use podcast apps may have noticed that the cover image in a podcast sometimes changes by chapter, which is suitable for, among others, DJs who want the original artist covers on the image to the chapter.
There are metadata standards for many specific types of files. A photograph contains EXIF data with information about the shutter-speed and other camera settings, which can be useful. ID3 is a tagging system for audio and music that can be worth using. Mainly because it is embedded in the media-file and therefore will follow the file wherever it might end up.
Adaptive content, as in versatile content, appeared as a concept around 2012 and is a subset of what many probably would expect from a fully featured DAM strategy. It addresses the challenges of keeping control of content regardless of which channel it is communicated through, though most seem to be concerned with mobiles and wearables. The goal is, like the idea of responsive web design, that the content is adapted to the actual context.
“Think of your core content as a fluid that gets poured into a huge number of containers. Get your content ready to go anywhere because it’s going to go everywhere.”
– Brad Frost15
If you make a video, it should be sent in a format optimized for what each channel can actually deliver. Sometimes you stumble across examples of when things did not go as intended. Among other things, I noted that a local tech-shop franchise ran a campaign in their storefront where the commercial was made for landscape mode even though the screen was mounted in portrait. At first, I did not understand what made it look weird, but as soon as I saw the odd shape of a wheel, I got it. The proportions of the film suffered heavily because the black bars that should have appeared above and below the picture were missing.
In your own storefront, you will probably not have problems with bandwidth and can show the best possible resolution without undue compression. The same videos, when presented online, need lots of compression, especially if they are to be streamed, since you risk lag if the bandwidth is not good enough. It might be easier to think of this in terms of still images in an ad context, as in Google AdWords and the like, where there are a number of standardized image sizes for everyone to use. We should apply the same concept when storing files in a DAM system. It is not just about the content's needs and opportunities but also about the receiving devices. The difficult balance is to avoid creating lots of optimized content for each recipient and instead try to cover as many scenarios as possible.
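The idea of standardized renditions can be sketched as a small selection function. The widths and the selection rule are illustrative assumptions, not a standard:

```python
# Standardized rendition widths stored in the DAM, instead of one
# bespoke file per recipient.
RENDITION_WIDTHS = [320, 640, 1280, 1920]

def pick_rendition(requested_width: int, device_pixel_ratio: float = 1.0) -> int:
    """Smallest stored rendition that still covers the requested layout
    width on the given screen density; fall back to the largest we have."""
    needed = requested_width * device_pixel_ratio
    for width in RENDITION_WIDTHS:
        if width >= needed:
            return width
    return RENDITION_WIDTHS[-1]

print(pick_rendition(600))        # -> 640
print(pick_rendition(600, 2.0))   # -> 1280 (a Retina screen needs twice the pixels)
print(pick_rendition(3000))       # -> 1920 (largest available)
```

A handful of fixed sizes covers most scenarios without fragmenting the content per device.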
“Fragmenting our content across different ‘device-optimized’ experiences is a losing proposition, or at least an unsustainable one.”
– Ethan Marcotte, author of the book Responsive Web Design
It is not only the proportions, or the size of pictures, that are challenging. The list of challenges can be quite long, depending on what you consider to be included in the DAM system's responsibilities for multi-channel communication. Personally, I would add that this system, regardless of what your other content systems are capable of, should be able to send context-aware content so that the consuming devices' needs are met.
Also list all the features you need, for instance whether:
- Pictures are sent in normal and perhaps also high resolution (what Apple users call Retina).
- Video content is to be streamed and/or downloaded.
- The resolution of the material is to vary depending on system-external circumstances, such as what the consuming device can handle in each case.
- The compression ratio should be adjusted automatically but can also be selected manually. Do you need compression to be optional for files created with professional tools, where the uploader has already sorted out the compression?
- The format may depend on the receiver. Text may come as Word, HTML, PDF, Markdown, and so on.
What will you send to an iPhone 6S with its Retina display locked in landscape mode, on a medium-fast cellular connection, located outdoors at noon?
It is similar to matchmaking. Would you prefer to reuse an old advertisement in HTML5 format? Can the device requesting the content even display HTML5, or should a list of other options be considered, such as sending a picture instead, or nothing at all? Sometimes it is pointless to try to send something, and you should instead ask the recipient for an e-mail address to send the content to. The point of adaptive content is measuring how well your choices, and content, perform for actual users. This is done with A/B tests, a form of competition between two versions (version A and version B) which are randomly distributed to visitors during a limited time. The one that continues to be used after the test is the one that performed best for users. This way, you know what works in the respective situations users find themselves in. What works might differ between desktop computers and mobile phones, or other segmentations.
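An A/B test needs stable assignment so a returning visitor keeps seeing the same version during the test. A common technique, sketched here under assumed names, is deterministic hashing of a user identifier:

```python
import hashlib

def ab_variant(user_id: str, experiment: str) -> str:
    """Deterministically assign a visitor to variant A or B, so the same
    user always sees the same version for the duration of the test."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"

# The split is roughly 50/50 over many users, with no state to store:
variants = [ab_variant(str(i), "landing-page-test") for i in range(1000)]
print(variants.count("A"), variants.count("B"))
```

The winner is then decided by a metric you define up front, for example conversion rate per variant.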
Image and media banks in your publishing system
For smaller organizations, or those with lesser needs, one of the image and file managers that integrate with the web content management system is certainly good enough. These systems have slightly different focuses. The ones I have used are ImageVault, which has a bit of DAM functionality and an okay API, but whose main strength is primarily that it is a common combination with Episerver CMS, and FotoWeb, a more complete suite for those who need advanced search capabilities and integration with the software professionals in the graphic design industry use, such as Adobe InDesign.
Before choosing an off-the-shelf system, we should be clear about what feature set we expect. Perhaps the following may be important to you too:
- Using a suitable resolution, regardless of the size of the image in the system.
- Images should be sent optimized for the Web. Will it be possible to have exceptions from system optimization in the case that this gives a poor result? Manually override when necessary?
- Does your DAM need to be able to manage video? Streaming and / or download?
- Access control, or are all files for everyone’s use?
- Should high-resolution content be sent to retina screens?
- Is it easy to add external images and keep track of the licenses we accepted?
For those who want manual control over the optimization of a picture, I can recommend, besides Adobe Photoshop, the app ImageOptim16 for Macs. Just drag and drop images or folders into the window and it will fix it all.
Smush.it17 is the service to use if you want automated optimization of images. It is also virtually lossless to the human eye.
Personally, I think manual image editing is good enough for most of us if we have good structure in the image or file management in the web content management system. If you want to take your image management for the Web to the next level, go for a DAM system, even though it is a big project to manage.
Personalization of information
With the help of meaningful metadata, services have more freedom in what to show to website users. Personalization is about matching content with an interested receiver, being proactive in our communication. What is shown is controlled by metadata – metadata about the content and metadata about the intended users. Personalization is not about making the content personal; it is supposed to be individual and contextual, which works to some extent even when we do not know who the individual is. I bet you have used a service that managed to profile you and give you personalized content. A personalization I often run into is what Google presents in its Knowledge Graph, where geographic vicinity turns out to be a crucial personalization factor for things I search for. The Medical History Museum in Gothenburg was chosen in the right-hand column instead of the corresponding museum in Uppsala, probably because I was in, or am associated with, Gothenburg.
There are two types of data about a user, namely:
- Explicit data, which the user has actively provided by logging in as a customer, through customer records, completed forms and other deliberate means of sharing information.
- Implicit data, which is derived from the user's behavior and reveals information such as special interests, gender and other demographic attributes.
Examples of what you may know about each user are:
- Where are they located? There are many techniques to determine, more or less accurately, where a person probably is.
- What equipment is used? Whether the user is on a computer, tablet, mobile phone or other device might affect their behavior and what they are looking for during a visit. For example, you probably will not buy a house via a mobile browser, but browsing pictures or looking up facts is common.
- Is it their first visit, or a return visit? If there were earlier visits, cookies may be used, or if the visitor is logged in, you can store usage data without the user's active consent. Did the user look at the same content during an earlier visit? Did they abandon a shopping cart with contents that may be worth reminding them about? To some extent, we can make use of so-called remarketing through ad systems to get returning visitors to landing pages optimized for them and highlight products to remind them about.
- Where was the user before they ended up on your page? Was there a search for something specific in a search engine, a link from your own newsletter, or did the user come from a page with a context that should influence the content of your page – for example, a campaign on social media?
- Is there available information about the user's preferences? If there is a logged-in customer, you can collect your own data. If your service is, for example, an online video service, there are certainly conclusions to draw about which type of video a user tends to watch, and so on. Consider whether there is any data you can take advantage of. The chosen language on a website certainly reveals more than just the mother tongue of a user. Bestselling books on Amazon most certainly vary with the language a user is fluent in.
- Does their navigational behavior reveal something? Whether a user stays within one category of content or constantly jumps between different categories may indicate a particular interest, or possibly a wish to be surprised. Do search analytics, since users of the search function enter words that explain what they are after.
In practice, you build categories of prospective users. One category probably needs to be 'others', which is where the default, non-personalized users end up – the ones you do not have enough data on just yet, the equivalent of a non-personalized website's standard mode. If enough information about a user points to a specific category, the threshold is reached and the user starts to receive targeted content. These categories are indicative of what is shown on each page; some things may still be shown regardless of personalization. For example, travel agencies might prefer to suggest your nearest major airport of departure regardless of the personalization category you are in.
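The threshold mechanism can be sketched like this; the category names, scores and threshold value are invented for illustration:

```python
def assign_category(signal_scores, threshold=3.0, default="others"):
    """signal_scores: accumulated evidence per category, e.g. points from
    page views and search terms. A category wins only if it clears the
    threshold; everyone else stays in the non-personalized default."""
    if not signal_scores:
        return default
    best, score = max(signal_scores.items(), key=lambda kv: kv[1])
    return best if score >= threshold else default

print(assign_category({"b2b": 1.5, "consumer": 2.0}))   # -> "others"
print(assign_category({"b2b": 1.5, "consumer": 4.5}))   # -> "consumer"
```

The default keeps personalization honest: with too little evidence, the user simply gets the standard website.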
Examples of personalized content
If there are enough good reasons to affect a page's content, it is sent in an altered form to the user, following a pre-defined template of alteration. It may be, for example, which products you suggest or which office's contact information is shown. This is done using thresholds, which can be placed at different levels of the website depending on what type of website it is. For an international company, you can regard it as a competition between different contact details, where the user's geographical vicinity and language are the determining factors. Thresholds serve to determine whether a user has met the criteria of a personalization category and, secondarily, to decide which category wins if there are several.
Take the scenario of a global sporting goods store, for example. Whether the user lives in the northern or southern hemisphere can be quite crucial to the type of goods that are appropriate at a given time (because the seasons are opposite, of course). Compare that with a global music store, where there are also regional differences but probably not as dramatic – here it is more about providing what can be delivered on the user's market.
To return briefly to the example of the online bookstore – where do you think they should preferably highlight textbooks for college students? In the 'Private' or the 'Companies and Government library' category? Most divisions are there to position the more relevant content for the different categories of users; however, the choice is not always designed so obviously that you as a user must pick between a blue and a red pill.
A lingerie boutique would probably be glad to know their user’s gender, on a normal day to display men’s underwear to men, but just before some holiday or special day to suggest gifts to a statistically probable girlfriend. Is there any pattern in the user’s browsing behavior that reveals gender? Is it possible to find out the user’s gender through an external advertising platform?
The car manufacturer Volvo differentiates those who intend to buy a Volvo, and those who have already bought one. Their needs are slightly different. A potential customer might need suggestions on financing through Volvo’s financial services, while those who already own a Volvo might need to be reminded of the benefits of original spare parts, and authorized workshops. Not only that, they try to get to know how close to buying a person is. My translation into English18:
“Volvo maps potential customers based on how close they are to buying. Those who seem to be far from making a purchase are going to look at the products, pictures and movies. Those who are closer to buying want to book a test drive, make detailed choices and get price quotes. Therefore, we present different content based on the user’s past behavior.”
– Mikael Karlsson, mobile manager at Volvo Car Corporation
Not everyone is as obvious as some tech outlets. When visiting some of them, you are prompted if you would like to enter as a company or as an individual consumer. If you choose to enter as a company, campaigns on buying servers and network equipment are shown on the homepage. If you instead choose to enter as an individual, prices include VAT and the enterprise offers are gone. Instead, they make room for wearables, televisions and gaming video cards.
Perhaps the simplest variant is all those websites trying to decide in which state or region a user lives and forwarding them to a regionalized landing page. Most often also visible in the address bar, regionalization affects what news and other things a visitor sees. In cases where it is not possible to decide the visitor’s location, a neutral variant, a default mode if you wish, appears. In essence, there are two categories of experiences within the very same website; those ‘positioned within a regional setting’, and ‘everyone else’.
In a content management system supporting personalization, it is important that there is a feedback loop for web editors so no content is created which is unlikely to reach a user of the personalized website. In what ways can you categorize your users or customers? Without annoying them?
Something that is guaranteed to annoy your users is when they try to follow a link that turns out to be dead. Time to talk a little about the delicate subject of URL strategy.
URL strategy for dummies
Glossary – URL (Uniform Resource Locator)
Is the full address of a web page or resource on the Internet. For example, http://mywebsite.com/contact/ according to the scheme protocol://domain.topleveldomain/subfolder-or-file.extension
Often called web address, or address for short.
URLs are addresses on the Web and are used as a common reference to some kind of online resource. This resource can be the homepage of a website, a sub-page, an image or any other type of resource. URLs are important for humans as well as machines to be able to address web pages, uploaded files or data. Just as a physical street address is expected to persist over time, it is preferable that an established address lives long and has roughly the same content at your next visit.
It can be easy to forget what happens when you create a new page or put up new material. If the content attracted any attention at all, breaking its URL risks the following:
- Links from other sites to yours will break. It may be that the URL is used on someone else's intranet, which you may not figure out until you eventually see a pattern in the statistics for your 404 error page.
- People who bookmarked the page end up on an error page. Nowadays few people bookmark in the browser to the same extent as before, but it is not yet an irrelevant point.
- The website loses value in search engines. It can be risky, especially for those who depend on search engines for their traffic, to scrap many addresses search engines are already familiar with. In part because an older URL is always worth more than a new one, but also because there is a built-in suspicion in search engine algorithms – they are after all battling against search engine spam on a daily basis.
Search engine optimizers think it is worth working hard to get natural links to a website. Seen through their eyes, it is obviously an incredible waste not to take an interest in established addresses that are already known to search engines.
Quality indicators of a URL are that it should:
- Be designed to persist over a very long time.
- Specify who the sender, or owner, is.
- Describe what is to be found at the address.
- Be as brief as possible, not contain non-essentials, and be easy to memorize or read out over the telephone.
- Follow the naming standard, i.e. not contain special characters, no capital letters, no underscores, etc.
- Have been around for a while, which is a sign of seriousness for a search engine.
- Refer to something unique, in other words, there should only be one way, a single URL, to reach the unique content.
- Be functional. If the address is hierarchical, it should be possible to hack it, that is, erase parts of the address to reach a higher level in the structure.
- Send the correct status code according to HTTP. A missing page is 404; if the content has moved to another URL, you should send status 301 with a reference to it, and so on.
- Have some sort of spell-checking tolerance, so it copes with mishaps such as the (unnecessary) www prefix being present or missing, and addresses with or without trailing slashes.
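A minimal sketch of the status-code behavior from the list above, with an assumed redirect table and a stand-in for the CMS lookup (both invented for illustration):

```python
# A moved page answers 301 with its new location; an unknown one gets 404.
REDIRECTS = {
    "/old-campaign/": "/campaigns/spring/",
    "/about-us.html": "/about/",
}

def path_exists(path: str) -> bool:
    """Stand-in for a real CMS lookup."""
    return path in {"/", "/about/", "/campaigns/spring/"}

def resolve(path: str):
    """Return (status_code, location_or_none) for a requested path.
    200 means the path is served as-is."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]
    if path_exists(path):
        return 200, None
    return 404, None

print(resolve("/old-campaign/"))   # -> (301, '/campaigns/spring/')
print(resolve("/missing/"))        # -> (404, None)
```

The key point is that a moved address should never answer 404: the 301 tells both browsers and search engines where the established address went.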
This requires, of course, inclusion in the design of a web system or inclusion as a requirement in procurement. All those who could influence the choice of addresses need to be informed about the supposed URL standard. Web editors should really spend more effort on the address than the header since the address is not something they can change without penalties later on.
Please document your view of a good URL strategy and try to follow it. If you have editors, it is a good idea to inform them, especially regarding time-bound information such as calendar events and news. Uploaded files in particular often get little reflection on the file name, which usually becomes part of the URL.
Common excuses for breaking established URLs
As the header suggests, I believe that on most occasions when broken addresses occur, it is because something other than the users' best interests was prioritized. But it is hard to blame anyone, as most people do not seem to have spent much time thinking about this subject. I have never encountered a URL strategy during my 18 years in the web industry – I may well have to write one myself someday.
Now some examples of what I have been told is the cause of broken URLs, and suggestions on how to work.
“But we have closed that website, now the same info is located over there…”
That a website is shut down is nothing strange or uncommon. Sometimes however, they are replaced by another website, on a different domain. Then you have the chance to retain some of the advantages of the established addresses. Here are some common variations on how not to handle old addresses:
- No matter what the requested address is, whether it has a new counterpart or not, the user is sent to the new website’s home page.
- Only when requesting the old home page, visitors are redirected to the new website. All addresses except the old home page are broken.
- The redirection of the old URLs is temporary and stops working after a couple of years, since no one believes they are used anymore.
If the first point affects you as a visitor, you will be surprised or annoyed, especially if you are in a hurry. It was not quite what you had hoped for. If the visitor is forwarded without warning, the new page has to meet the needs of the page originally requested; otherwise, you are supposed to say what has happened to the old website. Based on what pages are popular or important for other reasons, the user should be referred to other matching or corresponding content. More on that later.
“We have replaced our content management system and the new one could not handle…”
The requirement for any new web-related system should be both that it takes care of established URLs, which is not as complicated as you might think, and that new URLs are not limited by any form of system standard. Many governmental actors in my vicinity chose Episerver CMS in the early 2000s and therefore got a lousy system standard in the form of an unnecessary subfolder with design templates. As icing on the cake, the template's name and the page's ID number also appear in the address.
Addresses such as www.municipality.se/municipality-templates/OrdinaryPage____67241.aspx were common and are still seen sometimes. Imagine the usefulness of those addresses if you are going to read them out aloud to anyone. How many underscores are there? Will you remember that address tomorrow?
When upgrading to a newer version of Episerver CMS, which had more sensible URLs, or when replacing the CMS entirely, there were already established ugly addresses to sweat over for a long period.
“The old addresses were so incomprehensible – the new ones are user-friendly”
Excellent. If you are interested in address quality, you should be interested in dealing with all the old addresses even though they were ugly. Right? The old ugly addresses often contained identifiers for how information was retrieved from a database, probably a series of figures. You can catch the user's intent and serve the right content, even after a system change.
Taking care of old ugly URLs, or at least providing web editors with a manual tool, is something serious web agencies have offered for years. As usual, almost anything is possible, and such solutions are luckily on the cheap side. It makes sense in the majority of cases to do something and to make sure to add it to the requirements.
“We archive old pages continuously”
Continuously archiving published pages is honorable in a way, but what is the reason? Sometimes you hear that news should be removed because it is out-of-date and that calendar events should be removed shortly after the event is over and done.
What is not considered then is that the Web is an excellent archive, and information can still have value even though it dates further back. Is there perhaps a problem with the information structure that makes you annoyed by older content? Does it pollute the results found in your search engine?
A counter-measure would be to use metadata to instruct the search engine that the content is not very important, or perhaps to ask the search engine not to index the page at all. Another option you may not have thought of concerns, for example, the design of the news list or the calendar function. News lists can be developed to ignore news with an expiration date in the past, which would give the editor an option other than putting news in the trash to get it out of sight.
Calendar event page templates perhaps should be redesigned with a before-, during- and after-perspective. Before the event, an event page focuses on providing information, factual content and encouraging registration. During the event, the page can automatically switch to primarily guide those who cannot find their way, but also to add supplements giving event changes and report on hashtags used on Twitter, etc. After the event, perhaps a compilation of documents, captured images and the best of what the participants posted publicly on the Web would be suitable post-event content.
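The before-, during- and after-perspective is just a comparison against the event's start and end times. A sketch with invented dates:

```python
from datetime import datetime

def event_phase(now: datetime, start: datetime, end: datetime) -> str:
    """Which template variant an event page should show right now."""
    if now < start:
        return "before"   # facts, programme, encourage registration
    if now <= end:
        return "during"   # wayfinding, live changes, hashtag feed
    return "after"        # slides, photos, best public posts

start = datetime(2016, 5, 12, 9, 0)
end   = datetime(2016, 5, 12, 17, 0)
print(event_phase(datetime(2016, 5, 12, 12, 0), start, end))  # -> "during"
```

The URL stays the same through all three phases; only the presentation switches.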
See the above as suggestions for what you can do instead of throwing a page in the trash – and therefore killing an established URL. There is certainly a solution that will make your website even better.
It is also okay to declare that a page is outdated and archived. Prefer a warning text on the page and making it a little harder to find rather than deleting it so that it cannot be found at all if needed.
Ok, how to then?
Just as it is now obvious that an address should work regardless of whether the user connects with a mobile, a desktop computer or something else, it should be equally obvious that an address is valid everywhere. What is possible to show should be displayed.
In an increasingly digital world, in which we cooperate across borders, it is reasonable that all URL addresses should work whether I am navigating on an employer’s private network or a public one. If I find myself on the wrong network, or the URL points to a protected network, I should be served content of a level of access that is appropriate for me. The level can certainly be set to zero most times, but imagine a URL to a news item on the intranet of a government organization. If everyone is entitled to access it, then why not design the technology accordingly.
Some things should have extraordinarily good reasons to exist in a URL, for instance:
- The author's name. It is not certain that the author will be the one who administers the address throughout its entire lifetime.
- Subjects and other forms of categorization. This is probably the trickiest variant, as classifying content often feels future-proof. It is good to keep in mind that the actual word used for a classification tends to have a shorter life than we first expect. The same goes for the extremely common hierarchical addresses: what happens to the pages' URLs if a parent page later gets a new URL?
- Status of the content. Information status is supposed to change; therefore, you should omit it from the URL.
- File extensions like .html, .php and the like for webpages. For uploaded files, however, file extensions are okay. The problem with .php and the like is that they show system information, if you replace the system you might be forced to break all established addresses.
- Forced folder name or traces of system standards. In the past, we saw folder names such as /cgi/ but nowadays we more often see /cms-templates/ or anything that does not directly contribute anything more than length to URLs.
- Access levels. It is usually not that smart to have access group names in URLs when the name of these groups can be changed within the time a URL can be expected to live. It can cause problems if you come across an address created for an access level other than the one to which you belong.
- The date the content was created. For meeting notes it is okay, since the date actually belongs there, but otherwise dates in URLs eventually give the impression that the information is old and not updated. That is not good. The content at a URL should be maintained and kept up-to-date, and a creation date in the address works against that impression.
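A few of the warnings in the checklist above can be linted automatically. A sketch with illustrative rules; a real URL strategy document would define its own:

```python
import re

def url_smells(path: str) -> list:
    """Flag URL patterns the checklist warns about. Illustrative, not exhaustive."""
    smells = []
    if re.search(r"[A-Z]", path):
        smells.append("capital letters")
    if "_" in path:
        smells.append("underscores")
    if re.search(r"\.(html?|php|aspx?)$", path):
        smells.append("system file extension")
    if re.search(r"/\d{4}-\d{2}-\d{2}/", path):
        smells.append("creation date in path")
    return smells

print(url_smells("/Municipality-Templates/OrdinaryPage____67241.aspx"))
# -> ['capital letters', 'underscores', 'system file extension']
```

Such a check could run in the CMS before a page is published, nudging editors toward cleaner addresses.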
URLs in print and digital distribution
Will the contents of a URL be printed, or distributed beyond your control, like the contents of a newsletter or similar? A reason to use a URL service is when you do not have control over the website you link to and cannot make adjustments afterwards. Such services are available online, or you can set up your own. Several of the most popular services let you use your own, preferably short, domain name or pick whatever domain the service offers. The difference lies in how much control you want over time.
You create an address for a specific intended use, connect it to where the address should refer, and then the address is ready for use. The target can be changed in the administration interface of the service you are using, while the address you published remains the same. Not only that, you also get statistics on the use of the address, which can be anything from a simple visitor count to more advanced analytics.
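The mechanics of such a URL service can be sketched in a few lines; the class and method names, and all addresses, are invented for illustration:

```python
class ShortLinks:
    """Minimal sketch of a URL service: a short code maps to a target
    that can be changed later, while the published address stays stable."""
    def __init__(self):
        self.targets = {}
        self.clicks = {}

    def create(self, code: str, target: str):
        self.targets[code] = target
        self.clicks[code] = 0

    def retarget(self, code: str, new_target: str):
        # The printed/published address keeps working; only the target moves.
        self.targets[code] = new_target

    def follow(self, code: str) -> str:
        self.clicks[code] += 1        # the "simple visitor count" statistic
        return self.targets[code]

links = ShortLinks()
links.create("spring16", "https://example.com/campaigns/spring/")
links.retarget("spring16", "https://example.com/campaigns/summer/")
print(links.follow("spring16"))   # -> the new target
print(links.clicks["spring16"])   # -> 1
```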
If you are willing to scrap addresses
Now if you have to break many addresses on your website, make sure to do a proper web analysis beforehand so you know what you are messing with. It might not be worth it.
A well thought-out archiving solution is your choice in this regrettable situation. My suggestion is that all purged addresses lead to a so-called error 404-page that informs users that the page no longer exists. You will find later on in this book how such a page can be designed.
The error 404 page itself needs to survive a long time to take care of dead addresses, which are sometimes entire domains that have been shut down. In addition to taking care of stray visitors, it should collect data to give insight into which of the previously used addresses are used the most. If there are enough people who visit a purged address and a new equivalent is present, it is a good idea to send the users on their way to the new content – it should not be the user’s problem that you broke the links.
Besides manually linking popular defunct addresses to new functional ones, you can use search engine technology to make educated guesses about corresponding pages. For instance, suppose the government scraps all established addresses on one of its websites, and a user follows links to learn more about a specific federal agency.
The following URL is used, but no longer works in this hypothetical example:
The most obvious solution is to look at the address and realize that it contains things that describe the page, namely: federal administration native americans. Words that can be used as keywords for an automated search, supporting the error 404 page that may give suggestions on where to go.
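Turning a dead address into a search query is mostly string splitting. A sketch with an assumed stop-word list and an invented example path:

```python
import re

def keywords_from_url(path: str) -> list:
    """Split a dead URL's path into words a 404 page can feed to the
    site search as a suggested query."""
    words = re.split(r"[/\-_.]+", path.lower())
    # Drop empty fragments, technical noise and bare ID numbers.
    stop = {"", "www", "index", "html", "htm", "php", "aspx", "pages"}
    return [w for w in words if w not in stop and not w.isdigit()]

print(keywords_from_url("/directory/federal/native-americans/index.html"))
# -> ['directory', 'federal', 'native', 'americans']
```

The resulting words go straight into the site search, and the best hits are presented as suggestions instead of a bare error message.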
It is also quite common to find numbers in URLs. A number might identify old content in a structured database and can be used to connect to a new address. If that were the case, only those who manage the website would know, but for you, it is worth checking out where the number came from and if it can be used for mass-redirection.
In case you want to save a lot of old addresses and know exactly how to do it, create a redirect rule from the error 404 page that applies to all addresses in /directory/federal/ and sends visitors on to the corresponding new addresses. Technically, this redirection should occur before an error 404 message is sent from the web server, but bring that discussion to your developer if necessary.
If no given redirection rule exists, it is worth picking up some of the best search results and telling the user that the information they were looking for could be among those results. If you are on a governmental website and want to enter a search query with all the words in the address of the federal agency’s page, the search engine can actually help. These types of searches can be done by your error 404 page and presented instead of leaving the visitor in the lurch.
I hope that not everything about information architecture was overwhelming for you. Now to some more light-hearted things, as the topic of web design is coming up.