
Data quality and error in data are often neglected issues in environmental databases, modelling systems, GIS, decision support systems, and similar tools. Too often, data are used uncritically, without consideration of the errors they contain, and this can lead to erroneous results, misleading information, unwise environmental decisions, and increased costs.
Most corporate data consist of various databases linked by real-time and batch data feeds. The data continuously move about and change. The databases are endlessly redesigned and upgraded, as are the programs responsible for data exchange. The typical result of this dynamic is that information systems get better while data quality deteriorates. This is unfortunate, since it is data quality that determines the intrinsic value of the data to the business and to consumers. Information technology serves only as a magnifier of this intrinsic value. Thus, high-quality data combined with effective technology is a great asset, but poor-quality data combined with effective technology is an equally great liability.
A transcription error is a specific type of data entry error commonly made by human operators or by optical character recognition (OCR) programs. Human transcription errors are usually the result of typographical mistakes: putting a finger on the wrong key during touch typing is the most common cause. The slang term “stubby fingers” is used for people who frequently make this mistake. Electronic transcription errors occur when a scan of printed matter is compromised or the text is set in an unusual font; if the paper is crumpled or the ink has smudged, for example, the OCR program may misread the text and introduce a transcription error.
Examples of Transcription Error
Input: Joseph Miscat
Instead of: Joseph Muscat
Input: 23rd of August
Instead of: 23 August
Input: Jishua
Instead of: Joshua
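As a sketch of how such errors can be caught automatically, the snippet below fuzzy-matches each entry against a reference list of known values using Python's standard difflib module. The names and the similarity cutoff are illustrative assumptions, not part of any particular system.

```python
# Minimal sketch: flag likely transcription errors by fuzzy-matching entries
# against a reference vocabulary. Names and cutoff are illustrative.
import difflib

known_names = ["Joseph Muscat", "Joshua", "Gregory"]  # hypothetical reference list

def suggest_correction(entry, vocabulary, cutoff=0.8):
    """Return the closest known value, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(entry, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_correction("Joseph Miscat", known_names))  # -> Joseph Muscat
print(suggest_correction("Jishua", known_names))         # -> Joshua
```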
Transposition errors are commonly mistaken for transcription errors, but the two should not be confused. As the name suggests, transposition errors occur when characters are “transposed”; that is, they have switched places. Transposition errors are almost always human in origin. They most often arise when a user is touch typing at a speed that causes a later character to be entered before an earlier one, the brain running one step ahead of the fingers.
Examples of Transposition Error
Input: Gergory
Instead of: Gregory
Input: 23rd of Auguts
Instead of: 23 August
Input: Johsua
Instead of: Joshua
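Because a transposition swaps two neighbouring characters, it can be tested for mechanically. The following sketch, using the examples above as test inputs, checks whether one string is a single adjacent transposition of another.

```python
# Minimal sketch: test whether `entry` equals `expected` with exactly one
# pair of neighbouring characters swapped.
def is_adjacent_transposition(entry, expected):
    if len(entry) != len(expected) or entry == expected:
        return False
    # Positions where the two strings disagree.
    diffs = [i for i in range(len(entry)) if entry[i] != expected[i]]
    if len(diffs) != 2 or diffs[1] != diffs[0] + 1:
        return False
    i, j = diffs
    return entry[i] == expected[j] and entry[j] == expected[i]

print(is_adjacent_transposition("Johsua", "Joshua"))    # True
print(is_adjacent_transposition("Gergory", "Gregory"))  # True
print(is_adjacent_transposition("Jishua", "Joshua"))    # False (substitution, not a swap)
```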
The most obvious cure for these errors is for users to watch the screen as they type and to proofread their work. Where entry occurs through data capture forms, databases, or subscription forms, the developer of the forms or the database administrator should apply input masks or validation rules, as in the sketch below.
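As an illustration of the kind of validation rule a form or database layer might enforce, the sketch below accepts a date field only in a “day month” format; the field name, the pattern, and the accepted format are assumptions made for the example.

```python
# Minimal sketch: a validation rule that rejects non-conforming date entries
# at input time. The expected format is an assumption for illustration.
import re

DATE_PATTERN = re.compile(
    r"^\d{1,2} (January|February|March|April|May|June|July|August|"
    r"September|October|November|December)$"
)

def validate_date_field(value):
    return bool(DATE_PATTERN.match(value))

print(validate_date_field("23 August"))       # True  - accepted
print(validate_date_field("23rd of August"))  # False - rejected at entry time
```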
Validation is a process used to determine whether data are inaccurate, incomplete, or unreasonable. The process may include format checks, completeness checks, reasonableness checks, limit checks, review of the data to identify outliers (geographic, statistical, temporal, or environmental) or other errors, and assessment of the data by subject-area experts (e.g. taxonomic specialists). These processes usually result in flagging, documenting, and subsequently checking suspect records. Validation checks may also involve checking for compliance with applicable standards, rules, and conventions. A key stage in data validation and cleaning is to identify the root causes of the errors detected and to focus on preventing those errors from recurring.
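The sketch below illustrates a few of these checks (completeness and limit checks on a hypothetical occurrence record), flagging suspect values for later review rather than silently discarding them; the record layout and the limits are assumptions.

```python
# Minimal sketch: completeness and limit checks that flag suspect records
# instead of deleting them. Field names and ranges are illustrative.
def validate_record(record):
    """Return a list of flags explaining why a record is suspect (empty = clean)."""
    flags = []
    # Completeness check: required fields must be present and non-empty.
    for field in ("species", "latitude", "longitude"):
        if record.get(field) in (None, ""):
            flags.append(f"missing {field}")
    # Limit checks: coordinates must fall within valid geographic ranges.
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is not None and not -90 <= lat <= 90:
        flags.append("latitude out of range")
    if lon is not None and not -180 <= lon <= 180:
        flags.append("longitude out of range")
    return flags

record = {"species": "Larus argentatus", "latitude": 135.2, "longitude": 14.5}
print(validate_record(record))  # ['latitude out of range'] - flagged, not deleted
```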
Data cleaning refers to the process of “fixing” errors in the data that have been identified during the validation process. The term is synonymous with “data cleansing”, although some use “data cleansing” to encompass both data validation and data cleaning. It is important that data are not inadvertently lost during cleaning, and that changes to existing information are carried out very carefully. It is often better to retain both the old (original) data and the new (corrected) data side by side in the database, so that if mistakes are made during cleaning, the original information can be recovered.
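One simple way to honour this “keep both” principle is to store the corrected value alongside the original, as in the following sketch; the record structure and the helper function are hypothetical.

```python
# Minimal sketch: apply a correction without overwriting the original value,
# so the change can be audited or reversed later.
def apply_correction(record, field, corrected_value, reason):
    record[f"{field}_original"] = record[field]   # retain the old value
    record[field] = corrected_value               # store the new value
    record.setdefault("corrections", []).append({"field": field, "reason": reason})
    return record

record = {"collector": "Joseph Miscat"}
apply_correction(record, "collector", "Joseph Muscat", "transcription error")
print(record["collector"], "| was:", record["collector_original"])
```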
A number of tools and guidelines have been produced in recent years to assist with the process of data validation and data cleaning of species data. The process of manual cleaning of data is a laborious and time-consuming one and is in itself prone to errors.
The general framework for data cleaning, illustrated in the sketch after the list, is:
- Define and determine error types
- Search and identify error instances
- Correct the errors
- Document error instances and error types
- Modify data entry procedures to reduce future errors
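Putting the steps together, the sketch below runs one pass of this loop over a toy dataset: the defined error type is an unrecognised species name, instances are identified by lookup against a reference list, corrections retain the original value, and every change is documented. All names, records, and thresholds are illustrative assumptions.

```python
# Minimal sketch of the framework as a loop: define an error type, identify
# instances, correct them, and document what was done.
import difflib

known_species = ["Larus argentatus", "Larus fuscus"]  # hypothetical reference list

def clean_dataset(records):
    log = []  # step 4: document error instances and error types
    for record in records:
        name = record["species"]
        # Steps 1-2: the error type is "unrecognised species name",
        # identified by a failed lookup against the reference list.
        if name not in known_species:
            match = difflib.get_close_matches(name, known_species, n=1, cutoff=0.8)
            if match:
                # Step 3: correct, retaining the original value.
                record["species_original"] = name
                record["species"] = match[0]
                log.append({"error_type": "unrecognised species name",
                            "original": name, "corrected": match[0]})
    return records, log

cleaned, log = clean_dataset([{"species": "Larus argentatis"}])
print(log)  # documents the corrected instance
```

Step 5 of the framework lies outside the code: the log produced here is the evidence used to decide which data entry procedures should be modified to prevent the same errors in future.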