█ BRIAN HOYLE
Data mining refers to the statistical analysis techniques used to search through large amounts of data to discover trends or patterns.
Data mining is an especially powerful tool in the examination and analysis of huge databases. With the advent of the Internet, vast amounts of data are accumulating. As well, the amount of data that can be generated from a single scientific experiment where stretches of DNA are affixed to a glass chip can be staggering. Visual inspection of the data is no longer sufficient to make a meaningful interpretation of the information. Computer-driven solutions are required. For example, to analyze the DNA chip data, the discipline of bioinformatics—essentially a data mining exercise—emerged in the 1990s as a powerful melding of biology and computer science.
The collection of intelligence and the monitoring of the activities of a government or an organization also involves sifting through great amounts of data. Coded information can be inserted into data transmissions. If this information escapes detection, it can be used for undesirable purposes. The ability to extract the suspect information from the background of the other information is of tremendous benefit to security and intelligence agencies.
An example of data mining that is of relevance to espionage, intelligence and security is the use of computer programs—such as the Carnivore program of the United States Federal Bureau of Investigation—to screen thousands of email messages or Web pages for suspicious or incriminating data. Another example is the screening of radio transmissions and television broadcasts for codes.
The formulas used in data mining are known as algorithms. Two common data mining algorithms are regression analysis and classification analysis. Regression analysis is used with numerical data (quantitative data). This analysis constructs a mathematical formula that describes the pattern of the data. The formula can be used to predict future behavior of data, and so is known as the predictive model of data mining.
For example, from a database of terrorists who have corresponded using emails, predictions could be made as to who will send an email and to whom. This would aid efforts to intercept the transmission. This type of data mining is also referred to as text mining.
Data that is not numerical (i.e., colors, names, opinions) is called qualitative data. To analyze this information, classification analysis is best. This model of data mining is also known as the descriptive model.
The data mining process involves several steps:
- Defining the problem.
- Building the database.
- Examining the data.
- Preparing a model to be used to probe the data.
- Testing the model.
- Using the model.
- Putting the results into action.
Database construction and model preparation—in essence the building of the framework for the mining exercise—requires about 90% of the data mining effort. If these fundamentals are done correctly, the use of the model will uncover the data that is of potential significance.
In July 2002, the Intelligence Technology Innovation Center, which is administered by the United States Central Intelligence Agency (CIA), pledged up to $8 million to the National Science Foundation, to bolster ongoing research into data mining techniques. United States intelligence officials suppose that terrorist organizations use Web pages and email to send encoded messages concerning future activities. Currently, unless a message is accidentally uncovered, only monitoring every Internet transmission from a region can reliably discover the covert information.
Also in 2002, the U.S. Federal Bureau of Investigation and the Central Intelligence Agency, under the direction of the Office for Homeland Security, have begun the joint development of a supercomputer data mining system. The system will create a database that can be used by federal, state, and local law enforcement agencies. Currently, the FBI and CIA have their own databases.
Another aspect of data mining is the linking together of data that resides in different databases, such as those maintained by the FBI and the CIA. Often, different databases cannot be searched by the same mechanism, as the language of computer-to-computer communication (protocol) differs from one database to another. This problem also hampers the development of bioinformatics (the computer-assisted examination of large amounts of biological
data). Increasingly, biological and computer scientists are advocating that databases be constructed using a similar template, or that they be amenable to analysis using the same search method.
█ FURTHER READING:
Edelstein, Herbert A. Introduction to Data Mining and Knowledge Discovery, Third Editon. Potomac, MD: Two Crows Corporation, 1999.
Han, Jiawei and Micheline Kamber. Data Mining: Concepts and Techniques. New York: Morgan Kaufmann Publishers, 2000.
What You Need To Know About. "Data Mining: An Introduction." About.com. < http://databases.about.com/library/weekly/aa100700a.htm >(17 December 2002).