1. R&D Profile of Knowledge Discovery in Databases and Data Mining
1.1 The Information Sources
"Data mining" (DM) has evolved through long-standing, though varied, use by statisticians, data analysts, and management information systems specialists [5, 10]. "Knowledge Discovery in Databases" (KDD) is a relatively recent field. This "Innovation Forecast" for KDD/DM profiles research and development (R&D) in this domain to help understand the work underway and assess likely future developments.
Our approach is itself a version of KDD/DM -- we apply selected KDD/DM tools to characterize our own field. We begin by searching selected bibliographic databases to identify relevant work; then retrieve records relating to KDD/DM; analyze those records; and strive to represent the results in informative ways. After digesting information from the databases, we scan the Internet (Worldwide Web -- WWW) for additional information (Figure 1).
Figure 1 The Information Sources for this Report
We first searched for R&D activity on "data mining" in several databases (Figure 2). In exploring the DM data set, we noted the extensive use of "knowledge discovery." This led us to search as well on "knowledge discovery in databases" (KDD). In INSPEC, in particular, this yielded 307 records. Of those, 188 appear in common with the DM data set from INSPEC. By comparison, we found only 4 documents with the term "knowledge discovery in databases" in COMP. INSPEC and COMP are the databases that collected many more KDD/DM documents than the other sources (Figure 2). INSPEC is a research-oriented abstract database emphasizing electrical engineering, physical science, and computing, containing some 3,000,000 abstracts for the period 1987-97. COMP, the Computing Index, is more oriented to trade magazines. We infer from this that KDD is a much more popular descriptor in academic research than in industrial circles. Both KDD and DM refer to the nontrivial extraction of implicit, potentially useful information (knowledge rules) from databases [5, 8].
We further probed how similar KDD and DM activities are. Counts of keywords (subject index terms) in the KDD and DM data sets showed that the top 50 keywords in KDD closely match those in DM abstracts. Furthermore, the top 20 authors and top institutions contributing to KDD are quite similar to those in DM. (In another analysis of the conceptually close fields of natural language processing and computational linguistics, we found striking dissimilarity between the respective author and institutional contributors [4].) We then considered what particular papers said about the relation between KDD and DM, deciding that the relationship is indeed close [5]. We therefore consolidated the KDD and DM abstract sets into a unified search set of 694 INSPEC records, upon which most of this report focuses. We pursue some further analyses using two additional search sets -- 680 DM documents from COMP and 121 DM records from SCI (the Science Citation Index). Lastly, we retrieve KDD/DM product information from the Worldwide Web.
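The keyword comparison described above can be sketched roughly as follows. This is only an illustration: the function names and the toy records are hypothetical, and the actual TOA routines are not described in this report.

```python
from collections import Counter

def top_keywords(records, k):
    """Return the set of the k most frequent keywords across records.
    Each record is a list of subject index terms."""
    counts = Counter(kw for record in records for kw in record)
    return {kw for kw, _ in counts.most_common(k)}

def overlap_share(records_a, records_b, k):
    """Share of the top-k keywords that the two record sets have in common."""
    top_a = top_keywords(records_a, k)
    top_b = top_keywords(records_b, k)
    return len(top_a & top_b) / k

# Toy records for illustration only (not taken from INSPEC).
kdd_records = [
    ["knowledge acquisition", "deductive databases"],
    ["deductive databases", "very large databases"],
    ["knowledge acquisition", "deductive databases"],
]
dm_records = [
    ["deductive databases", "very large databases"],
    ["very large databases", "deductive databases"],
    ["very large databases", "knowledge acquisition"],
]

share = overlap_share(kdd_records, dm_records, k=2)
```

In the analysis reported here, k would be 50 for keywords; the same comparison could be repeated for author and institution fields.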
1.2 An Emerging Field --- KDD and DM
KDD/DM is a relatively young field. INSPEC and COMP, which collected many more KDD/DM documents than other sources, hold fewer than 700 documents as of the end of 1997 (Figure 2).
Figure 2 Data Mining Documents in Selected Databases (search term: data adj mining)
Figure 3 indicates that this young field is growing rapidly. Growth rates are high in each of the INSPEC, SCI, COMP, ENGI (Engineering Index or EI Compendex), and BUSI (the Business Index) databases. The growth patterns of documents by year in the five databases are similar -- embryonic from 1992-1994, followed by rapid development from 1995-1997. This concurrent growth across databases ranging in emphasis from basic research to commercial use is unusual. KDD/DM appears to be simultaneously a "hot" research area and an emerging application form.
Figure 3 Documents by years in the Five Databases
Decker and Focardi suggest three factors stimulating the development of KDD and data mining [7]:
Exploration of the KDD and DM work in INSPEC shows it dominated by a group of researchers from machine learning, database management systems, and logic programming (see Figure 12). Separately, checking the KDD/DM documents in COMP, we find that many discuss the commercial development of tools to address the explosive accumulation of data in different application domains. Clearly, there is a strong push behind data mining from the market side in addition to that from academic circles.
Figure 4 positions KDD and data mining with respect to other emerging data handling tools. For decades, good progress has been made at applying statistical analyses to small data sets. Database Management Systems (DBMS) and Information Retrieval (IR) capabilities then advanced data accessibility (1980's). Online Analytical Processing (OLAP) emerged in the 1990's, along with the escalation in Internet access and use. Data mining has rapidly advanced since the mid-1990's. We look forward to another generation of improved data access via national Distributed Information initiatives (e.g., the Digital Library Initiative) and efforts to derive Knowledge from diverse sources across space and time (e.g., the NSF KDI initiative of 1998).
Figure 4 KDD/DM in Relation to Information Techniques
The three trends in Figure 5 support this sense of progression. The number of articles in the trade-oriented database, COMP, addressing DBMS, OLAP, and data mining shows distinct patterning. Attention to DBMS as such is slipping (a maturing technology). OLAP, an extension of DBMS, has been increasing since 1993, with its special database indexing schemes, multidimensional structures, and intuitive access for sophisticated analytical queries. Data mining can be viewed as the next generation of tools to derive valuable "knowledge" from databases.
Figure 5 Comparison of DBMS, OLAP and Data Mining in COMP
1.3 Domain Activities Analysis
Fayyad, Piatetsky-Shapiro, and Smyth broadly outline basic KDD/DM steps[5]:
We reduce this to a simpler picture for portraying KDD and data mining techniques and applications. It includes three basic components (Figure 6):
Figure 6 KDD and Data Mining: Development Framework
This analysis applied our KDD approach and tool suite, called "Technology Opportunities Analysis" (TOA), described elsewhere on this web site. Using TOA, we derive two groups of high frequency terms from the 694-record KDD/DM INSPEC data set. One is the group of keywords (subject index terms); the other, the terms compiled from the records' abstract fields (called NLP terms because semantic and syntactic analysis is applied to ascertain noun phrases). Table A in the appendix lists the 25 most frequent keywords and abstract words. Even simple inspection helps understand aspects of KDD/DM:
Figure 7 Database Types: Occurrences in the KDD/DM Data Set from INSPEC
There are many kinds of data and databases used in different applications. Many applicable databases contain complex data types, such as structured data and complex data objects, hypertext and multimedia data, spatial and temporal data, transaction data, etc. [8]. Figure 7 tells us that "deductive databases" are a prominent target addressed by KDD/DM documents (34% of the INSPEC documents include the term). "Very large databases" are also heavily involved. Relational databases, distributed databases, and visual databases also show moderate occurrence. ("Relational databases" is a common term in DBMS; we have an indicator, N.a.m, which better distinguishes terms more central to KDD and data mining -- see Figure 8.)
Table B in the appendix tallies the application domains' occurrences from the INSPEC data set. This reflects basic list processing operations being applied -- searching for multiple variants of a term, grouping them, and applying thesaurus capabilities. We can see from the table that the application fields of KDD and data mining are quite broad. Applications noted include:
Relative emphasis ("N.a.m") is the ratio of a keyword's frequency of occurrence in the KDD/DM document set to the keyword's occurrences in the overall source database (INSPEC). A higher ratio implies that the technical area is relatively particular to KDD/DM. Figure 8 locates the 36 technical areas which have higher occurrences in the INSPEC data mining set in terms of the size of each domain (total occurrences in the source database INSPEC, Y axis) and relative emphasis in KDD/DM (X axis).
Figure 8 Relative Domain Emphasis in KDD/DM
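The relative emphasis ratio defined above can be computed along these lines. This is a minimal sketch: the function name and the counts are hypothetical and are not INSPEC figures.

```python
def relative_emphasis(subset_counts, source_counts):
    """Ratio of a keyword's occurrences in the KDD/DM subset to its
    occurrences in the whole source database."""
    ratios = {}
    for keyword, n_subset in subset_counts.items():
        n_source = source_counts.get(keyword, 0)
        if n_source > 0:  # skip keywords without a source-database count
            ratios[keyword] = n_subset / n_source
    return ratios

# Hypothetical counts for illustration only.
subset_counts = {"association rules": 40, "relational databases": 60}
source_counts = {"association rules": 50, "relational databases": 12000}

ratios = relative_emphasis(subset_counts, source_counts)
# Keywords sorted from most to least particular to the subset.
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

With counts like these, a term that is frequent in the subset but very common database-wide (here, "relational databases") ranks low, which is the distinction the indicator is meant to draw.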
The relative emphasis indicator shows that association rules, very large databases, deductive databases, knowledge acquisition, rule induction, spatial databases, background knowledge, rough sets, and decision trees are relatively particular to KDD/DM. Excepting "knowledge acquisition," the top ten hot fields in KDD and data mining are rather small research areas based on total INSPEC coverage. This suggests that there is still ample room for newcomers to take part in KDD and data mining; those who want to join the development of the field need not worry much about who is ahead in these domains. On the other hand, domains spreading up the Y axis in Figure 8 are studied mainly outside KDD/DM, so it is vital that researchers investigate work on them beyond the KDD/DM field.
1.4 Domain Mapping through Clustering Techniques
TOA performs co-word analysis to infer underlying relationships in text documents. Such analyses are very useful to the analysis of bibliographic data in the form of field-structured abstract records, in particular, and often focus on keywords. In TOA, we seek first to cluster empirically related terms and then to depict relationships as "technology maps."
The TOA software provides capabilities to "mix and match" a number of grouping and linking algorithms. In addition, we have experimented with a wide range of these (generally speaking) "clustering" approaches. One can examine co-occurrences among various types of terms -- for instance, keywords by authors, or affiliations by year, or whatever is of interest. Here, we focus on keywords by keywords, seeking insights into how these terms conceptually cluster together. One can group these co-occurring entities by such techniques as Latent Semantic Indexing [26, 27], Principal Components Analysis (PCA), factor analysis, hierarchical cluster analysis, or Maximum Likelihood Intensity Similarity analysis [28]. Then, one may want to graphically represent relationships. Techniques such as spanning trees, Pathfinder [29], multi-dimensional scaling, or path-erasing [2] come into play.
We illustrate the genre with a particular two-stage Term Cluster Mapping approach [2]. Basically, the approach
The term cluster mapping process entails three steps (Figure 9):
Figure 9 The Term Clusters Mapping Process
For Step 1, we chose the 106 keywords occurring 5 or more times in the KDD/DM data set, excluding the search terms and direct derivatives of them. We applied PCA -- a well-recognized statistical procedure that generates linear combinations of the input variables (in this case, the occurrence pattern of the 106 keywords across the 694 documents), such that the first such "factor" explains the most possible variance; the second factor, the most possible remaining variance; and so forth. We extracted 30 such factors, then rotated the factors so that keywords would tend to relate either highly or not to each factor (Varimax rotation). We then applied a heuristic to identify the keywords relating closely to each factor -- so-called "high loading" terms. These are listed in Table D in the appendix. We invite the reader to browse through these 30 factors to note that they generally "make sense." These are empirical inductions based on statistical analysis, suggesting the power of the PCA statistical approach to infer relationships.
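Step 1 can be sketched as follows, assuming a documents-by-keywords occurrence matrix. This is not the TOA implementation: the function names, the 0.5 loading threshold, and the synthetic data are all assumptions made for illustration, and the heuristic actually used to pick "high loading" terms is not specified in this report.

```python
import numpy as np

def pca_loadings(X, n_factors):
    """Loadings of the first n_factors principal components of the
    correlation matrix of X (rows = documents, columns = keywords)."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_factors]  # largest variance first
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal Varimax rotation of a loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):  # no further improvement
            break
        criterion = new_criterion
    return loadings @ rotation

def high_loading_terms(rotated, keywords, threshold=0.5):
    """Assumed heuristic: keep keywords whose absolute loading on a
    factor is at least the threshold."""
    return [
        [keywords[i] for i in np.where(np.abs(rotated[:, j]) >= threshold)[0]]
        for j in range(rotated.shape[1])
    ]

# Synthetic 0/1 occurrence data: 200 documents, 5 hypothetical keywords.
rng = np.random.default_rng(0)
topic_a = rng.random(200) < 0.3
topic_b = rng.random(200) < 0.3
X = np.column_stack([
    topic_a,
    topic_a ^ (rng.random(200) < 0.1),  # noisy copy of topic_a
    topic_b,
    topic_b ^ (rng.random(200) < 0.1),  # noisy copy of topic_b
    rng.random(200) < 0.3,              # unrelated keyword
]).astype(float)
keywords = ["term_a1", "term_a2", "term_b1", "term_b2", "term_c"]

loadings = pca_loadings(X, n_factors=2)
rotated = varimax(loadings)
clusters = high_loading_terms(rotated, keywords)
```

In the analysis reported here, X would have 694 rows and 106 columns, and 30 factors would be extracted. Because the rotation is orthogonal, it redistributes loadings among factors without changing how much of each keyword's variance the retained factors explain.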
For Step 2, one must decide how to gauge the similarity of the clusters (the factors). We used a group-average method. For example, consider Factors 4 and 5. We average the Pearson correlations (i.e., a normalized co-occurrence measure of how often the two terms appear together in documents) between each term of Factor 4 and each term of Factor 5. That is, between "visual databases" (one of the two high-loading terms of Factor 4) and "decision theory" (one of the two terms of Factor 5); "visual databases" and "trees" (the other term of Factor 5); "spatial data structures" (the other high-loading term of Factor 4) and "decision theory"; and between "spatial data structures" and "trees." Carried out for all 30 factors, this yields the similarity matrix of clusters (factors).
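The group-average calculation above can be sketched as follows. The function names and the toy occurrence matrix are hypothetical; setting the diagonal of the similarity matrix to 1 is an assumed convention.

```python
import numpy as np

def group_average_similarity(X, factor_a, factor_b):
    """Mean Pearson correlation over every cross pair of terms, one term
    from each factor. X is a documents-by-terms occurrence matrix;
    factor_a and factor_b are lists of column indices."""
    corr = np.corrcoef(X, rowvar=False)
    return corr[np.ix_(factor_a, factor_b)].mean()

def factor_similarity_matrix(X, factors):
    """Symmetric matrix of group-average similarities among all factors."""
    n = len(factors)
    sim = np.eye(n)  # assumed convention: a factor is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = group_average_similarity(
                X, factors[i], factors[j]
            )
    return sim

# Toy data: 4 documents, 4 terms (columns).
X = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

# Two factors with two high-loading terms each, as in the Factor 4 / Factor 5 example.
factors = [[0, 1], [2, 3]]
similarity = factor_similarity_matrix(X, factors)
```

Here the four cross-pair correlations are -1, 0, -1, and 0, so the group-average similarity of the two factors is -0.5.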
Step 3 consists of representing these similarities among factors in a two-dimensional map (Figure 10). Positioning in the map is relative, representing which factors are empirically most central to this data set (KDD/DM). "Centrality" is based primarily upon the strength of similarities to the set of factors. Links shown are based on our "Path Erasing" approach. This is an algorithm that begins with each entity linked to every other, then removes links to a designated threshold level. The level can be adjusted; the intent is to convey the main relationships. (In other words, the absence of a link does not indicate total independence, rather notably less association.) Figure 10 shows "Deductive Databases" and "Distributed Databases" as prominent centers, themselves linking particularly through "Relational Databases." Note that these are factors, not individual terms. "Deductive Databases" is our name for Factor 1, composed of two terms -- "knowledge acquisition" and "deductive databases." In other words, Figure 10 is showing relations among clusters of keywords, not among the keywords per se.
Figure 10 Term Clusters Map
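The linking and centrality ideas behind the map can be illustrated with a simplified sketch. It starts with every pair of factors linked and erases links below a threshold, and it scores centrality as the summed similarity to all other factors. The actual Path Erasing algorithm [2] may differ in its details; the function names, factor labels, and similarity values here are hypothetical.

```python
from itertools import combinations

def erase_weak_links(similarity, names, threshold):
    """Start fully linked, then drop links whose similarity is below
    the threshold. Returns the surviving links and their strengths."""
    links = {
        (a, b): similarity[i][j]
        for (i, a), (j, b) in combinations(enumerate(names), 2)
    }
    return {pair: s for pair, s in links.items() if s >= threshold}

def centrality(similarity, names):
    """Summed similarity of each factor to all other factors."""
    return {
        name: sum(row) - row[i]  # exclude the self-similarity on the diagonal
        for i, (name, row) in enumerate(zip(names, similarity))
    }

# Hypothetical similarity matrix among three factors.
names = ["F1", "F2", "F3"]
similarity = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]

kept = erase_weak_links(similarity, names, threshold=0.3)
scores = centrality(similarity, names)
```

Raising or lowering the threshold changes how many links survive. An erased link (here F1 to F3) signals weaker association, not independence.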
Examination of the similarity matrix of factors leads us to suggest two major domains within this KDD/DM research. Domain "A" centers around "Deductive Databases." Some of the individual keywords (note that these are not shown in Figure 10, which shows only keyword clusters) correlating strongly include:
Domain A appears more research-oriented; Domain B, more application-oriented.
The high-loading terms of "Deductive Databases" are much more prevalent than those of "Distributed Databases" in the INSPEC search set (238 vs. 34 documents represented, of 694), suggesting that "deductive databases" is a hot area in KDD and data mining. We also found that "very large databases" is the term most closely related to "deductive databases," suggesting that "deductive databases" could have strong connections and important applications in the future of KDD and data mining, although it is still research-driven right now. Domain B -- "Distributed Databases" -- appears to be an emerging emphasis for KDD/DM, possibly worth special attention by the field.
1.5 Who are the main players in KDD/DM right now?
There are many authors contributing to this KDD/DM literature. After applying TOA fuzzy matching and thesaurus routines, with our review, to combine variants of the same name, we identify 1142 authors for these 694 articles in INSPEC -- obviously multiple authorship is common practice. Table C of the appendix shows the 23 most prolific authors, with 6 or more KDD/DM publications each. The prolific authors in the DM data set retrieved from the COMP database differ markedly from those in the INSPEC KDD/DM data set. This suggests that there are two distinct groups of authors in data mining: one in research circles, the other in commercial product development.
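Combining name variants might be approached as in the sketch below. It uses Python's standard difflib as a stand-in, since the TOA fuzzy matching and thesaurus routines are not described here. The 0.85 threshold, the normalization rules, and the sample name variants are assumptions, and any such grouping still needs manual review.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a name and strip periods, commas, and extra spaces."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())

def merge_name_variants(names, threshold=0.85):
    """Greedily group names whose normalized forms are similar to the
    first member of an existing group."""
    groups = []
    for name in names:
        for group in groups:
            score = SequenceMatcher(
                None, normalize(name), normalize(group[0])
            ).ratio()
            if score >= threshold:
                group.append(name)
                break
        else:  # no existing group was close enough
            groups.append([name])
    return groups

# Hypothetical variants of author names as they might appear in records.
names = ["Fayyad, U.M.", "Fayyad, U. M.", "Fayyad U M", "Mannila, H."]
groups = merge_name_variants(names)
```

A purely string-based match like this can wrongly merge distinct authors who share a surname and initials, which is one reason the grouping is reviewed by hand.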
Figure 11 The Relationship of the Prolific Authors in INSPEC and the Frequently Cited Authors in SCI
By analyzing the DM data set of SCI, we found that:
On an individual level, we have linked our homepage to a number of the leading authors' homepages. One could explore the Worldwide Web to note that Fayyad has moved from Caltech's Jet Propulsion Lab to Microsoft. Such exploration offers another angle on the backgrounds relating to KDD/DM. For instance, Mannila [6] notes that KDD/DM draws upon machine learning, statistics, and databases; Klemettinen et al. [11] point toward the interface of computer science and statistics in KDD/DM. The prolific authors' self-descriptions suggest strong ties to machine learning (for 9 of the 23 prolific authors), database systems (8), and various artificial intelligence areas (5). Interestingly, none of these prominent KDD/DM authors designate themselves as statisticians (Figure 12).
Figure 12 Research Backgrounds of Prolific KDD/DM Authors
Will statisticians join KDD and data mining in the future? Or will statisticians pursue efforts similar to KDD/DM, just using some other terminology? For statisticians, economists, and other quantitative researchers, "data mining" is a pejorative term. It refers to the practice of selectively trying to find data that will support a particular hypothesis [41]; it is usually possible to find data to support any theory.
We next pursue the institutional affiliations of the authors. In all, 398 affiliations are identified for the 694 INSPEC articles; those associated with the most publications are listed in Tables E and F in the appendix. Some 27% of the contributions are from companies, led by IBM. The interests of specific companies can be probed further by searching for websites (e.g., http://www.almaden.ibm.com/almaden/). One can pursue these explorations in many ways. For instance, inspection of the affiliations of the most prolific authors shows a similar split between company (22%) and academic (74%) associations. This suggests that KDD/DM research is predominantly academic, but with significant industry involvement.