2. Innovation Context Indicators
We introduce a set of derived indicators intended to help understand the status and prospects for ongoing innovation (development toward application) of the topic under study -- in this case, KDD/DM. These include:
2.1 Communication Need - an Indicator of How "Hot" a Domain Is
Rc.j(domain, t) is the ratio of conference papers to journal papers in a domain. For example, Rc.j(neural net, 1990) is the ratio of neural net conference papers to neural net journal papers in 1990; Rc.j(neural net, 1992-1997) is the ratio of neural net conference papers to neural net journal papers from 1992 to 1997.
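As a minimal sketch of how Rc.j might be computed from a set of bibliographic records, assume each record carries a publication year, a document type, and a set of keywords. The `Record` type, its field names, and the toy data are hypothetical illustrations, not the INSPEC schema or the procedure actually used in this study.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record; field names are illustrative only."""
    year: int
    doc_type: str        # assumed to be "conference" or "journal"
    keywords: frozenset  # index terms attached to the record


def rcj(records: Iterable[Record], domain: str, start: int, end: int) -> float:
    """Ratio of conference papers to journal papers for `domain` in [start, end]."""
    conference = journal = 0
    for r in records:
        if domain in r.keywords and start <= r.year <= end:
            if r.doc_type == "conference":
                conference += 1
            elif r.doc_type == "journal":
                journal += 1
    if journal == 0:
        raise ValueError("no journal papers in range; ratio is undefined")
    return conference / journal


# Toy data, invented purely to exercise the function.
toy = [
    Record(1990, "conference", frozenset({"neural net"})),
    Record(1990, "conference", frozenset({"neural net"})),
    Record(1990, "conference", frozenset({"neural net"})),
    Record(1990, "journal", frozenset({"neural net"})),
    Record(1990, "journal", frozenset({"neural net"})),
    Record(1991, "journal", frozenset({"database"})),
]
print(rcj(toy, "neural net", 1990, 1990))  # 3 conference / 2 journal = 1.5
```

Passing the same year for `start` and `end` gives the single-year form, Rc.j(domain, 1990); a wider range gives the multi-year form, such as Rc.j(domain, 1992-1997).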
Rc.j suggests a need for "instant" communication in a given domain. One interesting pattern we have verified is that Rc.j tracks how "hot," or active, a domain is. For example, "Very large database(s)" is currently a hot sub-domain of "database(s)," so we would expect Rc.j(very large database(s)) to be higher than Rc.j(database(s)) -- it is: 1.68 vs. 1.17.
We take Rc.j(INSPEC) = 0.528 as the average for all records in INSPEC (2.9 million for 1986-97) -- i.e., the ratio of total conference papers to total journal papers in INSPEC. If Rc.j(domain) is below this average, it usually indicates a less active domain, such as "statistics" -- Rc.j(Statistics) = 0.506 (see Figure 13). If Rc.j of a domain is much higher than the average, it could be an extremely "hot" research area. However, we recognize that the norms of publication via conferences and journals vary considerably among fields, so this measure must be interpreted with some care.
Figure 13 The Ratio of Conference Papers to Journal Papers in Different Domains
In recent years, KDD/DM researchers have been more inclined to present their work at conferences than in archival journals. The Rc.j for the KDD/DM data set is unusually high -- 3.0 (Figure 13). In addition to the comparative values shown in Figure 13, we mention values of 0.578 for "Decision Theory," 5.28 for "Spatial Databases," 4.64 for "Association Rules," 2.46 for "Visual Databases," 2.39 for "Deductive Databases," and 2.35 for "Genetic Algorithms." We interpret this to suggest that KDD/DM is a relatively "hot" research area in which speed in disseminating results is of the essence.
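To make the comparison against the INSPEC average concrete, the sketch below ranks the Rc.j values quoted in this section and expresses each as a multiple of the 0.528 baseline. The numbers are taken from the text; the helper name `relative_to_average` and the output layout are illustrative choices only.

```python
INSPEC_AVERAGE = 0.528  # Rc.j(INSPEC), 1986-97, as quoted in the text

# Rc.j values quoted in this section.
rcj_values = {
    "Spatial Databases": 5.28,
    "Association Rules": 4.64,
    "KDD/DM": 3.0,
    "Visual Databases": 2.46,
    "Deductive Databases": 2.39,
    "Genetic Algorithms": 2.35,
    "Very Large Databases": 1.68,
    "Databases": 1.17,
    "Decision Theory": 0.578,
    "Statistics": 0.506,
}


def relative_to_average(value: float, baseline: float = INSPEC_AVERAGE) -> float:
    """Express an Rc.j value as a multiple of the baseline average."""
    return value / baseline


# Sort from most to least conference-heavy.
ranked = sorted(rcj_values.items(), key=lambda kv: kv[1], reverse=True)
for domain, value in ranked:
    print(f"{domain:<22} {value:>6.3f}  {relative_to_average(value):5.2f}x")
```

On these figures, "Spatial Databases" sits at roughly ten times the INSPEC average, KDD/DM at between five and six times, and "Statistics" just below it.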
In order to compare domains over time, Figure 14 shows the general Rc.j(INSPEC, 1989-1997) and the Rc.j("neural net," 1989-1997), with Rc.j(KDD/DM). The "neural net" domain shows a gradual decline in emphasis on conference papers, consistent with a still very active, but maturing, research area.
Figure 14 The ratio of conference papers to journal papers in time series
Rc.j(Data Mining/KDD, 1995) = 8.14, which is extremely high. We may infer that KDD/DM was strongly promoted in 1995 through many conferences. The year 1995 may be a milestone in the history of KDD/DM. We did find that KDD and data mining entered a phase of rapid development in 1995 (Figure 2). KDD and data mining may have become more formally recognized terms since 1995. Integrated with other evidence, Rc.j can help identify the different phases of a technological innovation process.
Although Rc.j(Data Mining/KDD) has dropped over the last two years, it is still at a high level -- 3.4 in 1996 and 2.03 in 1997 -- which is more than 4 times the average Rc.j(INSPEC, 1997).
Figure 15 provides perspective on how "hot" the most relevant KDD/DM domains are. Figure A in the appendix rank orders all the prominent KDD/DM domains.
"Spatial databases" and "Association rules" have become prominent in KDD/DM research recently; both are quite "hot" by the Rc.j measure. In contrast, "Statistical analysis" and "Decision theory" appear to be much less active domains.
Figure 15 Hot Domains Vs. Less Active Domains
2.2 Teaming vs. Individual Research
A high Na.p.p (number of authors per paper) for a domain implies that team research predominates, suggesting application, experimental, or interdisciplinary group research. A low Na.p.p suggests more individualistic, theoretical efforts. The KDD/DM data set averages 3.23 authors per paper. For comparison, we retrieved three other data sets from INSPEC: "Database Systems," "Machine Learning," and "Statistical Analysis." These, respectively, average 3.43, 3.18, and 2.94 authors per paper. KDD/DM collaboration patterns lie in the middle range, quite similar to those for "Machine Learning" (see Figure B in the appendix). Figure C in the appendix gives the Na.p.p for the 36 high frequency keywords in the KDD/DM data set.
It appears that neural nets, expert systems, genetic algorithms, data visualization, and association rules are domains in which interdisciplinary groups, team cooperation, or experimental efforts dominate. In contrast, research on data warehouses, business data processing, and the world wide web seems to be dominated by individual contributions.
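The Na.p.p measure is a simple mean, and can be sketched as follows. The input format (one list of author names per paper) and the toy data are assumptions made for illustration; they do not reflect how the INSPEC records were actually processed.

```python
from typing import Sequence


def napp(author_lists: Sequence[Sequence[str]]) -> float:
    """Average number of authors per paper (Na.p.p) for a set of papers."""
    if not author_lists:
        raise ValueError("no papers supplied; average is undefined")
    return sum(len(authors) for authors in author_lists) / len(author_lists)


# Toy data, invented for illustration: four papers with 3, 1, 2, and 4 authors.
toy_papers = [
    ["A", "B", "C"],
    ["D"],
    ["E", "F"],
    ["G", "H", "I", "J"],
]
print(napp(toy_papers))  # (3 + 1 + 2 + 4) / 4 = 2.5
```

Applying the same function to the subset of papers carrying a given keyword would yield a per-keyword value of the kind reported in Figure C.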
Figure 16 The Relationships Between the Size of Domains and Average Research Team Size
Figure 16 shows the size of domains (number of publications indexed in INSPEC) and the size of research teams. The range is wide. Smaller areas tend not to have as much teaming, but even among these there is tremendous variability. "Neural nets" leads in average number of authors per paper (almost 4.5). The domains with the smaller team sizes are "data warehouse," "business data processing," "world wide web," "information retrieval," "rough sets," "rule induction," "very large databases," "decision theory," "computational complexity," "background knowledge," and "statistical analysis."
2.3 Size vs. Growth Rate
INSPEC keywords provide one way to get at related technical topics. We focus now on the leading 36 keywords in the KDD/DM data set. (One might prefer to use the 30 factors, composites of the keywords, but we chose the pre-established individual keywords to keep the analysis simpler.) Two key aspects of technical topics in characterizing a research domain are the Size (total activity) and Growth Rate (change in that activity over time). For each of these 36 keywords, we examine its profile in the overall INSPEC database. We group them into eight categories based on Intensity and Growth Rate, the most interesting being:
Group III contains only one domain, genetic algorithms, which is medium-sized with a high growth rate. "Genetic algorithms" is associated with the field of machine learning. Compared with neural nets, another sub-field of machine learning, genetic algorithms is smaller but faster growing.
Figure D and Figure E in the appendix show the Tn.p (total number of papers) and Ar.p.p (growth rate) of the 36 high frequency keywords in the KDD/DM data set. These range by almost 3 orders of magnitude in size and by a factor of 4 in growth rate.
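The two quantities can be illustrated with a short sketch. Tn.p is simply a sum of yearly paper counts. The text does not define Ar.p.p beyond "growth rate," so the function below assumes, purely for illustration, the mean year-over-year relative change in paper counts; the actual definition used for Figures D and E may differ. The function names and toy counts are likewise hypothetical.

```python
from typing import Dict


def total_papers(counts_by_year: Dict[int, int]) -> int:
    """Tn.p: total number of papers across all years."""
    return sum(counts_by_year.values())


def mean_annual_growth(counts_by_year: Dict[int, int]) -> float:
    """ASSUMED growth-rate definition: mean relative change between
    consecutive years in the data. Not taken from the source text."""
    years = sorted(counts_by_year)
    rates = []
    for prev, cur in zip(years, years[1:]):
        base = counts_by_year[prev]
        if base > 0:  # skip years with no papers to avoid dividing by zero
            rates.append((counts_by_year[cur] - base) / base)
    if not rates:
        raise ValueError("need at least two years with a nonzero base count")
    return sum(rates) / len(rates)


# Toy yearly counts for one keyword, invented for illustration.
toy_counts = {1994: 100, 1995: 150, 1996: 300}
print(total_papers(toy_counts))        # 100 + 150 + 300 = 550
print(mean_annual_growth(toy_counts))  # mean of 0.5 and 1.0 = 0.75
```

Plotting each keyword's total against its growth rate would give a size-versus-growth view of the kind shown in Figure 17.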
Figure 17 Domain Size vs. Growth Rate