TPAC: Technology Policy and Assessment Center
 
 

3. Technology Opportunities and Competitive Analysis

Examination of the set of 36 technical areas prominently associated with KDD/DM articles yields additional insights. First, there is a general trend for topics that engage industry researchers to be fast growing (see Figure 18). Two groups of technical areas deviate from this general trend. The first group (Group I in Figure 18) shows relatively slow growth despite considerable industry involvement (Pc.p.p): data analysis, information retrieval, relational databases, expert systems, and knowledge based systems. This suggests that these areas may be maturing or that research on them is slowing, although some still promise commercial potential. The second deviant group (Group II in Figure 18) shows a relatively high growth rate but lacks strong industry participation. (Within this group, "rough sets" is still small, so it might well attract industry attention if activity increases.)

Figure 18 The Correlation of Company Participation and Domain Growth Rate
Figure 19 The Development Patterns of the Domains with High Growth Rate

Focusing on these 12 terms that show particularly compelling growth, we now divide them according to the extent of industry involvement in R&D for each. The rationale is that increasing industry involvement reflects increasing commercialization opportunity. Figure 19 locates each technical area by its growth rate (Y axis) and industrial involvement (X axis). We observe:

  1. "Data warehouse" is apparently mainly driven by companies at present.
  2. Research on "association rules," "business data processing," and "very large databases" (group II in Figure 19) is driven by both companies and academic units, but companies are notably active in these three areas.
  3. "Data visualization," "visual databases," "spatial databases," and "KDD/data mining" are pushed by both industrial and academic units.
  4. "Rough sets," "genetic algorithms," and "pattern classification" are currently pushed largely by academic units.
  5. "World wide web" shows an extremely high growth rate, with both companies and academic units paying great attention.
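
The grouping above can be illustrated in code. The following is a minimal sketch, assuming each technical area is summarized by yearly record counts and by the number of records with at least one company affiliation. The `AreaStats` layout, the growth measure (later-half records divided by earlier-half records), the cut-off values, and the sample numbers are all hypothetical; they are not the measures or data behind Figure 19.

```python
from dataclasses import dataclass


@dataclass
class AreaStats:
    """Hypothetical summary of one technical area."""
    name: str
    counts_by_year: dict   # year -> number of records
    industry_records: int  # records with at least one company affiliation


def growth_rate(counts_by_year):
    """Records in the later half of the period divided by the earlier half.

    Assumes an even number of years so both halves cover equal time spans.
    """
    years = sorted(counts_by_year)
    half = len(years) // 2
    early = sum(counts_by_year[y] for y in years[:half])
    late = sum(counts_by_year[y] for y in years[half:])
    return late / early if early else float("inf")


def industry_share(area):
    """Fraction of an area's records that involve a company."""
    total = sum(area.counts_by_year.values())
    return area.industry_records / total if total else 0.0


def classify(area, growth_cut=2.0, industry_cut=0.3):
    """Place an area in one of four groups (cut-offs are illustrative)."""
    fast = growth_rate(area.counts_by_year) >= growth_cut
    industrial = industry_share(area) >= industry_cut
    growth_label = "fast growth" if fast else "slow growth"
    sector_label = "industry active" if industrial else "mainly academic"
    return f"{growth_label}, {sector_label}"


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    sample = [
        AreaStats("area A", {1993: 2, 1994: 3, 1995: 8, 1996: 12}, 10),
        AreaStats("area B", {1993: 10, 1994: 10, 1995: 11, 1996: 9}, 4),
    ]
    for area in sample:
        print(area.name, "->", classify(area))
```

Changing the cut-offs moves areas between groups, so any such grouping should be read together with the plotted positions rather than as a fixed classification.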

Figure 20 locates the 36 technical areas, as just discussed, in terms of relative industry involvement (Y axis) and relative emphasis in KDD/DM (X axis). The two technical areas most concentrated in the KDD/DM domain -- "association rules" and "very large databases" -- both show especially strong industry participation. IBM is notably active in publishing on "association rules"; a number of companies are publishing aggressively on "very large databases" (IBM, AT&T, Microsoft, Thinking Machines, SAS Institute, Oracle, MCC). Interestingly, the remaining KDD/DM technical areas are mainly academic, with the exception of "business data processing" and the striking outlier, "data warehouse." This suggests that many of the basic KDD/DM approaches and techniques are still predominantly being addressed in academia. Industry might want to track developments in these domains with special attention, in order to identify early opportunities for commercial application.

Figure 20 Relative Industry Involvement and KDD/DM Concentration

Figures 21 and 22 portray, along the X axis, the percentage of documents in INSPEC (R&D database) and COMP (trade database) that mention "product(s) or software" -- a candidate indicator of commercial readiness. The number of documents noting a given topic is shown both by circle size and by Y axis position -- redundant for emphasis. For example, the very large number of articles in COMP on "relational databases," "distributed databases," and "object-oriented databases" is reinforced by the extensive mention of "product(s) or software" in those articles -- i.e., they appear in the upper right quadrant of Figure 22.
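
A rough sketch of how such an indicator could be computed follows. It assumes each record is available as a (topic, text) pair and that a simple keyword match on "product", "products", or "software" stands in for the actual search; the function name, the pattern, and the sample records are hypothetical and are not taken from the INSPEC or COMP searches.

```python
import re

# Matches "product", "products", or "software" as whole words.
PRODUCT_PATTERN = re.compile(r"\b(products?|software)\b", re.IGNORECASE)


def commercial_readiness(records):
    """Return {topic: (number of documents, percent mentioning the terms)}.

    `records` is an iterable of (topic, text) pairs.
    """
    totals = {}
    hits = {}
    for topic, text in records:
        totals[topic] = totals.get(topic, 0) + 1
        if PRODUCT_PATTERN.search(text):
            hits[topic] = hits.get(topic, 0) + 1
    return {
        topic: (n, 100.0 * hits.get(topic, 0) / n)
        for topic, n in totals.items()
    }


if __name__ == "__main__":
    # Made-up abstracts for illustration only.
    sample = [
        ("visual databases", "A new software product for image retrieval"),
        ("visual databases", "Indexing theory for image collections"),
        ("deductive databases", "Logic programming semantics"),
    ]
    for topic, (n_docs, pct) in commercial_readiness(sample).items():
        print(f"{topic}: {n_docs} documents, {pct:.1f}% mention product(s) or software")
```

The document count corresponds to circle size and Y axis position, and the percentage to the X axis position, in plots of the kind described above.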

Figure 21 Domains' size vs. Commercial Development (from INSPEC)

"Visual databases" illustrates the potential of these depictions. From Figure 21 we note that a lot of R&D is being directed to this topic. From Figure 22, we note that while current trade publication attention to "visual databases" is limited, what there is strongly points to interest in "product(s) or software."

Figure 22 Domains' size vs. Commercial Development (from COMP)

"Deductive databases" presents another intriguing case. In Figure 21, the INSPEC R&D data, it profiles much like "visual databases" -- a lot of activity with moderate pointers toward "product(s) or software." But in Figure 22, COMP shows minimal attention to the topic, implying a lack of commercial interest.


3.2 Data Mining tools from the WWW
The number of host computers on the Internet has leapt from about 200 in 1980 to over 10 million in 1996. The challenge for us is whether we can mine "innovation forecasting" knowledge from this huge source of information.

Table G in the appendix lists 140 KDD/DM tools found on the WWW. In INSPEC and COMP we have fewer than 700 research articles abstracted on KDD/DM, but we have identified 140 KDD/DM tools from the WWW! (Most are from "Knowledge Discovery Nuggets," a well-known web page on KDD/DM.) We infer from this that many companies that have developed data mining tools may not contribute significantly to the research literature.

Figure 23 partitions tool-oriented domains into industrially emphasized and academic. The tools relating to "multiple discovery tasks," "classification," and "visualization" form the mainstream of commercial development. Tools for "link analysis," "clustering," and "summarization" are heavily academic. Figure 23 suggests that the tools in "text mining" and "summarization" are still at a rudimentary stage of development.

Figure 23 KDD/DM Tools: Commercial vs. Non-profit

Figure 24 looks into the techniques on which classification tools are based. We found that the "neural network" and "decision tree" approaches are the main bases of commercial tools in classification, and the tools based on "fuzzy logic" and "rough set" approaches are still primarily academic.

Figure 24 The approaches on which "Classification" tools are based:
Company-developed vs. Non-profit-developed


Figure 25 surveys the application fields for which the KDD/DM tools are developed. "Marketing," "banking," and "scientific research" are the main application domains for KDD/DM tools.

Figure 25 KDD/DM tools for different application domains

3.3 Competition Analysis

The COMP search set identifies 159 companies associated with the "data mining" articles. Figure 26 shows the most active 18 companies. Of these, IBM, Microsoft, and Silicon Graphics also publish actively on KDD/DM (INSPEC search results -- Table E in the appendix).

The companies associated with data mining appear to fit three categories:

  1. large information technology companies (e.g., IBM, Oracle, Microsoft)
  2. relatively new, small companies whose main attention appears to be on data mining
  3. companies applying data mining in their business (e.g., Wal-Mart, MCI)

Figure 26 The Leading Companies with Data Mining from COMP

One can track the development of data mining tools within a company. Figure 27 summarizes IBM's data mining profile, based on compilation from the COMP articles (Table H in the appendix details this). Presently, these data are compiled by hand, but we intend to partially automate the process in future TOA generations.

Figure 27 A Slide of IBM Commercial Effort in Data Mining

Determining the thrust of the small, new companies is more difficult; they are not covered with great frequency in COMP. To point out companies highly focused on data mining, we experiment with indicators based on emphasis. Figure 28 reflects the results of searching in COMP for all items associated with companies that have been active in data mining. NeoVista Solutions Inc. clearly is a data mining company.

Figure 28 The Percentage of Documents on Data Mining by a Company to all the Documents by the Company in COMP

In contrast, note that only a small proportion of IBM's activities link to data mining (0.13%). Figure 29 presents a different perspective -- note the dramatic upturn in IBM emphasis on data mining in 1995. NeoVista and other would-be data mining enterprises beware.
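
The emphasis indicators behind these two views can be sketched as simple ratios. The example below is a minimal illustration, assuming each record has been reduced to a (company, year, is_data_mining) triple; the function names and the sample records are hypothetical and do not reproduce the COMP data.

```python
from collections import Counter


def emphasis_by_company(records):
    """Percent of each company's records that concern data mining.

    `records` is an iterable of (company, year, is_data_mining) triples.
    """
    total = Counter()
    mining = Counter()
    for company, _year, is_dm in records:
        total[company] += 1
        if is_dm:
            mining[company] += 1
    return {c: 100.0 * mining[c] / total[c] for c in total}


def emphasis_by_year(records, company):
    """Percent of one company's records per year that concern data mining."""
    total = Counter()
    mining = Counter()
    for c, year, is_dm in records:
        if c != company:
            continue
        total[year] += 1
        if is_dm:
            mining[year] += 1
    return {y: 100.0 * mining[y] / total[y] for y in sorted(total)}


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        ("Company A", 1994, False),
        ("Company A", 1994, False),
        ("Company A", 1995, True),
        ("Company A", 1995, False),
        ("Company B", 1995, True),
        ("Company B", 1996, True),
    ]
    print(emphasis_by_company(sample))
    print(emphasis_by_year(sample, "Company A"))
```

A small, focused company scores high on the first ratio, while a large diversified company can score very low overall yet still show a marked rise on the second.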

Figure 29 IBM Commercial Effort in Data Mining -- Percentage of Data Mining Documents in all the Documents Which Related to IBM

Table I in the appendix collects software price information from the COMP records. While one would not think of COMP as a source of pricing information, this does help determine the ballpark. The high-end or server data mining software licenses are appearing in the $30,000 to $150,000 range (definitely not mass market). Desktop software ranges strikingly from $500 to $50,000.

Commercial interpretation of data mining differs somewhat from that of the research community that we offered earlier. The commercial perspective poses data mining as the next step beyond OLAP for querying data warehouses. Data mining goes beyond routine querying to sift through information to elicit underlying relationships. Table J in the appendix offers a sampling of observations on the markets for data mining. Potential is huge, but some worry about putting data mining to use [39]. Developing this market will be challenging as customers are largely unfamiliar with the technology and legacy data tuning are not readily warehoused [40].

Conclusions

This report shows how TOA can be used to profile and assess developments in an emerging technology -- Knowledge Discovery in Databases (KDD) and Data Mining. This working document offers a variety of new innovation indicators that can contribute to managing interests in this technology. It also illustrates the potential of TOA in analyzing a complex, evolving research domain. Such analyses can focus on
  • monitoring recent R&D activity
  • profiling who is doing what (e.g., as competitive technical intelligence)
  • forecasting promising developments (e.g., modeling trends)
  • assessing implications (e.g., to fulfill particular functional needs, to penetrate particular markets)
In brief, the report addresses who is doing what in KDD and Data Mining. It finds this is an emerging R&D area with distinct commercial prospects -- it is hot! Exploration of the component areas helps map the important elements and how they fit together. Assessment of the activity trends in different areas points to opportunities, particularly in bringing together work in nearby areas.

We identify the most active players in KDD and Data Mining, and provide information on their interests. We call attention to the roles of IBM and Microsoft.

We invite your suggestions. The Technology Policy and Assessment Center continues to develop the TOA approach. We work with Search Technology, Inc., to develop the software. We partner with IISC to analyze topics of interest by applying TOA tools to bibliographic data sources in conjunction with other sources of information (e.g., expert opinion).

The report is available in "pdf" format.