TPAC: Technology Policy and Assessment Center
 
 

Technology Mapping --- An Application on the Internet Domain

Donghua Zhu, Alan Porter, TPAC, 02/07/99

Figure B Team Players or Individual Efforts? Comparison in Four Domains

We have developed a partly automated approach to generate a family of "technology maps." This enables us to extract, then visualize, relationship patterns based on topical searches conducted in scientific and technological abstract databases.

The "Bigmap" program, based on this combination of new mapping algorithms, automatically extracts and transforms "co-occurrence" information from large sets of abstract records. Co-occurrence is usually based on the patterns of terms occurring together in the documents. Terms might be "keywords" (subject index terms) or noun phrases generated from titles or abstracts using our natural language processing (NLP) routine. Bigmap provides six different relationship maps:

  • factors (groupings of topics) map;
  • keywords map;
  • affiliations map;
  • authors map;
  • countries map;
  • sources (e.g., journals) map.
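The co-occurrence extraction described above can be illustrated with a minimal sketch. This is not Bigmap's actual code: the function name `cooccurrence_counts` and the sample records are hypothetical, and the sketch assumes each abstract record has already been reduced to a list of terms.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how many records contain each unordered pair of terms.

    `records` is an iterable of term lists, one list per abstract record
    (an assumed input format, not Bigmap's own).
    """
    counts = Counter()
    for terms in records:
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical records for illustration only.
records = [
    ["internet", "security of data", "transport protocols"],
    ["internet", "transport protocols"],
    ["security of data", "transport protocols"],
]
pairs = cooccurrence_counts(records)
```

Pair counts of this kind are one plausible raw input for the relationship maps listed above; the same counting could be applied to authors, affiliations, countries, or sources in place of terms.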

The method further provides visualization solutions that make it possible to create maps in Microsoft Word or PowerPoint files using Visual Basic macros.

Here is an example that demonstrates an application of Bigmap to the topic of the "internet." Determining and visualizing the latent relationships and relative importance of different elements can support technology management. These results are based on a collection of 2871 abstract records generated by searching for the term "internet" and the year 1998 in the INSPEC database. INSPEC is a large compilation of abstracts of journal and conference papers in the general areas of electrical engineering, computing, and the physical sciences. It is produced by IEE and is available in various ways (e.g., through "Dialog" or by subscription).

  1. Factors map -- represents the relationships among conceptual clusters (factors generated using principal components analysis, reducing to high-loading terms). An example factors map of the internet data set is shown as figure 1.
  2. Keywords map -- represents the relationships among frequently occurring subject index terms. An example keywords map of the internet data set is shown as figure 2.
  3. Affiliations map -- represents the relationships among affiliations' research topics, based on terms they use in their documents (figure 3).
  4. Authors map -- represents the relationships among authors, based on the commonality of their research topics, that is, on the terms used in their abstracts (figure 4).
  5. Countries map -- depicts the relationships among countries, based on terms they most often use in their documents. Mapping of countries can display the main interests and topics of the countries generating the most articles on the topic (the "internet"). (Figure 5a shows a 2-dimensional representation; figure 5b shows three dimensions.)
  6. Sources map -- depicts the relationships among document sources, such as journals and conferences. Source mapping can show the main interests and topics covered by leading journals and conferences. It "links" sources based on commonalities of the frequently occurring keywords in the published papers. Source mapping can help users rapidly find key sources on particular sub-topics (figure 6).
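Several of the maps above relate entities (authors, affiliations, countries, sources) through the terms they share. The text does not state which similarity measure Bigmap uses, so the following sketch assumes cosine similarity over term-frequency profiles purely for illustration. The function names and sample profiles are hypothetical.

```python
import math
from collections import Counter


def term_profile(records):
    """Build a term-frequency profile from an entity's records (lists of terms)."""
    return Counter(term for terms in records for term in terms)


def cosine_similarity(p, q):
    """Cosine similarity between two term-frequency profiles.

    Assumed measure for illustration; not confirmed as Bigmap's method.
    """
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile shares nothing with any other profile
    return dot / (norm_p * norm_q)


# Hypothetical authors and terms.
author_a = term_profile([["transport protocols", "security of data"]])
author_b = term_profile([["transport protocols", "security of data"]])
author_c = term_profile([["software tools"]])

same_topics = cosine_similarity(author_a, author_b)
no_overlap = cosine_similarity(author_a, author_c)
```

Under this assumption, entities with identical term usage score close to 1 and entities with no shared terms score 0; a map would then place or link entities according to such pairwise scores.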

In these six example maps, we used 1) the top 217 terms, excluding the search term "internet"; 2) 15 factors; 3) the top 20 keywords in the terms map; 4) the top 15 affiliations; 5) the most prolific 15 authors; 6) the leading 15 countries; and 7) the top 16 sources. [The internet data set is "inter98" in Donghua's tpac account.]

The six example technology maps were made by extracting and representing co-occurrences and correlative information in the data set. Other technology information resides within and outside the search data set. We believe that additional insightful representations can be produced by mining that additional information. A test program called "IM" (Indicator Mapping) has run successfully on the GT Unix machine. It can automatically extract information from GTEL and produce an indicator, "Na.m." Na.m is the normalized association measure of a domain or term with a given data set in a certain period. It measures the domain's or term's relative association with a given data set (see our report "TOA in Data Mining"). Na.m is the ratio of a keyword's frequency of occurrence in the user's data set to the keyword's occurrences in the overall source database. A higher Na.m implies that the technical area is relatively particular to the user's data set. A lower Na.m implies that the term is a "universal" term that may be taken as a noise term. This is the first such innovation indicator from TOA to be generated automatically.
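As a rough illustration of the ratio just described, the sketch below computes Na.m from two counts. It is not the IM program: the function name `nam` and the counts are hypothetical, and any further normalization IM may apply is not described here and is therefore omitted.

```python
def nam(count_in_dataset, count_in_database):
    """Ratio of a keyword's occurrences in the user's data set to its
    occurrences in the overall source database.

    A plain ratio is assumed; IM may normalize further.
    """
    if count_in_database <= 0:
        raise ValueError("keyword must occur in the source database")
    return count_in_dataset / count_in_database


# Hypothetical counts: (occurrences in data set, occurrences in database).
specific_term = nam(50, 200)       # concentrated in the user's data set
universal_term = nam(50, 50000)    # spread across the whole database
```

With these made-up counts the first term has the higher ratio, which under the interpretation above marks it as more particular to the data set, while the second behaves like a "universal" term.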

Recently we found that log(Na.m) can be plotted against log(Domain Size) to yield an informative two-dimensional map. This can help distinguish noise terms from more valuable terms in the user's data set. Figure 7 shows an example map for the "internet" data.

In figure 7, terms that lie below the "diagonal" measure and are very large in size -- such as "information technology" or "software tools" -- can be taken as noise terms from a semantic view. On the other hand, terms which are above the "diagonal" measure and exhibit higher Na.m -- such as "online front-ends," "transport protocols," "client-server systems," "security of data," or "object-oriented languages" -- are relatively particular to the data set. We can take them as particularly important terms in depicting research activity pertaining to the "internet" data set.
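The placement of a term on such a log-log map can be sketched as follows. The text does not define the "diagonal," so this sketch assumes it is a straight line whose slope and intercept are supplied by the analyst. The function names, the base-10 logarithm, and the sample values are likewise assumptions made for illustration.

```python
import math


def map_point(domain_size, nam_value):
    """Return (log10(domain size), log10(Na.m)) for one term."""
    return math.log10(domain_size), math.log10(nam_value)


def is_above_diagonal(domain_size, nam_value, slope, intercept):
    """True if the term lies above the assumed line y = slope * x + intercept."""
    x, y = map_point(domain_size, nam_value)
    return y > slope * x + intercept


# Hypothetical values and a hypothetical diagonal with slope -1 through the origin.
point = map_point(1000, 0.01)
particular = is_above_diagonal(1000, 0.01, slope=-1.0, intercept=0.0)
noise_like = is_above_diagonal(100000, 0.0000001, slope=-1.0, intercept=0.0)
```

Under these assumptions, a large domain with a low Na.m falls below the line and would be treated as a candidate noise term, while a smaller domain with a higher Na.m falls above it.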

We are developing other innovation indicators. We think important latent patterns in users' data sets can be discovered from such indicator maps. In the near future, we expect that an analyst can create a set of innovation indicator maps, such as the figures in our "KDD/Data Mining" and "Fuel Cell" reports, by just striking a few keys and making a few choices.