Summary: | The planning and management of research and development is a challenging process which is compounded by the large amounts of information which is available. The goal of this project is to mine science and technology databases for patterns and trends which facilitate the formation of research strategies. Examples of the types of information sources which we exploit are diverse and include academic journals, patents, blogs and news stories. The intended outputs of the project include growth forecasts for various technological sectors (with an emphasis on sustainable energy), an improved understanding of the underlying research landscape, as well as the identification of influential researchers or research groups.
This paper focuses on the development of techniques to both organize and visualize the data in a way which reflects the semantic relationships between keywords. We studied the use of the joint term frequencies of pairs of keywords, as a means of characterizing this semantic relationship – this is based on the intuition that terms which frequently appear together are more likely to be closely related. Some of the results reported herein describe: (1) Using appropriate tools and methods, exploitable patterns and information can certainly be extracted from publicly available databases, (2) Adaptation of the Normalized Google Distance (NGD) formalism can provide measures of keyword distances that facilitate keyword clustering and hierarchical visualization, (3) Further adaptation of the NGD formalism can be used to provide an asymmetric measure of keyword distances to allow the automatic creation of a keyword taxonomy, and (4) Adaptation of the Latent Semantic Approach (LSA) can be used to identify concepts underlying collections of keywords.
|