Lehigh University

Dr. Hiromichi Fujisawa

"Extracting Topics from Web Archives:  Stochastic k-Means Analysis"

Monday, September 25, 4:00 PM

Packard Lab -- Room 416

 

A series of Digital Library initiatives at Stanford University has collected a large archive of web pages, including news pages concerning the California Special Election of 2005.  Fully automatic topic analysis of this collection, at the current state of the art, requires executing a sequence of functions: news text body extraction, duplicate elimination, meta page elimination, topic extraction using stochastic k-means clustering, and topic sentence extraction.  Given a collection of 36,475 such pages, we have automatically identified 1,791 unique news pages.  A clustering experiment has identified all 'propositions' arising in that election, plus various other topics.  Stochastic k-means clustering has proven helpful in improving convergence of this analysis toward a more globally optimal solution.
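
The abstract does not spell out how the stochastic variant of k-means works. As a rough illustration only, the sketch below assumes one possible reading: each point is assigned to a cluster at random, with probability weighted toward nearer centroids, and the randomness is cooled over the iterations so that late passes behave like ordinary k-means. The function name stochastic_kmeans, its parameters (t0, cooling, seed), and the annealing schedule are all hypothetical and are not taken from the talk.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def stochastic_kmeans(points, k, iters=50, t0=1.0, cooling=0.8, seed=0):
    """Hypothetical sketch: k-means with annealed random assignment.

    Each point joins cluster j with probability proportional to
    exp(-d_j / T), where d_j is its squared distance to centroid j.
    T shrinks every iteration, so late iterations act like plain k-means.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seed centroids from the data
    temperature = t0
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for p in points:
            d = [sq_dist(p, c) for c in centroids]
            d_min = min(d)
            # Subtract d_min before exponentiating for numerical stability.
            weights = [math.exp(-(dj - d_min) / temperature) for dj in d]
            j = rng.choices(range(k), weights=weights)[0]
            members[j].append(p)
        # Keep the old centroid if a cluster happens to receive no points.
        centroids = [mean(m) if m else c for m, c in zip(members, centroids)]
        temperature = max(temperature * cooling, 1e-6)
    # Final deterministic pass: assign every point to its nearest centroid.
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
              for p in points]
    return centroids, labels


if __name__ == "__main__":
    # Toy data: two well-separated one-dimensional groups.
    toy = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
    print(stochastic_kmeans(toy, k=2))
```

In a setting like the one described, each news page would first have to be turned into a numeric vector (for example, term weights) before clustering; that step is outside this sketch.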

Dr. Fujisawa is Corporate Chief Scientist, responsible for R&D, at the Hitachi Central Research Laboratory.  He is also Chair of the IEC Technical Committee 105 on fuel cell technology.  He led R&D for the Japanese national postal code readers fielded by Hitachi.  He developed one of the first real-time speech input-output systems for voice robot commands.  He is a Fellow of the IAPR.  He serves on the boards of several journals, is widely published, and has been issued over 50 patents.

     

©2008 P.C. Rossin College of Engineering & Applied Science
Computer Science & Engineering, Packard Laboratory, Lehigh University, Bethlehem PA 18015