Lehigh University




   

Michele Kimpton

"Saving the Web for Future Generations"

Thursday, November 30, 4:00 PM

Packard Lab -- 466

 

The Internet Archive (IA) is building a public Internet digital library.  Over the last six years, it has gathered the largest public web archive in the world, containing billions of web pages from over 35 million sites, for a total of over a petabyte (10^15 bytes) of data.

Attempting to archive the entire publicly available web appears to be such an enormous task that many believe it is impossible, and some question the value of even trying.  However, archiving selectively would also be a challenging task because of the web’s ever-changing content and seamless borders. The IA has decided that a policy of selection is more expensive today, and far riskier for users in the future, than saving it all.

I will describe the IA’s approach to archiving the global web:  the techniques we use to collect, store, and provide access to the data; the challenges we face in collecting and preserving billions of web pages; and the many benefits (and some non-obvious compromises) inherent in large-scale web archiving.


 

     


©2008 P.C. Rossin College of Engineering & Applied Science
Computer Science & Engineering, Packard Laboratory, Lehigh University, Bethlehem PA 18015