Lehigh CSC 2002 Technical Reports

LU-CSE-02-001
Detection of Emerging Trends: Automation of Domain Expert Practices
David Gevry

The automatic detection of emerging trends is an important research area in the field of textual data mining. Explosive increases in technology require the development of precise forecasting tools to drill down into the textual artifacts of various research communities and reveal emerging innovations to technology planners and investors. This thesis presents two manual methodologies that have been developed in the study of approaches to the task of emerging trend detection. These methods are then integrated in order to improve the overall precision of each. The performance of this combined methodology is evaluated using the standard metric of precision, and it is shown with a confidence of 95% that the use of this methodology improves precision for the detection of emerging trends. The overall goal of this research is the automation of various domain expert trend detection practices and the integration of these modules into a fully automated system. The first two methods of this automation are presented and their usability and performance are evaluated. We show that these tools aid in increasing the efficiency of the task of emerging trend detection and propose improvements to these tools as well as future plans for the full automation of the presented methodology.

PDF (104 pages, 338KB)

LU-CSE-02-002
The Benefits and Drawbacks of HTTP Compression
Timothy J. McLaughlin

HTTP compression addresses some of the performance problems of the Web by attempting to reduce the size of resources transferred between a server and client, thereby conserving bandwidth and reducing user-perceived latency. Currently, most modern browsers and web servers support some form of content compression. Additionally, a number of browsers are able to perform streaming decompression of gzipped content.
Despite this existing support for HTTP compression, it remains an underutilized feature of the Web today. This can perhaps be explained, in part, by the fact that there currently exists little proxy support for the Vary header, which is necessary for a proxy cache to correctly store and handle compressed content. To demonstrate some of the quantitative benefits of compression, we conducted a test to determine the potential byte savings for a number of popular web sites. Our results show that, on average, 27% byte reductions are possible for these sites. Finally, we provide an in-depth look at the compression features that exist in Microsoft's Internet Information Services (IIS) 5.0 and the Apache 1.3.22 web server and perform benchmarking tests on both applications.

PDF (19 pages, 961KB)

LU-CSE-02-003
HTTPflow: An Introduction
Lev Grevnin and Brian D. Davison

Internet performance measurement is becoming increasingly important. As more computers join this global network, downloads, gaming, online presentations and even simple messaging may all experience considerable lag times and communication errors. The current work breaks off a piece of the bigger problem and attempts to take a peek into the performance of the HTTP protocol, the driving force behind Web browsing. Being able to analyze HTTP streams for performance can provide a glimpse into this problem for the network under investigation. This paper introduces HTTPflow, which captures packets and extracts HTTP-specific information on a per-TCP-flow basis. HTTPflow examines all port 80 traffic and extracts HTTP headers and timings for requests and responses.

PDF (12 pages, 126KB)

LU-CSE-02-004
IPQ: A Communication System for Distributed Virtual Environments with a Network Protocol using a Semi-optimistic, Sender Initiated Acknowledgement Scheme
Scott Frees and G.
Drew Kessler

Virtual Environment (VE) applications are generally comprised of a number of different tasks that are responsible for such things as tracking user motion, rendering screen output, or performing collision detection in the environment. These tasks are of different priority and often have different computing requirements. It is for this reason that VE applications lend themselves to a distributed computing environment. Since producing an effective VE application is greatly dependent on the speed of the system, the efficiency of information passing between tasks becomes a very important concern. It is out of this concern that the Inter-Process-Queue (IPQ) was developed. IPQ works under the idea that many of the messages sent between VE tasks are simply state updates and a task on the receiving end of these messages is only concerned with the most recent updates. By implementing an information passing abstraction called an updateable queue, IPQ automatically drops obsolete and extraneous state updates while en route to the receiving task. In addition to a detailed overview of the basic ideas and structure of the IPQ system, we present the new networking protocol used by IPQ to communicate with remote machines. We compare two implementations of a reliable UDP protocol using sender-initiated acknowledgment schemes. The first implementation is a more pessimistic approach, requiring an acknowledgment to be sent each time a message is received. The second, new and more optimistic approach uses a cumulative positive and negative acknowledgment scheme to cut down on network traffic. We present performance results comparing these two protocols along with tests comparing TCP against IPQ using this new protocol.

PDF (40 pages, 1020KB)

LU-CSE-02-005
Transitivity and the Co-occurrence Relation in LSI
April Kontostathis and William M.
Pottenger

Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of Information Retrieval systems. Researchers use experimental methods to determine the appropriate number of dimensions for a given application. We propose the development of a theoretical foundation for determination of this parameter for LSI. We assert that LSI’s use of higher orders of co-occurrence is critical to this optimization function. In this work we present experiments that precisely determine the degree of transitivity used in LSI. We empirically demonstrate that LSI uses up to fourth order term co-occurrence. We also prove mathematically that a transitivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of the degree of transitivity will be key to understanding how a reflexive, symmetric and transitive relation based on the co-occurrence relation can form semantic equivalence classes for a collection.

PDF (9 pages, 61KB)

LU-CSE-02-006
A Mathematical View of Latent Semantic Indexing: Tracing Term Co-occurrences
April Kontostathis and William M. Pottenger

Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI’s use of higher orders of co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.
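
As a rough illustration of what "order of co-occurrence" means in the two LSI abstracts above, the sketch below treats terms as graph nodes, links two terms when they share a document, and reports the length of the shortest connectivity path between them. The toy documents, the function names `build_cooccurrence_graph` and `cooccurrence_order`, and the use of breadth-first search are assumptions made for illustration. They are not taken from the reports, and the sketch does not compute LSI itself.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Map each term to the set of terms it shares at least one document with."""
    graph = {}
    for doc in documents:
        terms = set(doc.split())
        for term in terms:
            graph.setdefault(term, set())
        for a, b in combinations(terms, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def cooccurrence_order(graph, source, target):
    """Return the shortest connectivity path length between two terms.

    1 means the terms appear together in some document (first order),
    2 means they are linked only through a common neighbour (second order),
    and so on. Returns 0 when source == target and None when no path exists.
    """
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        term, dist = queue.popleft()
        if term == target:
            return dist
        for neighbour in graph[term]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy collection: each string is one document.
docs = ["river bank", "bank loan", "loan interest", "tree leaf"]
graph = build_cooccurrence_graph(docs)
```

In this toy collection, "river" and "loan" never share a document but are connected through "bank", which is the kind of higher-order path the abstracts refer to.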
PDF (11 pages, 60KB)

LU-CSE-02-007
The Effect of Avatar Connectedness on Task Performance
John M. Linebarger and G. Drew Kessler

Previous research has examined the relationship between avatar representation and the sense of presence and co-presence in a shared virtual environment. A positive correlation has been found between the realism of the representation and both categories of presence. However, to our knowledge no research has focused explicitly on the effect of avatar representation in general and connectedness or embodiment in particular on task performance in a virtual environment. Three experiments are described which directly address this issue. Two of the experiments involve the representation of the self avatar in an immersive virtual environment in performing repetitive, semi-precise and non-repetitive, precise tasks. The third addresses the representation of avatars of other users, in both immersive and desktop virtual environment interfaces, in performing direct and avatar-mediated object-focused tasks. Four avatar representations were tested, each of which was either "connected" (i.e., embodied) or not, and "correlated" (i.e., color coded) or not. No significant difference in task completion time was observed between comparable self avatar representations in either task category, or between the representations of avatars of other users in direct object-focused tasks. For avatar-mediated object-focused tasks the representation was significant, with correlation having a much greater impact than connection on task completion times. Thus simpler, less computationally expensive avatar representations are quite adequate for task performance in a virtual environment for certain kinds of tasks.

PDF (8 pages, 547KB)

LU-CSE-02-008
Error-Driven Boolean-Logic-Rule-Based Learning for Mining Chat-room Conversations
Tianhao Wu, Faisal M. Khan, Todd A. Fisher, Lori A. Shuler and William M.
Pottenger

The ephemeral nature of human communication via networks today poses interesting and challenging problems for information technologists. The sheer volume of communication in venues such as email, newsgroups, and chat precludes manual techniques of information management. Currently, no systematic mechanisms exist for accumulating these artifacts of communication in a form that lends itself to the construction of models of semantics. In essence, dynamic techniques of analysis are needed if textual data of this nature is to be effectively mined.

At Lehigh University we are developing a text mining tool for analysis of chat-room conversations. Project goals concentrate on the development of functionality to answer questions such as "What topics are being discussed in a chat-room?", "Who is discussing which topics?" and "Who is interacting with whom?" The objective is to develop technology that can automatically identify such patterns of interaction in both social and semantic terms. In this article we present our preliminary findings for a novel technique developed to identify threads of conversation in multi-topic, multi-person chat-rooms. This is the first step towards building models of social and semantic interaction. We term our technique Error-Driven Boolean-Logic-Rule-Based Learning (BLogRBL), a variation on Brill's Transformation Based Learning. Similar to Brill's method, rules are automatically derived from templates during learning. It differs from Brill's technique in that rules take the form of complex expressions of combinational logic. We report on the scope and design of our technique and discuss preliminary results.

PDF (13 pages, 68KB)

LU-CSE-02-009
Improving Retrieval Performance with Positive and Negative Equivalence Classes of Terms
April Kontostathis and William M.
Pottenger

One of the most pressing problems facing application developers in the area of information retrieval (IR) is the lack of sound mathematical, theoretical frameworks for understanding IR [SIGIR2000]. Although many such frameworks have been proposed, in the final analysis none has been sufficiently well-grounded to attain widespread acceptance in the field. In addition, there is all too often a lack of empirically sound evaluation of such frameworks in an actual application. For this reason we have forayed into the theoretical domain of IR, while at the same time grounding our work in an application of widespread importance, search and retrieval. One need only glance at the statistics of the hit counts of the latest search engines to realize just how important search and retrieval has become. In this paper we present a novel approach to term clustering and its application in improving the performance of search and retrieval. Our approach is firmly grounded in a theoretical framework that we have developed. Term clustering is an approach that researchers have used to convert the original words of a document into more effective content identifiers. Term clustering algorithms generally consist of two phases. In the first phase term-term similarity is determined. The second phase uses the term-term similarities to develop clusters of terms. Latent Semantic Indexing (LSI) [Deerwester, et al - 1990] is a well-known information retrieval algorithm that is based on Singular Value Decomposition (SVD). The values in the truncated term-term matrix produced by SVD can be treated as similarity measures for input to a clustering algorithm. In this work we present an algorithm that produces clusters of terms that improve retrieval performance (as measured by precision and recall). We assume that the value in position (i,j) of the term-term matrix represents the similarity between term i and term j in the collection.
By extension, a negative value represents an anti-similarity between term i and term j. Our approach searches for both positive and negative clusters of terms. We show that the positive clusters, when used to expand an initial query, result in significant improvements in recall for a given collection. Furthermore, the negative clusters, when used to prune the result set, result in significant improvements in precision. To our knowledge, these are the first significant results that show that anti-similarity clusters exist and can be used to improve performance of search and retrieval in IR.
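
The idea of expanding a query with positive clusters and pruning results with negative clusters can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the threshold values, the similarity numbers (which only stand in for entries of a term-term matrix), the names `clusters_for` and `search`, and the choice to prune any document containing a negatively clustered term. The report's actual clustering algorithm is not reproduced.

```python
def clusters_for(term, similarity, pos_threshold=0.5, neg_threshold=-0.5):
    """Split the terms related to `term` into a positive and a negative cluster."""
    positive, negative = set(), set()
    for other, value in similarity.get(term, {}).items():
        if value >= pos_threshold:
            positive.add(other)
        elif value <= neg_threshold:
            negative.add(other)
    return positive, negative


def search(query_terms, documents, similarity):
    """Expand the query with positive clusters, then prune with negative ones."""
    expanded, excluded = set(query_terms), set()
    for term in query_terms:
        positive, negative = clusters_for(term, similarity)
        expanded |= positive
        excluded |= negative
    excluded -= expanded  # never prune on a term the query itself asks for
    results = []
    for doc_id, text in documents.items():
        terms = set(text.split())
        # Keep documents that match the expanded query and contain
        # no term from a negative cluster.
        if terms & expanded and not terms & excluded:
            results.append(doc_id)
    return sorted(results)


# Hypothetical similarity values standing in for term-term matrix entries.
similarity = {
    "car": {"automobile": 0.8, "engine": 0.3, "banana": -0.7},
}
documents = {
    "d1": "car repair manual",
    "d2": "automobile engine service",
    "d3": "banana car toy",
    "d4": "fruit salad recipe",
}
hits = search(["car"], documents, similarity)
```

Here the expansion step is what lets `d2` match even though it never mentions "car" (the recall side), while the pruning step removes `d3` (the precision side).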
PDF (11 pages, 60KB)

LU-CSE-02-010
Detecting Patterns in the LSI Term-Term Matrix
April Kontostathis and William M. Pottenger

Higher order co-occurrences play a key role in the effectiveness of systems used for text mining. A wide variety of applications use techniques that explicitly or implicitly employ a limited degree of transitivity in the co-occurrence relation. In this work we show the use of higher orders of co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our empirical and mathematical studies prove that term co-occurrence plays a crucial role in LSI. This work is the first to study the values produced in the truncated term-term matrix, and we have discovered an explanation for why certain term pairs receive a high similarity value, while others receive low (and even negative) values. Thus we have discovered the basis for the claim that is frequently made for LSI: LSI emphasizes important semantic distinctions (latent semantics) while reducing noise in the data. The correlation between the number of connectivity paths between terms and the value produced in the truncated term-term matrix is another important component in the theoretical foundation for LSI. Patterns we discover in the LSI term-term matrix will be used, in future work, to develop an approximation algorithm for LSI. Our goal is to approximate the LSI term-term matrix using a faster algorithm. This matrix can then be used in place of the LSI matrix in a variety of applications, such as our unsupervised term clustering algorithm.
PDF (10 pages, 60KB)

LU-CSE-02-011
Mining Chat-room Conversations for Social and Semantic Interactions
Faisal M. Khan, Todd A. Fisher, Lori Shuler, Tianhao Wu and William M. Pottenger

The ephemeral nature of human communication via networks today poses interesting and challenging problems for information technologists. The Intelink intelligence network, for example, has a need to monitor chat-room conversations to ensure the integrity of sensitive data being transmitted via the network. However, the sheer volume of communication in venues such as email, newsgroups, and chat precludes manual techniques of information management. It has been estimated that over 430 million instant messages, for example, are exchanged each day on the America Online network [3]. Although a not insignificant fraction of such data may be temporarily archived (e.g., newsgroups), no systematic mechanisms exist for accumulating these artifacts of communication in a form that lends itself to the construction of models of semantics [12]. In essence, dynamic techniques of analysis are needed if textual data of this nature is to be effectively mined. This article reports our progress in developing a text mining tool for analysis of chat-room conversations. Central to our efforts is the development of functionality to answer questions such as "What topics are being discussed in a chat-room?", "Who is discussing which topics?" and "Who is interacting with whom?" The objective of our research is to develop technology that can automatically identify such patterns of interaction in both social and semantic terms. In this article we report our preliminary findings in identifying threads of conversation in multi-topic, multi-person chat-rooms. We have achieved promising results in terms of precision and recall by employing pattern recognition techniques based on finite state automata.
We also report the design of our approach to building models of social and semantic interactions based on our HDDI(TM) text mining infrastructure [13].

PDF (10 pages, 173KB)

LU-CSE-02-012
A Split Stack Approach to Mobility-Providing Performance-Enhancing Proxies
Brian D. Davison, Kiran Komaravolu, and Baoning Wu

Many varieties of performance-enhancing proxies (PEPs) have been proposed to improve TCP performance and/or provide seamless mobility. One simple, albeit limited, technique is the application-layer proxy. It too can isolate degraded last-mile link performance from the transmission of data across the rest of the network. While Web services on proxies are common, not all network applications will fit the traditional proxy model. We suggest using the proxy in a fashion that is fully compatible with all applications. In the split stack approach, the application runs on the client, but client networking library calls are executed on the proxy. The split stack architecture provides three major benefits, even with limited deployment: seamless handling of last-mile disconnects; mobility without MobileIP; and improved network performance. This paper outlines the split stack architecture, relates it to previous techniques, and provides some estimates of potential performance improvement.

PDF (13 pages, 390KB)