29-06-2017 13:30  Graduate Seminar

Effective Distributed Search for New Domains

In this thesis, we tackle key challenges that arise in today's modern, highly scalable, search engine architectures. In particular, we explore fault-tolerant distributed search, similarity search over endless data-streams, and efficient similarity search in peer-to-peer networks. We propose algorithms that better exploit the infrastructure's resources in order to improve search effectiveness. We achieve improvements by considering the internals of the search algorithms in use, rather than using them as black boxes. We develop a probabilistic model for analyzing our algorithms and proving their superiority over existing approaches. We further conduct empirical evaluations using real-world datasets that support our claims. The topics covered in this thesis are summarized as follows. Fault-tolerant distributed search: We introduce a novel approach for constructing and searching a distributed index with redundancy in the presence of failures. We propose rSmartRed, an optimal strategy for selecting the number of node replicas to search over at runtime, which considers each node's likelihood to contain results that are relevant to the query, as well as failure probability. In addition, when feasible, we propose to replace replication with repartition, which constructs independent index instances instead of exact copies. Our fault-tolerant distributed search improves search effectiveness upon failures compared to naively using replication as a black box. Similarity search over endless data-streams: We present Stream-LSH, a similarity search algorithm that uses a bounded index for indexing unbounded data. We propose a randomized policy for dynamically maintaining items in the index, which takes into account items' age, quality, and popularity attributes. We show that Stream-LSH better exploits capacity resources which improves search effectiveness compared to prior art. Efficient similarity search in peer-to-peer networks: We present NearBucket-LSH, an effective algorithm for similarity search in large-scale distributed online social networks organized as peer-to-peer overlays. As communication is a dominant consideration in distributed systems, we focus on minimizing the network cost while guaranteeing good search quality. We decrease the network cost by considering the internals of the similarity search and the peer-to-peer architecture, and harnessing their properties to our needs.

Location: 1061
Speaker: Naama Kraus
Affiliation: Dept. of Electrical Engineering Technion Back