搜索引擎背后的数据结构和算法

Posted on 2021-03-23

整体系统介绍

搜索引擎大致可以分为四个部分：搜集、分析、索引、查询：

搜集，就是我们常说的利用爬虫爬取网页；
分析，主要负责网页内容抽取、分词，构建临时索引，计算 PageRank 值这几部分工作；
索引，主要负责通过分析阶段得到的临时索引，构建倒排索引；
查询，主要负责响应用户的请求，根据倒排索引获取相关网页，计算网页排名，返回查询结果给用户；

搜集

搜索引擎把整个互联网看作数据结构中的有向图，把每个页面看作一个顶点。如果某个页面中包含另外一个页面的链接，那我们就在两个顶点之间连一条有向边。我们可以利用图的遍历搜索算法，来遍历整个互联网中的网页。我们前面介绍过两种图的遍历方法，深度优先和广度优先。搜索引擎采用的是广度优先搜索策略。具体点讲的话，那就是，我们先找一些比较知名的网页（专业的叫法是权重比较高）的链接（比如新浪主页网址、腾讯主页网址等），作为种子网页链接，放入到队列中。爬虫按照广度优先的策略，不停地从队列中取出链接，然后去爬取对应的网页，解析出网页里包含的其他网页链接，再将解析出来的链接添加到队列中。

爬虫在爬取网页的过程中，涉及的四个重要的文件。其中，links.bin 和 bloom_filter.bin 这两个文件是爬虫自身所用的；另外的两个 doc_raw.bin 和 doc_id.bin 是作为搜集阶段的成果，供后面的分析、索引、查询用的。

待爬取网页链接文件：links.bin

我们可以把整个页面看作一个大的字符串，然后利用字符串匹配算法，在这个大字符串中，搜索这样一个网页标签，然后顺序读取之间的字符串 – 这其实就是网页链接。在广度优先搜索爬取页面的过程中，爬虫会不停地解析页面链接，将其放到队列中。于是，队列中的链接就会越来越多，可能会多到内存放不下。所以，我们用一个存储在磁盘中的文件（links.bin）来作为广度优先搜索中的队列。爬虫从 links.bin 文件中，取出链接去爬取对应的页面。等爬取到网页之后，将解析出来的链接，直接存储到 links.bin 文件中。这样用文件来存储网页链接的方式，还有其他好处。比如，支持断点续爬。也就是说，当机器断电之后，网页链接不会丢失；当机器重启之后，还可以从之前爬取到的位置继续爬取。

Redis 常用数据类型对应的数据结构

Posted on 2021-03-23

Redis 数据库介绍

Redis 是一种键值（Key-Value）数据库。相对于关系型数据库（比如 MySQL），Redis 也被叫作非关系型数据库。像 MySQL 这样的关系型数据库，表的结构比较复杂，会包含很多字段，可以通过 SQL 语句，来实现非常复杂的查询需求。而 Redis 中只包含“键”和“值”两部分，只能通过“键”来查询“值”。正是因为这样简单的存储结构，也让 Redis 的读写效率非常高；除此之外，Redis 主要是作为内存数据库来使用，也就是说，数据是存储在内存中的。尽管它经常被用作内存数据库，但是，它也支持将数据存储在硬盘中。Redis 中，键的数据类型是字符串，但是为了丰富数据存储的方式，方便开发者使用，值的数据类型有很多，常用的数据类型有这样几种，它们分别是字符串、列表、字典、集合、有序集合。

Additional Resources

Posted on 2021-02-28

Here are some useful links for further reading:

Dynamo - Highly available key-value store;
Kafka - A distributed messaging system for log processing;
Consistent Hashing - Original paper;
Paxos - Protocol for distributed consensus;
Concurrency Controls -
Optimistic methods for concurrency controls;
Gossip Protocol - For failure detection and more;
Chubby - Lock service for loosely-coupled distributed systems;
ZooKeeper - Wait-free coordination for Internet-scale systems;
MapReduce - Simplified data processing on large clusters;
Hadoop - A distributed file system;

Designing Ticketmaster

Posted on 2021-02-28

What is an Online Movie Ticket Booking System?

A movie ticket booking system provides its customers the ability to purchase theatre seats online. E-ticketing systems allow the customers to browse through movies currently being played and to book seats, anywhere anytime.

Requirements and Goals of the System

Our ticket booking service should meet the following requirements:
Functional Requirements:

Our ticket booking service should be able to list different cities where its affiliate cinemas are located;
Once the user selects the city, the service should display the movies released in that particular city;
Once the user selects a movie, the service should display the cinemas running that movie and its available showtimes;
The user should be able to choose a show at a particular cinema and book their tickets;
The service should be able to show the user the seating arrangement of the cinema hall. The user should be able to select multiple seats according to their preference;
The user should be able to distinguish available seats from booked ones;
Users should be able to put a hold on the seats for five minutes before they make a payment to finalize the booking;
The user should be able to wait if there is a chance that the seats might become available, e.g., when holds by other users expire;
Waiting customers should be serviced in a fair, first come, first serve manner;

Non-Functional Requirements:

The system would need to be highly concurrent. There will be multiple booking requests for the same seat at any particular point in time. The service should handle this gracefully and fairly;
The core thing of the service is ticket booking, which means financial transactions. This means that the system should be secure and the database ACID compliant;

Designing Uber backend

Posted on 2021-02-27

What is Uber?

Uber enables its customers to book drivers for taxi rides. Uber drivers use their personal cars to drive customers around. Both customers and drivers communicate with each other through their smartphones using the Uber app.

Requirements and Goals of the System

Let’s start with building a simpler version of Uber.
There are two types of users (Drivers, Customers) in our system:

Drivers need to regularly notify the service about their current location and their availability to pick passengers;
Passengers get to see all the nearby available drivers;
Customer can request a ride; nearby drivers are notified that a customer is ready to be picked up;
Once a driver and a customer accept a ride, they can constantly see each other’s current location until the trip finishes;
Upon reaching the destination, the driver marks the journey complete to become available for the next ride;

Capacity Estimation and Constraints

Let’s assume we have 300M customers and 1M drivers with 1M daily active customers and 500K daily active drivers;
Let’s assume 1M daily rides;
Let’s assume that all active drivers notify their current location every three seconds;
Once a customer puts in a request for a ride, the system should be able to contact drivers in real-time;