Ethan's Blog



Indexes

Posted on 2021-02-03

Indexes are well known when it comes to databases. Sooner or later there comes a time when database performance is no longer satisfactory. One of the very first things you should turn to when that happens is database indexing.

The goal of creating an index on a particular table in a database is to make it faster to search through the table and find the row or rows that we want. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

Example: A Library Catalog

A library catalog is a register that contains the list of books found in a library. The catalog is organized like a database table, generally with four columns: book title, writer, subject, and date of publication. There are usually two such catalogs: one sorted by book title and one sorted by writer name. That way, you can either think of a writer you want to read and then look through their books, or look up a specific book title you know you want to read in case you don’t know the writer’s name. These catalogs are like indexes for the database of books: they provide a sorted list of data that is easily searchable by relevant information.

Simply put, an index is a data structure that can be perceived as a table of contents pointing us to the location where the actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index. Assuming a table containing a list of books, the following diagram shows what an index on the “Title” column looks like:
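As a rough illustration (the table, column names, and sample row here are made up), creating and using such an index can be sketched with SQLite, whose query planner will report when an index is used:

```python
import sqlite3

# In-memory database with a hypothetical "books" table mirroring the
# library catalog: title, writer, subject, date of publication.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, writer TEXT, subject TEXT, published TEXT)")
conn.execute("INSERT INTO books VALUES ('Dune', 'Frank Herbert', 'Sci-Fi', '1965')")

# The index stores the indexed column plus a pointer back to the full row.
conn.execute("CREATE INDEX idx_books_title ON books (title)")

# With the index in place, a title lookup no longer scans the whole table;
# the query plan's detail column mentions idx_books_title.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE title = 'Dune'"
).fetchone()
print(plan)
```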


Data Partitioning

Posted on 2021-02-02

Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers.

Partitioning Methods

There are many different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are three of the most popular schemes used by various large-scale applications:

  1. Horizontal Partitioning: In this scheme, we put different rows into different tables. For example, if we are storing different places in a table, we can decide that locations with ZIP codes less than 10000 are stored in one table and places with ZIP codes greater than 10000 are stored in a separate table. This is also called range-based partitioning, as we are storing different ranges of data in separate tables. Horizontal partitioning is also known as Data Sharding.
    The key problem with this approach is that if the value whose range is used for partitioning isn’t chosen carefully, the partitioning scheme will lead to unbalanced servers. In the previous example, splitting locations based on their ZIP codes assumes that places are evenly distributed across the ZIP codes. This assumption is not valid, as a densely populated area like Manhattan will contain far more places than its suburbs;
  2. Vertical Partitioning: In this scheme, we divide our data so that tables related to a specific feature are stored on their own server. For example, if we are building an Instagram-like application - where we need to store data related to users, the photos they upload, and the people they follow - we can decide to place user profile information on one DB server, friend lists on another, and photos on a third server.
    Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, it may become necessary to further partition a feature-specific DB across multiple servers (e.g., it would not be possible for a single server to handle all the metadata queries for 10 billion photos uploaded by 140 million users);
  3. Directory-Based Partitioning: A loosely coupled approach that works around the issues of the above schemes is to create a lookup service that knows your current partitioning scheme and abstracts it away from the DB access code. To find out where a particular data entity resides, we query the directory server that holds the mapping from each tuple key to its DB server. This loose coupling means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without impacting the application;
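As a minimal sketch of the directory-based approach (the server names, placement policy, and API below are invented for illustration), the directory is just a mapping the application consults before every query:

```python
class PartitionDirectory:
    """Holds the mapping from each tuple key to the DB server that owns it
    (illustrative sketch; a real directory service would persist this)."""

    def __init__(self):
        self.mapping = {}               # key -> server name
        self.servers = ["db-1", "db-2"]

    def server_for(self, key):
        # Assign unseen keys with a simple placement policy; once assigned,
        # a key keeps pointing at the same server.
        if key not in self.mapping:
            self.mapping[key] = self.servers[hash(key) % len(self.servers)]
        return self.mapping[key]

    def add_server(self, name):
        # New servers join the pool without the application code changing:
        # only keys placed after this point can land on the new server.
        self.servers.append(name)

directory = PartitionDirectory()
server = directory.server_for("user:42")
```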

Caching

Posted on 2021-02-01

Load balancing helps you scale horizontally across an ever-increasing number of servers, but caching will enable you to make vastly better use of the resources you already have as well as making otherwise unattainable product requirements feasible. Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again. They are used in almost every layer of computing: hardware, operating systems, web browsers, web applications, and more. A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end where they are implemented to return data quickly without taxing downstream levels.

Application Server Cache

Placing a cache directly on a request layer node enables the local storage of response data. Each time a request is made to the service, the node will quickly return local cached data if it exists. If it is not in the cache, the requesting node will query the data from disk. The cache on one request layer node could also be located both in memory (which is very fast) and on the node’s local disk (faster than going to network storage).

What happens when you expand this to many nodes? If the request layer is expanded to multiple nodes, it’s still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses. Two choices for overcoming this hurdle are global caches and distributed caches.


K-Way Merge

Posted on 2021-01-31

An application of the min heap; the special part is using an index to keep track of each element’s position.

This pattern helps us solve problems that involve a list of sorted arrays.

Whenever we are given “K” sorted arrays, we can use a min heap to efficiently perform a sorted traversal of all the elements across the arrays. We push the smallest (first) element of each sorted array into a min heap to get the overall minimum. While inserting elements into the min heap, we keep track of which array each element came from. We can then remove the top element from the heap to get the smallest element, and push the next element from the same array the smallest element belonged to. Repeating this process yields a sorted traversal of all elements.

Snippets

from heapq import heappush, heappop

def kth_smallest(matrix, k):
    min_heap = []
    for row in matrix:
        heappush(min_heap, (row[0], 0, row))  # keep track by index

    n = len(matrix)
    for _ in range(k - 1):
        _, idx, row = heappop(min_heap)
        if idx + 1 < n:
            heappush(min_heap, (row[idx + 1], idx + 1, row))

    return min_heap[0][0]
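The same pattern also produces a complete sorted traversal of K sorted arrays rather than stopping at the k-th element (a sketch; the input lists are an arbitrary example):

```python
from heapq import heappush, heappop

def merge_k_sorted(lists):
    """Merge K sorted lists into one sorted list using a min heap."""
    min_heap, result = [], []
    # Seed the heap with the first element of each non-empty list; the
    # list index i breaks ties so the heap never compares lists directly.
    for i, lst in enumerate(lists):
        if lst:
            heappush(min_heap, (lst[0], i, 0))
    while min_heap:
        value, i, j = heappop(min_heap)
        result.append(value)
        if j + 1 < len(lists[i]):  # push the successor from the same list
            heappush(min_heap, (lists[i][j + 1], i, j + 1))
    return result

print(merge_k_sorted([[1, 4, 7], [2, 5], [3, 6, 8]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```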

LeetCode

23. Merge k Sorted Lists
4. Median of Two Sorted Arrays
378. Kth Smallest Element in a Sorted Matrix
632. Smallest Range Covering Elements from K Lists
373. Find K Pairs with Smallest Sums

Top K Elements

Posted on 2021-01-31

A major showcase for heapq, often used together with frequency counts. An index can make complex objects comparable inside the heap, and a data class can define its own ordering via total_ordering. Sometimes a queue is needed to temporarily hold elements popped with heappop.

Any problem that asks us to find the Top K Elements among a given set falls under this pattern.

The best data structure that comes to mind to keep track of “K” elements is a heap. This pattern makes use of the heap to solve multiple problems dealing with “K” elements at a time from a set of given elements.

Snippets

from heapq import heappush, heappop

def reorganize_string(s):
    freq = dict()
    for char in s:
        freq[char] = freq.get(char, 0) + 1

    max_heap = []
    for char in freq:
        heappush(max_heap, (-freq[char], char))

    ans = ""
    while max_heap:
        queue, k = [], 2  # store popped items
        while max_heap and k > 0:
            count, char = heappop(max_heap)
            ans += char

            if -count - 1 > 0:
                queue.append((count + 1, char))

            k -= 1

        if k > 0 and queue:
            return ""

        for item in queue:
            heappush(max_heap, item)  # push back

    return ans
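For the plain “top K” variant of the pattern, heapq’s built-ins combine directly with a frequency count (a sketch in the style of Top K Frequent Elements; the function name and sample input are my own):

```python
from collections import Counter
from heapq import nlargest

def top_k_frequent(nums, k):
    """Return the k most frequent values, ranked by their counts."""
    freq = Counter(nums)
    # nlargest maintains a heap of size k internally: O(n log k) overall,
    # cheaper than fully sorting the frequency table when k is small.
    return [value for value, count in nlargest(k, freq.items(), key=lambda kv: kv[1])]

print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))  # [1, 2]
```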

LeetCode

692. Top K Frequent Words
215. Kth Largest Element in an Array
973. K Closest Points to Origin
1167. Minimum Cost to Connect Sticks
347. Top K Frequent Elements
451. Sort Characters By Frequency
703. Kth Largest Element in a Stream
658. Find K Closest Elements
1481. Least Number of Unique Integers after K Removals
1508. Range Sum of Sorted Subarray Sums
767. Reorganize String
358. Rearrange String k Distance Apart
621. Task Scheduler
895. Maximum Frequency Stack

© 2016 - 2025 necusjz