How Should You Design Database For Twitter
Twitter currently has 300 million active users globally. On Twitter, around 6,000 tweets are sent per second. In addition, every second 6,00,000 queries are performed to get the timeframes!!
What are the user features?
The user must be able to tweet as quickly as feasible.
Tweet Timeline should be visible to the user (s)
User timeline: Displays the user’s tweets as well as the tweets that the user retweets.
Home timeline: Tweets from individuals the user follows are shown.
Timeline of the search: Show search results based on #tags or the search phrase
The user should have the ability to follow another user.
Users should be able to tweet to millions of followers in a matter of seconds (5 seconds)
Trends should be visible to the user.
Before creating Twitter, consider the following:
When you consider all of the qualities, it seems that read is more important than write.
It’s OK to have eventual uniformity; it’s not a big deal if the user reads his follower’s tweet a little later.
Because tweets are restricted to 140 characters, space is not an issue.
So, let’s figure out how to create alternative timeframes.
Because we know the system is read-heavy, let’s utilize REDIS to access most of the information quicker and to save data, but don’t forget to keep a copy of the tweet and other user-related information in the database.
The Twitter service’s basic architecture consists of a User Table, a Tweet Table, and a Followers Table.
User data is saved in the User Table.
When a user tweets, the tweet is saved in the Tweet Table along with the user’s ID.
Tweet Table will have one or more associations with User Table.
When a user follows another user, it is recorded in the Followers Table and cached in Redis.
So, how do you create a USER TIMELINE? It is shown below
Retrieve all tweets for a certain user from the Global Tweet Table/Redis.
Retweets are also included; preserve retweets as tweets with the original tweet reference.
Display it on the user timeline and sort it by date and time.
As an optimization, store user timelines in the cache; for example, celebrities’ timelines are viewed millions of times, so it’s not much use getting it from the database all the time.
Timeline of the house
Strategy No. 1:
Get a list of all the people that this user is following (from followers table)
Retrieve tweets from the global tweet database for all users.
Show it on your home timeline.
The disadvantage is that this massive search process on a relational database is NOT scalable. Though we may utilize sharing and other techniques, this massive search process will take longer as the tweet database increases to millions of records.
Strategy2:
A write-based fanout strategy is used to solve the problem. When tweets come, do a lot of processing to find out where they should go. This allows you quick and simple access to read time. Don’t do any calculations on readings. Because all of the work is done on the write lane, ingest rates are slower than on the read path. So, for each active user, calculate timelines and store them in the cache, so that when the user enters the home timeline, all you have to do is acquire the timeline and display it.
You could save this data in a database, but what’s the point?
As an example of a basic twitter flow, consider the following:
Tweeted by user A
Tweets will be sent to back-end servers through Load Balancer.
The server node will store the tweet in the database/cache.
From the cache, the server node will get all of the people that follow User A.
This tweet will be injected into the in-memory timelines of his followers by the server node.
Eventually, all of User A’s followers will see User A’s tweet in their timeline.
BUT, does it always work? Is it effective?
If you look at celebrities, you will see that they have millions of followers.
Taylor Swift’s tweets might take up to 4 minutes to reach her 83 million followers through the Twitter system.
So, if someone with millions of Twitters followers’ tweets, we have to update millions of lists, which is not particularly scalable. So, here’s what we can do:
User A’s home timeline was precomputed with everyone excluding famous tweets (s)
At load time, when User A accesses his timeline, his twitter feed is blended with celebrity tweets.
Have a mapping of celebs list for each user and mix their tweets at runtime as the request occurs and cache it.
What further enhancements are possible?
Do not calculate a timeline for inactive users who have not logged in to the system in more than 15 days!!!
So, how are Trends and Trending Topics calculated?
Twitter computes trending topics using the Apache Storm and Heron frameworks.
These operations are executed on a large number of containers. These apps generate a real-time analysis of all tweets sent on the Twitter social network, which may be used to identify so-called hot topics.
Essentially, the process is calculating the most often cited phrases in the posted tweets on the Twitter social network.
The approach is known as the “Trending Hashtags” method in the realm of social network data analysis. Consider two topics A and B; the statement that A is more popular than B is similar to the fact that the number of references of subject A exceeds the number of mentions of subject B.
Information is required.
The number of times a topic has been mentioned (hashtags)
Total amount of time required to create that number of tweets
Tweet Spout: Represents a component in the topology that is used to send tweets.
Tweet Filter Bolt: Reads the tweets sent by the Tweet Spout and filters them. Only tweets with standard Unicode coded messages will be considered. In addition, violation and CC checks are performed.
Parse Tweet Bolt: Processes the filtered tweets delivered as tuples by the Tweet Filter Bolt component. Given that the tuple is filtered, we can be certain that each tweet has at least one hashtag at this level.
Count Hashtag Bolt: Counts each hashtag in the tweets processed by the component Parse Tweet Bolt. This is to obtain the hashtag as well as the number of references to it.
Total Ranker Bolt: Compiles a total ranking of all hashtags that have been tallied. In another pipeline, transforms count to rankings.
Geo Location Bolt: It combines the hashtag provided by the Parse Tweet Bolt with the tweet’s location.
Hash tag Bolt: Similar to the component Count, Hash tag provides functionality.
Hashtag Bolt adds a new component, namely location.
Redis Bolt: inserts into the Redis database.
Searching:
Early Bird makes use of an inverted full-text index. This means it takes all of the documents, divides them into words, and then creates an index for each word. Because the index is a precise string-match, it may be exceedingly quick. An SQL unordered index on a varchar field may theoretically be just as fast, and I believe large databases can do a basic string-equality query quite rapidly in that situation.
It is not necessary for Lucene to optimize for transaction processing. When you add a document, it is not necessary for queries to view it immediately. It also does not have to optimize for modifications to existing documents.
At the end of the day, if you really want to know, you must read the source. After all, both of the items you mention are open source.
It must disperse across the datacenter. It asks each Early Bird shard whether they contain material that fits this query. When you query “New York Times,” all shards are searched, and the results are sorted, merged, and reranked. Reranking is based on social evidence, which includes the amount of retweets, likes, and responses.
Databases:
Gizzard is Twitter’s distributed data storage system based on MySQL Because it does not alter data, in no DB was selected. Gizzard is only a data storage facility. Data is fed in, and it is returned to you. Many features, such as binary logs and replication, are disabled to improve efficiency on individual nodes. Gizzard is in charge of sharing, duplicating N copies of the data, and scheduling jobs.
Cassandra is used for high-speed writes and low-speed reads. Cassandra has the benefit of being able to operate on less expensive hardware than MySQL, it is simpler to scale, and they prefer schemeless architecture.
Hadoop is used to handle enormous, unstructured datasets with hundreds of billions of rows.
Vertica is used for analytics and massive aggregations and joins, eliminating the need for MapReduce tasks.
Wrap Up
I hope this post has helped you understand how Twitter works, if not perfectly.