Facebook is a large social networking platform founded in 2004 by Mark Zuckerberg, Eduardo Saverin, Dustin Moskovitz, and Chris Hughes. It is now one of the top five technology companies, alongside Microsoft, Amazon, Apple, and Google. The company behind Facebook was renamed Meta in 2021.
Facebook is one of the most rapidly growing social media platforms, as the graph below shows. In the third quarter of 2021, it had roughly 2,910 million (2.91 billion) monthly active users.
According to Facebook's most recent update report on March 13, 2021, the total number of photos uploaded by users has surpassed 10 billion. Every day, 2-3 TB of photos are uploaded to Facebook, and Facebook serves over 15 billion photo images. Photo traffic now peaks at over 300,000 images per second.
To understand how Facebook handles such massive traffic, or to build an app like Facebook, it helps to know its technology stack: the programming languages, frontend, backend, storage, and web servers it uses, how it handles videos, how it scales the product, how it implements real-time functionality, which cloud services it relies on, and so on.
Programming Language used by Facebook
Facebook uses several programming languages, including PHP, Hack, C++, and Erlang.
Facebook's first programming language was PHP, which stands for PHP: Hypertext Preprocessor. It is a popular open-source, general-purpose scripting language with a focus on web development. Standard PHP applications run on the Zend Engine. The main problem with PHP at Facebook was scalability, so Facebook also uses other languages and runtimes.
HipHop for PHP is a source-to-source compiler that converts PHP scripts into optimized C++ and then compiles the result into machine code with g++, which improves performance over interpreting the PHP directly.
HipHop for PHP was discontinued in 2013 and replaced by the HipHop Virtual Machine (HHVM), which is discussed below.
Hack is a programming language released by Facebook engineers in 2014. It looks like PHP and keeps PHP's features while adding enhancements such as static type checking, nullable types, collections, easier refactoring, and better support for asynchronous programming (it lets you start multiple tasks that make progress concurrently).
PHP developers had proposed these features for a long time, but they had yet to be implemented. Hack puts them into practice, backed by Facebook's significant investments in HHVM and Hack.
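The asynchronous style described above can be illustrated with a small sketch. It is written with Python's asyncio rather than Hack, purely as an analogy, and the function names and delays are hypothetical stand-ins for real I/O calls.

```python
import asyncio
from typing import List


async def fetch(name: str, delay: float) -> str:
    # Stand-in for an I/O-bound call (for example a database or cache lookup).
    await asyncio.sleep(delay)
    return f"{name} done"


async def load_page() -> List[str]:
    # Start several tasks and wait for all of them. While one task is
    # waiting on I/O, the others can make progress.
    return await asyncio.gather(
        fetch("profile", 0.02),
        fetch("friends", 0.01),
        fetch("feed", 0.03),
    )


if __name__ == "__main__":
    # Results come back in the order the tasks were passed to gather().
    print(asyncio.run(load_page()))
```

The total wait is close to the slowest task rather than the sum of all three, which is the main benefit of issuing independent requests concurrently.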
In terms of performance, the HHVM (HipHop Virtual Machine) engine outperforms the Zend Engine. Hack language, JIT compilation, FastCGI support, HNI, hphpd debugger, and other features are supported by the HHVM Engine.
You can also migrate PHP code to Hack gradually, or mix Hack code with regular PHP code. Hack code runs only on HHVM, whereas HHVM was built to run both Hack and regular PHP code.
Facebook's chat logging module is written in C++ and logs information between UI page loads. The user presence module is also written in C++ and provides information such as the online availability of a user's connections.
The User Presence Module aggregates the online information of users in memory and sends it to the client when necessary. (source)
Facebook's message queuing and delivery functionality is written in Erlang. Erlang is a concurrent functional programming language designed for high availability, real-time scalability, and fault tolerance.
Database used by Facebook
Facebook stores structured data such as likes, comments, and shares in MySQL, a persistent relational database. MySQL is Facebook's primary database for storing social data. Facebook originally ran MySQL with InnoDB as the storage engine, but InnoDB compresses data relatively inefficiently and therefore takes up more disk space. As a result, Facebook later created MyRocks, which integrates RocksDB as a new MySQL storage engine.
MyRocks compresses data better than InnoDB, so the same data occupies less space on disk. According to Facebook, MyRocks uses approximately 50% less space than InnoDB. Both InnoDB and MyRocks are used as storage engines at Facebook.
Benefits of MyRocks
- Greater space efficiency: takes up less storage space than InnoDB.
- Improved write efficiency: lower write amplification than InnoDB.
- Faster replication.
- Quicker data loading.
Apache Cassandra is a NoSQL database management system that was originally created at Facebook to power inbox search and was later open-sourced as an Apache project.
The goal was to create a distributed storage system for managing large amounts of structured data across many commodity servers with no single point of failure. Its characteristics include scalability, high performance, and high availability.
Caching is the practice of keeping frequently requested data in fast storage so that it can be retrieved quickly. It reduces the load that repeated requests for the same content place on the server.
Memcached is a distributed in-memory caching system that Facebook uses to cache the results of user requests. A request first checks the cache: if the data is found there, the response is served from the cache; if not, the data is looked up in the database.
So the database server is hit only when the data is not available in the cache. This greatly reduces database load and latency.
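The read path described above is often called a look-aside cache. Here is a minimal sketch of the idea, assuming a plain dictionary in place of a real Memcached cluster and a hypothetical `Database` class in place of MySQL; none of these names come from Facebook's code.

```python
from typing import Dict, Optional


class Database:
    """Hypothetical stand-in for a slow persistent store such as MySQL."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self.rows = rows
        self.reads = 0  # counts how often the database is actually hit

    def get(self, key: str) -> Optional[str]:
        self.reads += 1
        return self.rows.get(key)


class LookAsideCache:
    """Check the cache first and fall back to the database only on a miss."""

    def __init__(self, db: Database) -> None:
        self.db = db
        self.cache: Dict[str, str] = {}  # plays the role of the memory cache

    def get(self, key: str) -> Optional[str]:
        if key in self.cache:        # cache hit: no database access
            return self.cache[key]
        value = self.db.get(key)     # cache miss: read from the database
        if value is not None:
            self.cache[key] = value  # populate the cache for later requests
        return value


if __name__ == "__main__":
    db = Database({"user:1": "Alice"})
    store = LookAsideCache(db)
    store.get("user:1")  # miss, reads the database
    store.get("user:1")  # hit, served from the cache
    print(db.reads)      # prints 1
```

A production setup would also need to expire or invalidate cached entries when the underlying row changes, which this sketch leaves out.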
Big Data in Facebook
Offline processing is done using Hadoop and Hive.
Apache Hadoop is used by Facebook to run analytics, provide distributed storage, and store MySQL database backups.
Apache Hive is a Hadoop-based data warehouse software project for running data queries and analytics. It converts SQL-style queries into MapReduce jobs. It can easily handle larger joins.
Apache HBase is used to store Facebook messages. It is an Apache open-source, large-scale distributed non-relational database built on top of the Hadoop Distributed File System (HDFS).
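To make the translation from SQL to MapReduce concrete, the sketch below shows how a query such as `SELECT country, COUNT(*) FROM users GROUP BY country` could be expressed as map, shuffle, and reduce phases. It is plain single-process Python, not Hive's actual execution plan, and the table and column names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def map_phase(rows: Iterable[Row]) -> List[Tuple[str, int]]:
    # Emit one (key, 1) pair per row, keyed by the GROUP BY column.
    return [(row["country"], 1) for row in rows]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values that share a key, as the framework does between phases.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    # Sum the values for each key, which yields COUNT(*) per group.
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows: Iterable[Row]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    users = [{"country": "US"}, {"country": "IN"}, {"country": "US"}]
    print(count_by_country(users))
```

On a real cluster the map and reduce functions run on many machines in parallel, and the shuffle step moves data between them over the network.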
Large data sets can be stored on top of HDFS file storage, and billions of rows in HBase tables can be aggregated and analyzed.
PrestoDB is a free, open-source distributed SQL query engine that runs interactive queries against multiple internal data stores, including petabytes of data in the data warehouse. It executes the stages of a query concurrently, which makes queries much faster. Facebook uses PrestoDB to process data in its Hive warehouse, including massive batch pipeline workloads.
Presto can work with both non-relational sources (such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase) and relational databases (such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata).
One of the most popular features on Facebook is photo sharing. Photos shared on Facebook are stored in Haystack.
Haystack is a high-performance photo storage and retrieval system built for Facebook's photo-sharing workload, in which data is written once, read often, never modified, and rarely deleted.
Haystack reads data with high throughput and low latency, typically using only one disk operation per read. It is fault tolerant: photos are replicated across multiple geographical areas, so they remain available if servers in a single location fail. It is also less expensive than the traditional NFS-based approach.
As noted earlier, Facebook serves over 15 billion photo images per day, and it also saves four different resolutions of each photo. Each photo has metadata, which the Haystack storage machine keeps in main memory so that it can locate the required photo without touching the disk. This saves disk operations for reading the actual data, increasing overall throughput.
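The idea of keeping photo metadata in memory so that each read needs a single disk access can be sketched as follows. This is a simplified illustration, not Haystack's actual on-disk format; the `PhotoStore` class and its methods are hypothetical.

```python
import os
import tempfile
from typing import Dict, Tuple


class PhotoStore:
    """Append-only file of photo bytes plus an in-memory (offset, size) index."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.index: Dict[str, Tuple[int, int]] = {}  # photo id -> (offset, size)
        open(self.path, "ab").close()  # create the file if it does not exist

    def put(self, photo_id: str, data: bytes) -> None:
        offset = os.path.getsize(self.path)  # new photo starts at the file end
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id: str) -> bytes:
        offset, size = self.index[photo_id]  # metadata lookup in memory only
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)  # a single positioned read from disk


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = PhotoStore(os.path.join(tmp, "photos.dat"))
        store.put("a", b"first-photo")
        store.put("b", b"second")
        print(store.get("b"))
```

Because many photos share one large file, the filesystem does not need a separate metadata lookup per photo, which is the cost the NFS-based approach pays on each read.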
Varnish is an open-source HTTP accelerator that can act as a load balancer, cache content, and serve photos, profile pictures, and other media at lightning speed.
Apache Thrift is an RPC framework that enables communication between services written in widely different technologies and languages, such as C++, Erlang, PHP, and C#. To do this, Apache Thrift provides cross-language serialization.
Scribe is a free and open-source log server that aggregates log data streamed in real time from a variety of other servers. It can be used to record a wide range of data. It is built on top of Apache Thrift.