HBase is an Apache top-level open source project that was split out from Hadoop. Because it implements most of the functionality of Google's Bigtable system in Java, it has become very popular as data volumes have grown rapidly. For Taobao, with the expansion of market scale and the development of products and technology, the amount of business data keeps increasing, and efficient insertion and reading of massive data is becoming more and more important. Since Taobao runs what is perhaps the largest single Hadoop cluster in China (known internally as "the ladder") and has a deep understanding of Hadoop products, it was natural to hope that HBase could provide such massive-data read and write services. This paper summarizes the use and optimization of HBase in Taobao's online applications over the past year.
2 Reasons
Why use HBase?
Before 2011, almost all of Taobao's back-end persistent storage ran on MySQL (not excluding a small amount of Oracle/BDB/Tair/MongoDB, etc.). MySQL, being open source, with a good ecosystem and a variety of solutions such as splitting databases and tables, had satisfied the needs of a large number of business systems at Taobao for a long time.
However, due to the diversified development of the business, the requirements of more and more business systems began to change. Generally speaking, the changes fall into the following types:
A) The amount of data is increasing. In fact, almost every online business at Taobao that is related to users has data volumes at the billion-record level, with daily system calls numbering in the billions, and historical data cannot easily be deleted. This requires a large-scale distributed file system that can provide online services for TB- or even PB-level data.
B) The data volume grows quickly and may not be accurately predictable. Most application systems show a rapid upward trend for a period of time after going online. Therefore, from a cost point of view, there is a strong demand for horizontal scalability of the system, without single-point constraints.
C) Only simple key-value reads are required, with no complex join requirements. However, there are very high requirements on the system's concurrency, throughput and response latency, and the system should maintain strong consistency.
D) Writes are usually frequent, especially for the many systems that rely on real-time log analysis.
E) Batch data needs to be read quickly.
F) The schema should be flexible, with columns that can be updated or added frequently.
G) It should be easy to use, with a good Java interface and clear semantics.
On the whole, we concluded that HBase was the more suitable choice. First of all, its data is naturally made redundant by HDFS; the ladder has run stably for three years with 100% data reliability, which has proven the safety of HDFS clusters and their ability to serve massive data. Secondly, HBase's own data read/write service is not limited to a single point; serving capacity grows linearly as servers are added, reaching scales of tens or hundreds of machines. The LSM-tree design gives HBase very good write performance: a single write can usually be completed within 1-3 ms, and performance does not degrade as the data volume grows.
A region (equivalent to a database sub-table) can be dynamically split and moved at the millisecond level, which ensures load balance. Because HBase's data model stores data in rowkey order, a contiguous block of data is read into the cache in one pass; with a good rowkey design, batch reading becomes very easy, and a single IO may be enough to fetch the dozens or hundreds of records a user wants. Finally, most Taobao engineers come from a Java background, so HBase's API is easy for them and the training cost is relatively low.
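As a rough illustration of the rowkey-driven batch read described above (a sketch only: the table name `feeds`, the user id and the stop-row trick are hypothetical, and the API shown is the classic 0.90-era Java client):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class UserFeedScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "feeds");              // hypothetical table

        // Rowkey = userId + separator + timestamp, so one user's records sit
        // contiguously on disk and a prefix scan becomes a single sequential read.
        String userId = "u0001";
        Scan scan = new Scan(Bytes.toBytes(userId + "|"),      // start row (inclusive)
                             Bytes.toBytes(userId + "}"));     // stop row just past the prefix
        scan.setCaching(100);                                   // fetch a batch of rows per RPC

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}
```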
Of course, it must also be pointed out that there is no silver bullet for big data, and HBase itself has scenarios it does not suit. For example, indexing only supports the primary index (or what can be viewed as a composite primary key); and serving is single-point, so some of the data a machine is responsible for will not be served while it is being recovered by the master after that single machine goes down. This requires you to know your own application system well enough when selecting a technology.
3 Applications
We began to study how HBase could be used for online services in March 2011, although there had already been offline use on dozens of nodes in Etao search before that. This is because early versions of HBase targeted offline processing of massive data. The release of version 0.20.0 in September 2009 was a milestone: online applications officially became a goal of HBase, and HBase introduced ZooKeeper to manage the backup master and regionservers. Version 0.90.0, released in 2011, was another milestone. Basically, the major websites we see today, such as Facebook/eBay/Yahoo, run production HBase based on this version (the 0.89 version adopted by Facebook is structurally similar to 0.90.x). With the addition of Bloom filters and many other features, performance improved greatly. On this basis, Taobao also chose the 0.90.x branch as the foundation of its online version.
The first online application was Prom in the Data Cube product. Prom was originally built on Redis; as the data volume kept growing and requirements changed, we used HBase to rebuild its storage layer. Strictly speaking, Prom is better suited to HBase 0.92, because it needs not only high-speed online reads and writes but also complex operations such as count/group by. However, since 0.92 was not yet mature at the time, we implemented the coprocessor functionality ourselves. Prom's data is imported from the ladder, so every night we spend half an hour writing the ladder's data into the HDFS where HBase resides, and a client-side forwarding layer was added at the web tier. After a month of data comparison, it was confirmed that the speed had not dropped significantly compared with Redis and the data was accurate, so it could go online successfully.
The second online application was TimeTunnel (TT), an efficient, reliable and scalable real-time data transmission platform widely used for real-time log collection, real-time data monitoring, real-time feedback of advertising effectiveness, real-time database synchronization and other purposes. Compared with Prom, it is characterized by the addition of online writes. Dynamically growing data poses a great challenge to many HBase features such as compaction/balance/split/recovery. TT writes about 20 TB every day and reads about 1.5 times that amount. For this we prepared a cluster of 20 regionservers; the underlying HDFS is of course shared, and its node count is larger (more on this below). Every day TT creates a separate HBase table for each service and writes data into it. Even with the region size capped at 1 GB, the largest service reaches a scale of thousands of regions, and it is fair to say several splits happen every minute. During the launch of TT we fixed many split-related bugs in HBase, several of which were submitted back to the HBase community, and we also applied some of the latest community patches to our version. Split-related bugs are arguably one of HBase's biggest data-loss risks, and every developer who wants to use HBase must keep them in mind. Because HBase adopts the LSM-tree model, there is almost no possibility of data loss at the level of architectural principle, but there is a real risk of losing data if you are careless in practice; the reasons will be discussed separately later. During TT's pre-release, data was lost due to meta table corruption and split bugs, so we also wrote a separate meta table recovery tool to ensure similar problems do not recur (similar tools have been added in HBase 0.90.5 and later). In addition, because the machine room hosting TT is unstable, there have been many downtime incidents and even hangs, so we also began working on patches to shorten downtime recovery time and to strengthen monitoring.
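As a hedged sketch of the per-service table setup and the 1 GB region cap mentioned above (the table and family names are hypothetical, and whether the cap was applied per table or cluster-wide via hbase.hregion.max.filesize in hbase-site.xml is not stated in the text; the per-table form of the old Java API looks like this):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtTableSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // One table per service per day (the name below is made up).
        HTableDescriptor desc = new HTableDescriptor("tt_some_service_20120301");
        desc.addFamily(new HColumnDescriptor("cf"));
        // Cap region size at 1 GB so a hot table splits into many regions
        // instead of growing a few huge ones.
        desc.setMaxFileSize(1024L * 1024 * 1024);

        new HBaseAdmin(conf).createTable(desc);
    }
}
```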
The CTU and Member Center projects are two projects with high online requirements. In these two projects we focused on HBase's slow-response problem, which has four causes: network issues, GC problems, cache hit rate, and client-side deserialization. We have now worked out solutions for each of them (introduced later), so the slow-response problem is better under control.
Similar to Facebook, we also use HBase as the storage layer for real-time computing projects. Some internal real-time projects have already gone live, such as the real-time page-click system, Galaxy real-time transaction recommendation, and the live broadcast room; the users are small operations teams scattered across the company. Unlike Facebook's Puma, Taobao does real-time computation in multiple ways. For example, Galaxy processes transaction data with an actor model (similar to Akka) and computes rankings (TopN) against dimension tables such as the related-commodities table, while the real-time page-click system is developed on top of Twitter's open-source Storm, fetching real-time log data from TT in the background; the computation saves intermediate results and dynamic dimension tables into HBase. For example, we design the rowkey as url+userid so that real-time data can be read back and real-time UV can be computed across all dimensions.
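A minimal sketch of the url+userid rowkey pattern, assuming one row per (url, user) pair so that a prefix scan over a url counts its UV; the separator, table layout and column names are hypothetical:

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RealtimeUv {
    // Each page view writes one row keyed "url|userid"; the same user hitting the
    // same url just overwrites the row, so rows under a url prefix are distinct users.
    public static void recordView(HTable table, String url, String userId) throws Exception {
        Put put = new Put(Bytes.toBytes(url + "|" + userId));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("ts"),
                Bytes.toBytes(System.currentTimeMillis()));
        table.put(put);
    }

    // UV of a url = number of rows under its prefix ("}" is the byte after "|").
    public static long countUv(HTable table, String url) throws Exception {
        Scan scan = new Scan(Bytes.toBytes(url + "|"), Bytes.toBytes(url + "}"));
        scan.setCaching(1000);
        long uv = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            while (scanner.next() != null) {
                uv++;
            }
        } finally {
            scanner.close();
        }
        return uv;
    }
}
```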
Finally, the historical transaction order item deserves particular mention. This is actually a migration project, moving from the previous solr+bdb solution to HBase. Because it backs the "purchased items" page, users access it very frequently; its importance is close to that of a core application, and it has zero tolerance for data loss or service interruption. On it, compaction is tuned to prevent compactions over large amounts of data during service hours; a user-defined filter was added to implement paged queries; and the rowkey is cleverly designed for the application, which avoids transmitting redundant data and turns more than 90% of reads into sequential reads. The cluster currently stores more than ten billion order records and hundreds of billions of index entries, with an online failure rate of zero.
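The custom paging filter used in this project is not shown in the text; as a hedged sketch of the general paging pattern it implies, the stock PageFilter plus a remembered start row can be combined like this (table layout and helper names are hypothetical):

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class OrderPaging {
    // Returns the last rowkey of this page; the caller appends a 0x00 byte to it
    // and passes that as the next call's startRow to fetch the following page.
    public static byte[] readPage(HTable table, byte[] startRow, int pageSize) throws Exception {
        Scan scan = new Scan(startRow);
        scan.setFilter(new PageFilter(pageSize));   // server-side cap, applied per region
        scan.setCaching(pageSize);

        byte[] lastRow = null;
        int returned = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                if (returned++ >= pageSize) {
                    break;                          // PageFilter is per-region, so re-check here
                }
                lastRow = r.getRow();
                // ... render this order item ...
            }
        } finally {
            scanner.close();
        }
        return lastRow;
    }
}
```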
As the business develops, our customized HBase clusters have been applied to more than twenty online applications and hundreds of servers, including real-time recommendation of goods on the Taobao homepage and the Quantum statistics widely used by sellers, and the trend is for usage to keep increasing and to move closer to core applications.
4 Deployment, Operation and Monitoring
Facebook has previously disclosed its HBase architecture, which can be said to be very good. For example, they split the HBase cluster for the messages service into several clusters by user, each cluster with 100 servers, one namenode, and five racks, each rack with one ZooKeeper server. It is an excellent architecture for a service with very large data volumes. For Taobao, because the data volume is far smaller and the applications are not as core, we adopt an architecture of shared HDFS and ZooKeeper clusters. We try not to exceed 100 machines per HDFS cluster (to limit the namenode single-point problem as much as possible), and set up several HBase clusters on top of it, each with one master and one backup master. The advantage of shared HDFS is that it minimizes the impact of compaction and amortizes hard-disk costs: there are always clusters with high disk-space requirements and clusters with low ones, so mixing them is more cost-effective. The ZooKeeper cluster is shared as well, with each HBase cluster registered under a different root node on ZK; ZK's permission mechanism guarantees the independence of the HBase clusters. ZK is shared purely for ease of operation and maintenance.
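A minimal sketch of how a client picks one of several HBase clusters sharing a ZooKeeper ensemble; `hbase.zookeeper.quorum` and `zookeeper.znode.parent` are the standard configuration keys, while the host names and znode path here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClusterSelect {
    public static HTable connect(String tableName) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // All HBase clusters share one ZooKeeper ensemble ...
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");        // hypothetical hosts
        // ... but each cluster lives under its own root znode, so clients
        // are isolated by path (and by ZK's permission mechanism).
        conf.set("zookeeper.znode.parent", "/hbase-cluster-a");    // hypothetical path
        return new HTable(conf, tableName);
    }
}
```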
Because these are online applications, operation and monitoring become especially important. Since our prior experience was close to zero and it is hard to recruit dedicated HBase operations staff, our development and operations teams attached great importance to this issue from the beginning and started cultivating people very early. Below is our experience in operations and monitoring.
An important part of our customized HBase work is adding monitoring. HBase itself can emit ganglia monitoring data, but the monitoring items are far from sufficient, and ganglia's display is not intuitive or prominent enough. Therefore, on the one hand, we added many monitoring points to the code, such as the queues and per-stage time consumption of compaction/split/balance/flush, the response time of each stage of reads and writes, read/write counts, region open/close events, and table-level and region-level read/write counts. These are still sent to ganglia over a socket, and ganglia records them into rrd files. A property of rrd files is that the precision of historical data gets lower and lower over time, so we wrote our own program to read the corresponding data from rrd and persist it elsewhere. We then implemented a monitoring interface in JavaScript that summarizes and displays the data we care about in various ways, such as trend charts and pie charts, so that we can view any historical data without losing precision. At the same time, some very important data, such as read/write counts and response times, is written into a database so that custom alerts, such as fluctuation alarms, can be implemented. Through the above measures we can always discover cluster problems before our users do and fix them in time. We use Redis's efficient sorting to rank the read/write counts of every region in real time, so that under high load we can find the regions with particularly heavy requests and move them to idle regionservers. During peak periods we can sort hundreds of thousands of regions across hundreds of machines in real time.
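A hedged sketch of the Redis-based region ranking, assuming a sorted set keyed by request count and the Jedis client; the Redis host and key names are hypothetical:

```java
import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;

public class RegionRanking {
    private final Jedis jedis = new Jedis("monitor-redis-host");   // hypothetical host

    // Called as per-region read/write counters are collected.
    public void report(String regionName, long requestCount) {
        jedis.zadd("region:requests", requestCount, regionName);   // sorted set keyed by count
    }

    // Top-N hottest regions: candidates to move to an idle regionserver.
    public void printHottest(int n) {
        Set<Tuple> hottest = jedis.zrevrangeWithScores("region:requests", 0, n - 1);
        for (Tuple t : hottest) {
            System.out.println(t.getElement() + " -> " + (long) t.getScore());
        }
    }
}
```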
To isolate the impact of applications from one another, we can, at the code level, inspect the connections coming from different clients and cut off certain clients' connections, so that when a fault occurs it is confined to one application rather than amplified. MapReduce applications are also restricted to running during off-peak hours; for example, we turn off the jobtracker during the day.
In addition, to verify service availability from the results side, we also regularly run read/write tests, table-creation tests, hbck, and similar commands. hbck is a very useful tool, but it is also a heavyweight operation, so the number of hbck invocations should be kept down and hbck should not be run in parallel. Before 0.90.4, hbck had some chance of taking HBase down. Likewise, to ensure the health of HDFS, fsck needs to be run regularly to check HDFS status, such as the replica counts of blocks.
We track the logs of all online servers every day, pick out all the error logs and email them to the developers, and identify the cause and fix for each error, until errors are reduced to zero. In addition, if any hbck result shows a problem, it is also emailed to the developers for handling. Although not every error causes a problem, and most errors are just normal phenomena in a distributed system, it is very important to understand why they happen.
5 Testing and release
Because this was an unfamiliar system, we paid great attention to testing from the beginning. Testing was split into performance testing and functional testing from the start. Performance testing is mainly benchmark testing, covering many scenarios, such as different mixed read/write ratios, different k/v sizes, different column families, different hit rates, with or without pre-splitting, and so on. Each run lasts several hours to obtain accurate results. So we wrote an automation system: different scenarios are selected from a web page, and the back end automatically passes the test parameters to the servers for execution. Since what is being tested is a distributed system, the client side must be distributed as well.
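A minimal sketch of one client thread of such a parameterized benchmark, assuming the read ratio and value size are the scenario parameters being varied; the table and column names are hypothetical:

```java
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MixedWorkload {
    // One client thread of a distributed benchmark; readRatio and valueSize
    // stand in for the scenario parameters pushed out by the web console.
    public static void run(String tableName, double readRatio, int valueSize, long ops)
            throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, tableName);
        Random rnd = new Random();
        byte[] value = new byte[valueSize];
        long totalNanos = 0;

        for (long i = 0; i < ops; i++) {
            byte[] row = Bytes.toBytes(String.format("row%010d", rnd.nextInt(1000000)));
            long start = System.nanoTime();
            if (rnd.nextDouble() < readRatio) {
                table.get(new Get(row));                       // read path
            } else {
                Put put = new Put(row);
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
                table.put(put);                                // write path
            }
            totalNanos += System.nanoTime() - start;           // feed a histogram in a real run
        }
        System.out.println("avg latency (us): " + totalNanos / ops / 1000);
        table.close();
    }
}
```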
Our criterion for judging whether a test is accurate is whether the same scenario, run multiple times, produces data and run curves that coincide to more than 99%. This work is very tedious and consumes a lot of time, but later facts proved it to be very meaningful. Because we established 100% trust in the results, even a mere 2% performance improvement from a later change could be captured accurately; and, for example, a code change once caused some bumps on the compaction queue curve, which we noticed and thereby found a bug in the program, and so on.
Functional testing is mainly interface testing and fault (exception) testing. Interface testing is generally of limited value, because HBase's own unit tests already cover that part. Fault testing, however, is very important: most of our bug fixes were found through fault testing, which helped us remove many potential sources of instability in the production environment. We also submitted more than a dozen corresponding patches to the community, which received attention and were committed. The difficulty and complexity of distributed system design lies in exception handling; the system must assume that communication can be unreliable at any time. For some problems that are hard to reproduce, we reproduce them by inspecting the code to roughly locate the problem and then forcibly throwing exceptions at the code level; this has proven very useful.
To locate problems quickly and conveniently, we designed a log collection and processing program that can easily grab the relevant logs from the various servers and aggregate them according to certain rules. This is very important to avoid wasting a lot of time logging into different servers to hunt for clues about a bug.
Due to the continuous development of the HBase community and the new bugs found in online or test environments, we needed a regular release model. We had to avoid the instability caused by releasing too often, while also avoiding the production version drifting further and further away from the development version, or hidden bugs accumulating and then erupting. We require that a version be cut from the internal trunk every two weeks; it must pass all tests, including regression tests, and then run uninterrupted on a small cluster for 24 hours. Once a month, a latest version is released and rolled out to existing clusters in order of importance, to ensure that important applications are not affected by potential bugs in the new version. Facts have proven that since we introduced this release mechanism, the instability caused by releases has been greatly reduced, and the online version stays reasonably close to the development version.
6 Improvement and optimization
Facebook is a very respectable company: they published all their changes to HBase without reservation and opened the version they actually use internally to the community. One important characteristic of Facebook's online use is that they turn off split to reduce the risks it brings. Unlike Facebook, Taobao's business data is not as enormous, and because of the rich variety of applications, we do not force users to turn off split; instead we try to fix the possible bugs in split. So far, although the problem cannot be said to be completely solved, the many split- and downtime-related bugs exposed since 0.90.2 have been fixed to near zero in our test environment, and about ten stability-related patches have been submitted to the community. The most important ones are as follows:
These fixes were backported to the 0.90 branch, so after the community released five bugfix versions from 0.90.2 to 0.90.6, version 0.90.6 is actually relatively stable; we suggest that production environments consider this version.
Split is a heavy transaction, and it has a serious problem: it modifies the meta table (downtime recovery does as well). If an exception occurs during this period, it is very likely that the meta table, regionserver memory, master memory and the files on HDFS become inconsistent, which causes errors when the region is later reassigned. One such error is that the same region may be served by two or more regionservers, so data for that region may be written randomly to multiple regionservers; reads will then go to only one of them, making data appear lost. To restore a consistent state, the region must be deleted from one of the regionservers, which amounts to actively deleting data, and thus data loss.
The slow-response problems mentioned above can be summarized as network issues, GC problems, hit rate and client-side deserialization. Network problems are generally caused by network instability, but may also come from TCP parameter settings; packet delay must be reduced as much as possible, for example nodelay needs to be set to true. We pinpointed these problems with tools such as tcpdump and proved that TCP parameters do cause slow packet assembly on connections. GC tuning depends on the application type; generally speaking, in read-heavy applications the young generation should not be set too small. The hit rate strongly affects response time; we try to set the number of versions to 1 to increase cache capacity, and good balancing also helps every machine make full use of its hit rate. For this we designed a table-level balance.
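A hedged sketch of the client/schema knobs mentioned here: `hbase.ipc.client.tcpnodelay` and `HColumnDescriptor.setMaxVersions` are standard in the 0.90-era API, while the table and family names are hypothetical and the snippet is not the article's actual configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class LatencyTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Avoid Nagle-induced packet delay on the client RPC connection.
        conf.setBoolean("hbase.ipc.client.tcpnodelay", true);

        // Keep only one version per cell so the block cache holds more distinct
        // rows, raising the hit rate (table/family names are hypothetical).
        HTableDescriptor desc = new HTableDescriptor("member_profile");
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        cf.setMaxVersions(1);
        desc.addFamily(cf);

        new HBaseAdmin(conf).createTable(desc);
    }
}
```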
Because HBase serving is single-point, when a machine goes down, the data it serves cannot be read or written until recovery completes, so the speed of downtime recovery determines the availability of our service. A number of optimizations were made for this. First, ZK's detection of a dead server is shortened to within one minute as far as possible. Second, log recovery on the master was improved to run in parallel, which greatly increases the speed at which the master recovers logs. Then we fixed some timeout exceptions and deadlocks that could appear in the open handler, and removed exceptions such as "open...too long" that could appear in the log, fixing the problem that HBase could previously take ten-odd minutes or even half an hour to restart after a crash. In addition, at the HDFS level, we shortened socket.timeout and retry times to reduce the long blocking caused by datanode downtime.
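A hedged sketch of the timeout-related settings this paragraph refers to; `zookeeper.session.timeout` is the standard HBase key and the dfs.* keys are Hadoop-era keys, but the concrete values are illustrative rather than the article's actual settings (in practice these would normally live in hbase-site.xml / hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RecoveryTuning {
    public static Configuration tuned() {
        Configuration conf = HBaseConfiguration.create();
        // Detect a dead regionserver within about a minute instead of waiting
        // for a long default session expiry (value is illustrative).
        conf.setInt("zookeeper.session.timeout", 60000);

        // Fail away from a dead datanode quickly instead of blocking on long
        // socket timeouts and retries (keys from the Hadoop of that era).
        conf.setInt("dfs.socket.timeout", 30000);
        conf.setInt("dfs.datanode.socket.write.timeout", 30000);
        return conf;
    }
}
```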
At present we have not done much work to optimize HBase's own read and write paths; the only patch addresses the severe drop in write performance as the number of regions increases. Because HBase itself performs well, we found good parameter settings for the various application scenarios through extensive testing and applied them to the production environment, which basically meets the requirements. But this is our next important piece of work.
7 Future plans
We currently maintain a customized version of HBase at Taobao based on the community 0.90.x branch. Next, in addition to continuing to fix its bugs, we will maintain a modified version based on 0.92.x. The reason is that the compatibility between 0.92.x and 0.90.x is not very good: 0.92.x contains a great deal of modified code, roughly estimated at over 30%, and 0.92 has some features that we value highly.
Version 0.92 upgrades HFile to HFile v2. The v2 format is a major redesign of the index and Bloom filter, in order to support single large HFiles. With the existing HFile format, once a file reaches a certain size the index occupies a lot of memory and file-loading speed drops sharply; but if HFiles cannot grow, regions cannot be made larger, which leads to a very large number of regions. That is something we hope to avoid as much as possible.
Version 0.92 improves the communication-layer protocol by adding a length field at the communication layer, which is very important: it allows us to write an NIO client, so that deserialization no longer affects client performance.
Version 0.92 adds the coprocessor feature, to support the small number of applications that want to run logic on the regionservers.
There are many other optimizations as well, such as improvements to the balance algorithm, the compaction algorithm and the scan algorithm, moving compaction to the column-family level, doing DDL operations dynamically, and so on.
Besides version 0.92, version 0.94 and the latest trunk (0.96) also have many good features; 0.94 is a performance-optimization release. It does a lot of revolutionary work, such as removing the root table, compressing the HLog, and supporting multiple slave clusters in replication.
We also have some optimizations of our own, such as our own secondary indexes and backup strategy, which will be implemented in our internal version.
It is also worth mentioning that optimization at the HDFS level is very important as well. The improvements in hadoop-1.0.0 and cloudera-3u3 are very helpful to HBase, such as local reads, checksum improvements, the datanode keepalive setting, the namenode HA strategy, and so on. We have an excellent HDFS team supporting our HDFS work, for example locating and fixing some HDFS bugs, providing advice on HDFS parameters, and helping implement namenode HA. The latest tests show that 3u3's checksum improvement plus local reads can at least double random read performance.
One meaningful thing we are doing is monitoring and adjusting regionserver load in real time, so that servers with spare capacity can be moved dynamically from a lightly loaded cluster into a cluster under heavy load, with the whole process completely transparent to users.
Overall, our strategy is to cooperate with the community as much as possible, to promote the development of HBase across the whole Apache ecosystem and the industry, so that it can be deployed more stably to more applications with a lower barrier to use and lower cost.