How to handle 430 million records per day in SQL Server
Project background

This is a data center project. Its difficulty surprised me: this project really made me feel that the business world is a battlefield, and I was just one soldier in it. There were too many tactics, too many contests between higher-ups, and too many insider dealings. I'll write another post about the details of this project when I have time.

The project involves environmental monitoring. For now, let's call the monitored equipment acquisition devices, and the attributes of an acquisition device monitoring indicators. The project requirements: the system must support no fewer than 100,000 monitoring indicators, each indicator updates at most every 20 seconds, and the storage delay must not exceed 120 seconds. A simple calculation gives the ideal amount of data to store: 100,000 indicators / 20 s = 5,000 records per second, that is 300,000 per minute, 18 million per hour, 432 million per day. In reality the volume is about 5% larger than that. (Actually most of it is junk data that could be handled by compression, but the other side just wanted to make things hard for us, so what could we do?)

Those are the figures the project required. I imagine many readers experienced in big data processing will scoff: is that all? Well, I had read a lot about big data processing but had never actually dealt with it myself; looking at other people's write-ups, distribution and read/write splitting seemed to make it easy to solve. However, the problem is not that simple. As I said above, this was a very bad project, a typical case of vicious competition in the industry.

There were no extra servers: besides the database and the centralized collector (the program that parses the data, raises alarms, and stores records), this one server also had to support a 300,000-point northbound SNMP interface. Before the program was optimized, CPU usage stayed above 80% all year round. Because the project required dual-machine hot standby, to save trouble and avoid unnecessary complications we put the related services together so as to make full use of HA (an externally purchased HA system).

The accuracy requirement for system data was extreme: from the bottom-level acquisition system all the way up to the top-level monitoring system, not a single data point was allowed to be wrong. Our system architecture is shown below; as you can see, the database is under great pressure, especially at the LevelA node:

The hardware configuration is as follows: CPU: Intel Xeon Processor E5-2609 (4 cores, 2.40 GHz, 10 MB, 6.4 GT/s); Memory: 4 GB (2 x 2 GB) DDR3 RDIMM, 1333 MHz, ECC; Hard disk: 500 GB 7200 RPM 3.5'' SATA3, RAID 5.

The database is SQL Server 2012 Standard Edition, the licensed software supplied by HP, which lacks many of the powerful features of Enterprise Edition.

Write bottleneck

The first obstacle we ran into was finding that, with the existing program, SQL Server simply could not handle this much data. What was the situation, exactly?

Our storage structure

Generally, to store a large amount of historical data we split it into separate physical tables; otherwise millions of records a day turn into hundreds of millions a year. So our initial table structure was as follows:

CREATE TABLE [dbo].[His20140822](
    [No] [bigint] IDENTITY(1,1) NOT NULL,
    [Dtime] [datetime] NOT NULL,
    [MgrObjId] [varchar](36) NOT NULL,
    [Id] [varchar](50) NOT NULL,
    [Value] [varchar](50) NOT NULL,
    CONSTRAINT [PK_His20140822] PRIMARY KEY CLUSTERED ([No] ASC) WITH (PAD_INDEX = OFF)
)

No is the unique identifier, MgrObjId the acquisition device Id (a GUID), Id the monitoring indicator Id (varchar(50)), plus the recording time and the recorded value. The acquisition device Id and the monitoring indicator Id are indexed for fast lookup.
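A minimal sketch of those two secondary indexes (the index names are my own; the post only says the two columns were indexed):

-- Hypothetical index names; the post only states that these two columns were indexed.
CREATE NONCLUSTERED INDEX IX_His20140822_MgrObjId ON [dbo].[His20140822] ([MgrObjId]);
CREATE NONCLUSTERED INDEX IX_His20140822_Id ON [dbo].[His20140822] ([Id]);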

Batch write

We used SqlBulkCopy. Yes, that one, said to write millions of records in seconds:

public static int BatchInert(string connectionString, string desTable, DataTable dt, int batchSize = 500)
{
    // Bulk-insert the whole DataTable into the destination table in batches.
    using (var sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction)
    {
        BulkCopyTimeout = 300,
        NotifyAfter = dt.Rows.Count,
        BatchSize = batchSize,
        DestinationTableName = desTable
    })
    {
        // Map columns by name so the DataTable layout does not need to match the column order.
        foreach (DataColumn column in dt.Columns)
            sbc.ColumnMappings.Add(column.ColumnName, column.ColumnName);
        sbc.WriteToServer(dt);
    }
    return dt.Rows.Count;
}

What's the problem?

The above architecture was fine at 40 million records a day. But once the configuration was raised to the scale described in the background above, the centralized monitoring program ran out of memory. Analysis showed that too much incoming data was being held in memory without being written to the database in time: data was produced faster than it was consumed, memory overflowed, and the program could no longer run.

Where is the bottleneck?

Was it the RAID array? The data structure? The hardware? The SQL Server edition? The lack of a partitioned table? Or a problem in the program itself?

There was only about a week left; maybe one week, and if we couldn't sort it out, the project supervisor would tell us to clear off. So came the feat of working 48 hours straight, and the phone calls to everyone we could think of for help.

But what was needed at that moment was to calm down, and calm down again. The SQL Server edition? The hardware? Neither was likely to be changed for now. The RAID array? I didn't think so. So what was it? It was damned hard to stay calm.

You may not be able to picture how tense the atmosphere on site was; in fact, after all this time it is hard even for me to put myself back into that situation. But I can say this: maybe we have all sorts of methods now, or as outsiders we would have more ideas, but when a project is about to force you to give up, your thinking at that moment can be thrown badly off course by the on-site pressures. It may make you think faster, or stop thinking altogether. In this high-pressure environment some colleagues made even more elementary mistakes; their thinking got completely scrambled and their efficiency dropped further. No sleep for 36 hours, or just two or three hours squatting on the construction site (mud everywhere on rainy days, dried mud once it cleared), and then carrying on like that for a week! And then carrying on some more!

Many people offered many ideas, which seemed useful and yet useless. Wait, why did they seem useful and yet turn out useless? I vaguely felt I had grasped a direction. What was it? Right: we were now running in the field environment, and the fact that there had been no problem before did not mean there was no problem under the current load. Analyzing the impact of such a small piece of functionality inside a large system was too hard, so it had to be decomposed. Right, a unit test, that is, testing a single method. We needed to verify each function: where did each independent step spend its time?

Step-by-step test to verify system bottleneck

Modifying the BulkCopy parameters. First I tried tuning the BulkCopy parameters, such as BulkCopyTimeout and BatchSize. The results only fluctuated within a narrow range and had no real effect; they may influence the CPU a little, but they fell far short of my expectations. The write speed still hovered around 10,000~20,000 records per 5 seconds, far from the requirement of writing 200,000 records within 20 seconds.

Storing by acquisition device. Yes, the structure above records every indicator and value separately. Isn't that wasteful? So would it be feasible to use acquisition device + acquisition time as one record? The question then is how to deal with different acquisition devices having different attributes. Here a colleague got to show his talent: store the monitoring indicators and their values as XML. Wow, it can be done like that? Queries can then use the FOR XML form.

So we ended up with this structure: No, MgrObjId, Dtime, XMLData.
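A rough sketch of what that table could look like (this DDL is my reconstruction, not from the original post; the XMLData column type is an assumption):

CREATE TABLE [dbo].[HisXml20140822](
    [No] [bigint] IDENTITY(1,1) NOT NULL,
    [MgrObjId] [varchar](36) NOT NULL,  -- acquisition device Id
    [Dtime] [datetime] NOT NULL,        -- acquisition time
    [XMLData] [xml] NOT NULL            -- indicator Id / value pairs packed as XML
)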

The test results were slightly better than with the structure above, but not noticeably so.

Partitioned tables? I hadn't learned that technique at the time. I read some articles about it online; it looked fairly complicated, time was running out, and I didn't dare try it.

Stopping the other programs. I knew this could never be the solution, since the software and hardware architecture could not be changed for the time being, but I wanted to verify whether these factors had an effect. It turned out the improvement was indeed noticeable, but it still did not meet the requirements.

Was it a SQL Server bottleneck? With no way forward, I wondered: is this SQL Server's bottleneck? I looked it up online; it might be an I/O bottleneck. Damn it, what then, upgrade the server and switch databases? Would the project owner even agree to that?

Wait, there seemed to be one more thing: the index, right! The presence of an index affects inserts and updates.

Delete index

Yes, queries would certainly be slow after deleting the index, but I had to verify first whether removing it would speed up writes. So I decisively deleted the indexes on the MgrObjId and Id fields.
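A minimal sketch of that step, reusing the hypothetical index names from earlier (the post does not give the real names):

-- Hypothetical index names; adjust to whatever the real indexes are called.
DROP INDEX IX_His20140822_MgrObjId ON [dbo].[His20140822];
DROP INDEX IX_His20140822_Id ON [dbo].[His20140822];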

After running it again, a miracle appeared: each batch of 100,000 records could be written within 7~9 seconds, which met the system requirement.

How to solve the query problem?

A single table would need to hold more than 400 million records a day, and querying that without an index is impossible. What to do!? I thought of our old method: splitting into physical tables. Before we split by day; now we split by hour. That gives 24 tables, each storing only about 18 million records.

Then I queried the history of one indicator over an hour or several hours. The result: slow! Slow!! Slow!!! Querying that many records without an index is simply unthinkable. What else could be done?

Keep splitting, I thought. We could split further by the underlying collector, because each collector manages different acquisition devices, and when we query a historical curve we only need the history of a single indicator, so the data can be spread across different tables.

As a result, splitting by 10 collectors and then by 24 hours produces 240 tables every day (a history table name looks like his_001_2014112615).
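A sketch of how such a per-collector, per-hour table might be created on the fly (my own illustration under the naming scheme above; the post does not show this code):

-- Hypothetical: build the table name, e.g. his_001_2014112615, and create the
-- table without any index so that writes stay fast.
DECLARE @collector char(3) = '001';
DECLARE @tableName sysname = 'his_' + @collector + '_' + FORMAT(GETDATE(), 'yyyyMMddHH');
DECLARE @sql nvarchar(max) = N'
CREATE TABLE dbo.' + QUOTENAME(@tableName) + N'(
    [No] bigint IDENTITY(1,1) NOT NULL,
    [Dtime] datetime NOT NULL,
    [MgrObjId] varchar(36) NOT NULL,
    [Id] varchar(50) NOT NULL,
    [Value] varchar(50) NOT NULL
);';
EXEC sys.sp_executesql @sql;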

Query optimization

With the problems above solved, half of this project's difficulty was dealt with, and the project supervisor could hardly find fault anymore. Who knows what tactical arrangement they would come up with next.

A long time passed, and now, near the end of the year, the problems came again. The idea was to drag you down so you couldn't take on other projects at year end.

This time the demand was as follows: the 100,000 monitoring indicators above were all simulated, but now the system was actually online with only about 50,000 devices. That obviously cannot satisfy the tender requirements and cannot pass acceptance. So what to do? These clever people figured: since the number of monitoring indicators is halved, let's halve the time, and the requirement is met, right? That is, with the current 50,000 devices, data had to be stored within 10 seconds. Damn, by that logic, if we had only 500 monitoring indicators, should we store them within 0.1 seconds? Don't you care about the feelings of the monitored devices?

But if others want to mess with you, what can you do? Take the hit. Sure enough, after the interval was cut to 10 seconds, the problem arrived. Looking carefully at the logic above: tables are split by collector, and now there are fewer collectors but more devices under each one. Writing could still be supported, but each table's record count approached 4 million, and for acquisition devices with many monitoring indicators it approached 6 million. How to crack this?

So the technicians held a meeting to discuss relevant measures.

How to optimize a query without an index?

A colleague suggested that the order of the WHERE clause conditions would affect query performance: if the conditions are processed in the order you write them, some data can be filtered out first and the next condition applied to what remains. It sounds reasonable, but doesn't the SQL Server query optimizer do this automatically? Forgive me for being a novice here, this was just gut feeling; surely it gets optimized automatically, like the VS compiler does.

Either way, let the facts speak:

After the colleague modified the client, the test feedback was that it improved a great deal. I took a look at the code:

Could it really have such a big impact? Wait, did he forget to clear the cache and see an illusion? So I asked the colleague to run the following statements to get more information:

-- Before optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
SELECT Dtime, Value FROM dbo.his20140825
WHERE Dtime >= '' AND Dtime <= '' AND MgrObjId = '' AND Id = ''
SET STATISTICS IO OFF

-- After optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
SELECT Dtime, Value FROM dbo.his20140825
WHERE MgrObjId = '' AND Id = '' AND Dtime >= '' AND Dtime <= ''
SET STATISTICS IO OFF

The results are as follows:

Better before optimization?

Looking closely at the I/O statistics, the read-ahead reads are the same, meaning the data records being queried are identical, and the physical reads and table scans are the same too. Logical reads differ slightly, which should come from the cache hit counts. In other words, without an index, the order of conditions in the WHERE clause has no obvious effect on query performance.

So the only option left was to use an index.

An attempt to create an index

Creating an index is not a trivial matter; it requires some background knowledge. In this process I took many detours before finally getting the index right.

The following experiments were all verified against the following total record count:

Indexing a single field. This idea was mainly influenced by the data structure I had built: the structure in memory was a dictionary.

First I built an index on MgrObjId: 550 MB in size, taking 5 minutes 25 seconds to create. The result, just as the estimated execution plan above suggested, didn't help at all; it was actually slower.

Indexing on multiple conditions. Since a single field didn't work, how about indexing on multiple conditions? CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008 (MgrObjId, Id, Dtime).

Sure enough, the query speed did double:

Wait, is that all there is to the beauty of indexing? 7 minutes 25 seconds and 1.1 GB of space traded for this? There had to be something wrong, so I started digging through material and reading related books, and in the end made good progress.

Correct index

First, we need to understand a few key points about indexes:

After an index is built, ordering by the index field with the least repetition gives the best effect. Taking our table as an example: if a clustered index on No exists, it is best to put No first in the WHERE clause, then Id, then MgrObjId, and finally the time; and if the table only spans one hour, it is best not to use the time at all.

The order of the WHERE clause determines whether the query optimizer uses the index. For example, with an index built on MgrObjId and Id, WHERE MgrObjId = '' AND Id = '' AND Dtime = '' will use an index seek, whereas WHERE Dtime = '' AND MgrObjId = '' will not necessarily do so.

Put non-indexed result columns into the included columns. Since our query conditions are MgrObjId, Id, and Dtime, and the result only needs to return Dtime and Value, putting Dtime and Value into the included columns means the index result already carries those values, and optimal speed is reached without a lookup into the underlying table.

Following the principles above, we built the following index: CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008 (MgrObjId, Id) INCLUDE (Value, Dtime).

It took a little over 6 minutes to build, and the index size is 903 MB.

Let's look at the estimated execution plan:

As you can see, the index is fully used here, with no extra cost. The actual execution took less than 1 second: the result was filtered out of 11 million records within a second!! Awesome!!

How to apply the index?

Now that writing and querying were each sorted out, how do we combine them? Data from an hour ago can be indexed, but the data of the current hour, still being written, cannot. In other words, do not create the index at table-creation time!!
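A minimal sketch of that idea under the naming scheme above: once the hour rolls over, build the covering index on the table that has just finished filling (the scheduling itself, e.g. via SQL Server Agent, is omitted; names are assumptions):

-- Hypothetical: index the previous hour's table after writes to it have stopped.
DECLARE @collector char(3) = '001';
DECLARE @prevHour sysname = 'his_' + @collector + '_' + FORMAT(DATEADD(HOUR, -1, GETDATE()), 'yyyyMMddHH');
DECLARE @sql nvarchar(max) = N'
CREATE NONCLUSTERED INDEX Idx_' + @prevHour + N'
ON dbo.' + QUOTENAME(@prevHour) + N' (MgrObjId, Id)
INCLUDE (Value, Dtime);';
EXEC sys.sp_executesql @sql;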

How to optimize?

You can try read/write splitting and write to two databases: a real-time database and a read-only database. Queries for data within the last hour go to the real-time database; queries for data older than an hour go to the read-only database. The read-only database is loaded on a schedule and then indexed; data older than a week is analyzed and processed before being stored. That way, whatever time range is queried, it can be handled correctly: the real-time database for the last hour, the read-only database for one hour to one week ago, and the report database for anything older than a week.

If physical table splitting isn't needed, you can simply rebuild the index in the read-only database on a regular schedule.
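A sketch of what the scheduled transfer into the read-only library might look like (the database and table names here are my own assumptions, not from the post):

-- Hypothetical: copy the hour that just closed from the real-time database into the
-- read-only database, where it can then be indexed as shown above.
DECLARE @hourStart datetime = DATEADD(HOUR, DATEDIFF(HOUR, 0, GETDATE()) - 1, 0);
DECLARE @hourEnd datetime = DATEADD(HOUR, 1, @hourStart);

INSERT INTO ReadOnlyDb.dbo.HisArchive (Dtime, MgrObjId, Id, Value)
SELECT Dtime, MgrObjId, Id, Value
FROM RealtimeDb.dbo.HisCurrent
WHERE Dtime >= @hourStart AND Dtime < @hourEnd;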

Summary

Handling hundreds of millions of records (historical data) in SQL Server can be done as follows:

Delete all indexes from the table.

Insert using SqlBulkCopy

Split into multiple tables or partitions to reduce the total data volume of each table.

Don't build an index until the table is completely written.

Specify index fields correctly.

Put the fields you need to return into the included columns (so the returned index already covers everything).

Only required fields are returned when querying.
