Background knowledge: binary and ASCII coding
The basic storage unit of a computer is the byte. ASCII is a scheme for encoding common symbols in bytes, and it became popular because of its simple, sensible design. Since a byte has 8 bits, ASCII can encode at most 2^8 = 256 characters. The first 128 (binary 00000000 through 01111111) are called standard ASCII; the remaining 128 (10000000 through 11111111) are called extended ASCII. For example, the Chinese character "Wang" (王) occupies two bytes in the extended range, 205 and 245, which are CD and F5 in hexadecimal and 11001101 and 11110101 in binary. That is, when the computer processes the character "Wang", the information inside the machine is "1100110111110101". As another example, the ASCII code of the capital letter "A" is 65, or 41 in hexadecimal; inside the computer it is actually "01000001".
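You can see these encodings for yourself with a few lines of Python; this is a minimal sketch, relying only on the standard library (which includes a GB2312 codec for the Chinese example):

```python
# The capital letter "A": decimal, hexadecimal, and binary forms.
code = ord("A")
print(code, hex(code), format(code, "08b"))      # 65 0x41 01000001

# The Chinese character "Wang" occupies two bytes in the GB2312 encoding.
wang = "王".encode("gb2312")
print([hex(b) for b in wang])                    # ['0xcd', '0xf5']  (205, 245)
print("".join(format(b, "08b") for b in wang))   # 1100110111110101
```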
Contraction compression
Now that we know the above, let us introduce the principle of "contraction" compression. Contraction means squeezing unnecessary "bits" out of the encoding. For example, if a file contains no Chinese characters, the extended ASCII codes never appear in its content, so the highest bit of every character code is always 0. Knowing this, we can squeeze that bit out. Suppose the file content is ABCDEFGH:
File content: ABCDEFGH
Binary content: 01000001 01000010 01000011 01000100 01000101 01000110 01000111 01001000
Compressed file content: (unprintable bytes that no longer correspond to displayable characters)
Binary content: 10000011 00001010 00011100 01001000 10110001 10100011 11001000
This compression process removes the leading 0 from every byte and regroups the remaining bits 8 at a time, so a file that originally occupied 8 bytes now occupies only 7. As long as the leading 0 is re-inserted during decompression, the file is restored exactly. The technique is especially suitable for compressing digits: the ten Arabic numerals 0 through 9 occupy the ASCII codes 00110000 through 00111001, so the first four bits of each are always "0011" and can all be squeezed out.
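The shrinking step can be sketched in a few lines of Python; this is an illustration of the idea, not any particular tool's code:

```python
def shrink(data: bytes) -> bytes:
    """Pack 8-bit ASCII bytes (high bit always 0) into a 7-bit stream."""
    bits = "".join(format(b, "08b")[1:] for b in data)  # drop each leading 0
    bits += "0" * (-len(bits) % 8)                      # pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def expand(data: bytes, length: int) -> bytes:
    """Re-insert the leading 0 bit to restore the original bytes."""
    bits = "".join(format(b, "08b") for b in data)
    return bytes(int("0" + bits[i * 7:i * 7 + 7], 2) for i in range(length))

packed = shrink(b"ABCDEFGH")
print(len(packed))            # 7 bytes instead of 8
print(expand(packed, 8))      # b'ABCDEFGH'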
Direct compression
The principle of direct compression is the easiest to understand: files inevitably contain runs of consecutive identical characters, such as a line of "※※※※※※※※※※" added at the end of a file. When compressing, you only need to record the symbol and the number of repetitions, and the run can be restored instantly, as in the sketch below.
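A minimal sketch of this run-counting idea in Python (real tools use a more compact record format than a list of pairs):

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Record each character together with how many times it repeats."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("※" * 11)
print(encoded)               # [('※', 11)]: one symbol plus a repeat count
print(rle_decode(encoded))   # the original run of 11 symbols
```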
Dictionary compression
Dictionary compression is the most important and most widely used compression technology. It searches the file for repeated strings, such as "People's Republic of China (PRC)" or "Reform and Opening-up", records them (the record is called the "dictionary"), and then replaces each occurrence in the text with a shorter code. Think about how many times the strings "Windows" and "Microsoft" appear throughout the Windows system, and you will understand why this technology is so effective on the Windows operating system. The scheme is especially well suited to political manuscripts and academic papers.
Dictionary compression works equally well on text files and executable code files, and it subsumes the "direct compression" technique. The popular compression tools ZIP, ARJ, RAR and AIN all adopt it. However, choosing an appropriate dictionary length is critical: setting the dictionary too large or too small seriously hurts the compression ratio, and the compression speed is relatively slow.
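The substitution idea can be sketched with a toy dictionary coder; the ZIP family actually uses the more sophisticated LZ77/LZ78 techniques, and a real coder must also store the dictionary alongside the output, but the principle is the same:

```python
def dict_compress(text: str, phrases: list[str]):
    """Toy coder: replace each recorded phrase with a short 3-character code."""
    dictionary = {phrase: f"\x01{i}\x02" for i, phrase in enumerate(phrases)}
    for phrase, code in dictionary.items():
        text = text.replace(phrase, code)
    return text, dictionary

def dict_expand(text: str, dictionary: dict) -> str:
    """Substitute every code back to its original phrase."""
    for phrase, code in dictionary.items():
        text = text.replace(code, phrase)
    return text

doc = "Microsoft Windows ... Windows ... Microsoft ..."
packed, dictionary = dict_compress(doc, ["Microsoft", "Windows"])
print(len(packed) < len(doc))                  # True: each repeat now costs 3 bytes
print(dict_expand(packed, dictionary) == doc)  # True: fully lossless
```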
Most compression software uses a variety of compression technologies.
Vector compression
Although dictionary compression is powerful, some file contents leave it helpless, for example:
a string of six apparently unrelated Chinese characters (rendered in translation as "Ah, hail, glass, long touch, Dan ingot method")
In fact, these seemingly unconnected characters are intrinsically related: their location codes are 1601, 1702, 1803, 1904, 2005 and 2106 respectively, an arithmetic progression with a common difference of 101. In cases like this, the content can be compressed by finding the mathematical relationship between the codes (a series, an equation, and so on). This kind of compression is called vector compression, and it is a new compression technology.
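A sketch of the idea: once the codes are recognized as an arithmetic progression, three numbers replace the whole list.

```python
def try_series_compress(codes: list[int]):
    """If the codes form an arithmetic progression, store (start, step, count)."""
    step = codes[1] - codes[0]
    if all(b - a == step for a, b in zip(codes, codes[1:])):
        return (codes[0], step, len(codes))
    return None  # no simple rule found; fall back to other techniques

codes = [1601, 1702, 1803, 1904, 2005, 2106]
rule = try_series_compress(codes)
print(rule)  # (1601, 101, 6): three numbers instead of six
start, step, count = rule
print([start + step * i for i in range(count)] == codes)  # True: lossless
```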
Vector compression can sometimes bring unexpected delights. Many people are surprised that FLASH packs so much information into such a small volume; the reason is that FLASH uses vector compression. Remembering the trajectory of a point as an equation takes far less space than remembering every position of that point. On the other hand, current vectorization techniques cannot yet extract a high-fidelity rule from photographs or recordings, so there is still room for the next compression technology.
Lossy compression and VCD
The emergence of VCD owes much to the efforts of the Joint Photographic Experts Group (JPEG). They put forward a brand-new compression standard, indeed a brand-new compression concept, and this concept catalyzed the birth of the MPEG standard and the industrialization of VCD. JPEG image compression treats each 8×8 block of image dots as a processing unit: if the block is almost entirely one color with only a few dots of other colors, the other colors are simply discarded. In theory the compression ratio can reach 64:1. A 64MB file shrinking to 1MB? That really is exciting. To extend the compression effect and broaden the technology's applicability, JPEG was made flexible, allowing users to adjust the size of the processing unit and the degree to which other colors are ignored; this is why JPEG images have a "quality" attribute.
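The article's simplified "discard the rare colors" description can be mimicked with a toy sketch (assuming NumPy is available and the image is a grayscale array; real JPEG instead applies a discrete cosine transform and quantization to each 8×8 block):

```python
import numpy as np

def lossy_blocks(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Toy lossy scheme: replace each 8x8 block by its dominant value.

    One value stands in for 64 dots, matching the theoretical 64:1 ratio;
    this illustrates the idea only and is not the real JPEG algorithm.
    """
    out = image.copy()
    h, w = image.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = image[y:y + block, x:x + block]
            values, counts = np.unique(tile, return_counts=True)
            out[y:y + block, x:x + block] = values[counts.argmax()]
    return out

img = np.full((16, 16), 200, dtype=np.uint8)
img[3, 3] = img[12, 7] = 10        # a few stray dots of another "color"
print(np.unique(lossy_blocks(img)))  # [200]: the stray dots were discarded
```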
The "lossy compression" concept that JPEG introduced also gives the technology some limitations; for example, JPEG is unsuitable for compressing engineering drawings, medical images and similar material. Nevertheless, its practical thinking greatly inspired people: RealPlayer took the lead along this road in achieving real-time online video playback, and the audio of VCD, stripped of its images, gradually evolved into the now-popular MP3 music. (The coding schemes used for sound compression are too complex to discuss in this article.)
Compressing the file gap
Beyond these compression technologies, the DOS/Windows system itself left everyone a compression story. Under DOS/Windows, disk storage space is divided into small allocation blocks, rather than having all files laid out under unified system control as in UNIX or Novell. Although this open disk format is insecure (not safe at all), it is efficient and easy to operate, which may be an important reason why DOS/Windows beat UNIX and Novell in the home and business markets yet always trails them in the server field.

Because each allocation block can be used by only one file, even a file (or the last block of a file) holding a single byte must occupy a whole allocation block. And because only two bytes, that is 16 bits, are used to number the allocation blocks (this allocation mechanism is called FAT16), a partition can contain at most 2^16 = 65536 allocation blocks no matter how big it is. On a 2GB partition, for example, each allocation block is 32KB; once the partition exceeds 2GB, the block must grow to 64KB. Think about it: if a one-byte file takes up a full 64KB, can you just leave that alone? That is why, starting with the Windows 95 OSR2 release, Microsoft introduced FAT32. But even so, the "file gap" still exists.
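The waste is easy to quantify; a small sketch using the cluster sizes from the FAT16 examples above:

```python
import math

def slack(file_size: int, cluster: int) -> int:
    """Bytes wasted at the end of a file's last allocation block."""
    allocated = math.ceil(file_size / cluster) * cluster
    return allocated - file_size

# A 1-byte file on a 2GB FAT16 partition (32KB blocks) vs. a >2GB one (64KB blocks):
print(slack(1, 32 * 1024))   # 32767 bytes wasted
print(slack(1, 64 * 1024))   # 65535 bytes wasted

# 10,000 small files of ~100 bytes each on 32KB blocks:
waste = slack(100, 32 * 1024) * 10_000
print(waste // (1024 * 1024), "MB wasted")   # ~311 MB
```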
To deal with the file gap, Microsoft introduced DBLSPACE in the DOS 6.0 era, later renamed DRVSPACE, which still existed in Windows 95/98/ME. At the time it was said to double the capacity of your hard disk, which excited everyone, but after trying it most people felt cheated. It turned out that Microsoft had simply borrowed "virtual volume" technology from others: at best it reclaims the gaps between files, which is useless to a user who keeps one large file on the whole disk.
Today there is a better way to compress away file gaps: package rarely used files into a single archive with WinZip. Especially with large numbers of small files, and/or on a FAT16 volume, this method can save a great deal of disk space. But whatever you do, the "file gap" seems destined to live forever in the Windows system.
The more you compress, the bigger it gets?
Can a file grow as it is compressed? The answer is: yes. A compressed file needs a header (recording the file format, the dictionary, and so on) to control decompression. When you compress a file that has already been squeezed dry, all you add is another header, so the file naturally gets bigger and bigger. Moreover, although compressed files save space and are more secure (a compressed file can be encrypted, while an ordinary text file cannot), a damaged header means the whole file can no longer be decompressed, so protecting the header of a compressed file is very important; this mirrors the contrast, mentioned above, between the FAT format and the UNIX/Novell volume formats. Should your ZIP file ever be damaged, try the DOS version of the ZIP decompressor, PKUNZIP; perhaps some of it can be saved.
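This is easy to verify with Python's standard zlib library (DEFLATE, the same dictionary-coding family used by ZIP):

```python
import zlib

data = ("People's Republic of China. " * 200).encode()  # highly repetitive text
once = zlib.compress(data)
twice = zlib.compress(once)   # compressing the already-compressed bytes

print(len(data), len(once), len(twice))
# The first pass shrinks 5600 bytes dramatically; the second pass finds no
# repetition left and only adds header overhead, so the file grows:
print(len(twice) > len(once))   # True
```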
Compression of executable files
Not only documents and data files can be compressed; executable files can be compressed too. When PKWARE Inc., a company devoted to compression technology, first released its PKZIP software (around 1990), it consisted mainly of three programs: PKZIP.EXE (for compression), PKUNZIP.EXE (for decompression) and PKLITE.EXE (for compressing executable files). Compressing an executable is remarkable: the file name does not change, only the length shrinks. When such a compressed file is executed, it releases itself in memory, relocates, and then runs. Because the computer completes all this in an instant, you hardly notice the file was ever compressed. This tool was very useful in the era when floppy disks ruled.
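The self-releasing trick can be sketched in Python as a purely illustrative stub that carries a compressed payload, expands it in memory, and only then runs it (PKLITE did the same with machine code and in-place relocation; a toy payload is of course too small for compression to pay off):

```python
import zlib

# The "original program" and its compressed form.
payload = b"print('Hello from the original program!')"
compressed = zlib.compress(payload)

# The stub source: at run time it decompresses the payload in memory and runs it.
stub = (
    "import zlib\n"
    f"payload = {compressed!r}\n"
    "exec(zlib.decompress(payload))  # expand in memory, then execute\n"
)
exec(stub)   # Hello from the original program!
```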
Now that programs under Windows keep growing, many programmers compress their main programs, which also serves as a form of anti-piracy; the famous "Red Alert" took this approach. With the growth of Internet communication software, many programs are packaged as self-extracting executables that expand and install themselves when clicked. These too are examples of executable-file compression.
Dialectical analysis of compression technology
Viewed historically, compression technology is doomed to perish. Look back at the DOS era of ten years ago: the compression people did then to save storage has since been drowned by the capacity of today's mass storage devices. In theory, compression wastes our time and energy; with enough storage space, we have no reason to compress at all. Look at today's reasons for compressing: apart from a small share done for convenient archiving, the bulk of compression exists to cope with the slow transmission speed of the Internet. When network speeds let us drag an entire hard disk's contents across the network at any moment, will we still need compression? When optical discs are large enough, will we still tolerate JPEG losing a color dot or two for us?
However, philosophy teaches us that things keep developing and keep taking on new characteristics. When capacity is no longer the purpose of compression, transmission becomes the next one. Who can foresee whether compression will find yet another purpose, and what it will be?