Enterprises take a keen interest in big data analysis and its impact. Big data analysis is the process of examining large volumes of data to discover useful information such as patterns and correlations, which can help enterprises adapt to change and make better-informed decisions.
First, Hadoop.
Hadoop is an open source framework that allows a whole cluster of computers to store and process big data in a distributed environment using a simple programming model. It is designed to scale from a single server to thousands of machines, each providing local computation and storage.
Hadoop is a software framework capable of distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable way. Hadoop is reliable: it maintains multiple copies of working data so that, even if a computing element or storage fails, processing can be redistributed to healthy nodes. Hadoop is efficient: it works in parallel, accelerating processing through parallelism. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop relies on commodity servers, so its cost is relatively low and anyone can use it.
Hadoop is a distributed computing platform that is easy to set up and use. Users can easily develop and run applications that process massive data on Hadoop. Its main advantages are:
1. High reliability. Hadoop's ability to store and process data bit by bit has proven trustworthy.
2. High scalability. Hadoop distributes data and completes computing tasks among available computer clusters, and can be easily extended to thousands of nodes.
3. High efficiency. Hadoop can dynamically move data between nodes to ensure the dynamic balance of each node, so the processing speed is very fast.
4. High fault tolerance. Hadoop can automatically save multiple copies of data and automatically reassign failed tasks.
Hadoop's framework is written in Java, so it runs ideally on Linux production platforms, and applications on Hadoop can also be written in other languages, such as C++; a minimal Java example is sketched below.
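To make the "simple programming model" concrete, here is a minimal sketch of the canonical MapReduce word count in Java, close to the example in Hadoop's own documentation (input and output paths are taken from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same jar runs unchanged on one machine or on thousands; Hadoop takes care of splitting the input, scheduling the map and reduce tasks, and rerunning them on failure.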
Second, HPCC.
HPCC is short for High Performance Computing and Communications. In 1993, the US Federal Coordinating Council for Science, Engineering and Technology submitted to Congress the report "Grand Challenge Project: High Performance Computing and Communications," known as the HPCC program, the US president's science strategy project, which aims to solve a number of important scientific and technological challenges by strengthening research and development. HPCC is the American plan for implementing the information superhighway, and its implementation will cost tens of billions of dollars. Its main goals are to develop scalable computing systems and related software to support terabit-level network transmission performance, to develop gigabit network technology, and to expand the network connectivity of research and educational institutions.
The project mainly consists of five parts:
1. High Performance Computing Systems (HPCS), including research on future generations of computer systems, system design tools, advanced typical systems and evaluations of original systems, etc.;
2. Advanced Software Technology and Algorithms (ASTA), including software support for grand challenges, new algorithm design, software branches and tools, and computing and high-performance computing research centers, etc.;
3. National Research and Education Network (NREN), including research and development of relay stations and gigabit (billion-bit) transmission;
4. Basic Research and Human Resources (BRHR), including basic research, training, education and course materials, designed to increase the stream of innovative ideas in scalable high-performance computing by rewarding investigators (both initial and long-term investigations), to increase the pool of skilled and trained personnel by improving education, high-performance computing training and exchanges, and to provide the infrastructure necessary to support these investigations and research activities;
5. Information Infrastructure Technology and Application (IITA) aims to ensure the leading position of the United States in the development of advanced information technology.
Third, Storm.
Storm is a free, open source, distributed, highly fault-tolerant real-time computation system. Storm makes continuous stream computation simple and makes up for the real-time requirements that Hadoop's batch processing cannot meet. Storm is often used in real-time analytics, online machine learning, continuous computation, distributed remote procedure calls and ETL. Storm's deployment and management are very simple, and among similar stream computation tools its performance is outstanding.
Storm is free open source software: a distributed, fault-tolerant real-time computation system. Storm can process huge data streams very reliably and can be used where Hadoop would process batch data. Storm is simple, supports many programming languages, and is a pleasure to use. Storm came from Twitter; other well-known adopters include Groupon, Taobao, Alipay, Alibaba, Happy Elements, Admaster and so on.
Storm has many applications: real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call protocol, requesting services from remote programs over the network), ETL (short for extract-transform-load) and so on. Storm's processing speed is impressive: in tests, each node processed one million data tuples per second. Storm is scalable, fault-tolerant, and easy to set up and operate; a minimal code sketch follows.
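As a sketch of what Storm code looks like, here is a minimal bolt written against Storm's Java API (package names as in Storm 1.x and later) that splits incoming sentence tuples into one tuple per word; the spout and topology wiring are omitted for brevity.

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Splits each incoming sentence tuple into one tuple per word.
// BaseBasicBolt acks tuples automatically, which keeps the sketch short.
public class SplitSentenceBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Downstream bolts (e.g. a counter) can subscribe to the "word" field.
        declarer.declare(new Fields("word"));
    }
}
```

In a full topology, a TopologyBuilder would wire this bolt downstream of a sentence-emitting spout, and a counting bolt could subscribe to the "word" field with a fields grouping.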
Fourth, Apache Drill.
To help enterprise users find more effective ways to speed up Hadoop data queries, the Apache Software Foundation recently launched an open source project called "Drill". Drill implements Google's Dremel. "Drill" has been run as an Apache incubator project and will continue to be promoted to software engineers around the world. The project will create an open source version of Google's Dremel Hadoop tool (which Google uses to speed up the Internet applications of its Hadoop data analysis tools), and "Drill" will help Hadoop users query massive datasets faster.
The "Drill" project is actually inspired by Google's Dremel project: this project helps Google to analyze and process massive data sets, including analyzing and crawling Web documents, tracking and installing them on Android.
Application data in the market, analysis of spam, analysis of test results on Google's distributed construction system, etc.
By developing the "Drill" Apache open source project, the organization hopes to establish the API interfaces and the flexible, powerful architecture Drill needs, thereby helping to support a broad range of data sources, data formats and query languages.
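To give a flavor of how a Hadoop user queries data through Drill, here is a minimal sketch using Drill's JDBC driver. It assumes a Drillbit running locally in embedded mode with the driver on the classpath; cp.`employee.json` is a sample dataset Drill ships on its classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        // JDBC 4 auto-registers org.apache.drill.jdbc.Driver from the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // Plain SQL over a JSON file, with no schema declared up front.
             ResultSet rs = stmt.executeQuery(
                     "SELECT full_name FROM cp.`employee.json` LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}
```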
Fifth, RapidMiner.
RapidMiner provides machine learning procedures. Its data mining functionality includes data visualization, processing, statistical modeling and predictive analytics.
RapidMiner is a world-leading data mining solution that makes extensive use of advanced technology. Its data mining tasks cover a wide range and can simplify the design and evaluation of data mining processes.
Functions and characteristics
Provides data mining technology and libraries free of charge; 100% Java code (runs on any operating system); data mining processes that are simple, powerful and intuitive; internal XML that guarantees a standardized format for exchanging data mining processes; automation of large-scale processes with a simple scripting language; multi-level data views that keep data effective and transparent; interactive prototyping in a graphical user interface; a command line (batch mode) for automated large-scale application; a Java API (application programming interface; see the sketch below); a simple plug-in and extension mechanism; a powerful visualization engine with visual modeling of many cutting-edge high-dimensional datasets; support for more than 400 data mining operators. YALE, RapidMiner's academic predecessor, has been successfully applied in many different application fields, including text mining, multimedia mining, feature design, data stream mining, integrated development methods and distributed data mining.
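A hedged sketch of the Java API mentioned above. The class names follow the RapidMiner 5-era API (com.rapidminer.RapidMiner and com.rapidminer.Process) and may differ in later versions; my-process.rmp stands for any process designed in the GUI and saved in the standardized XML format described earlier.

```java
import java.io.File;

import com.rapidminer.Process;
import com.rapidminer.RapidMiner;

public class RunRapidMinerProcess {
    public static void main(String[] args) throws Exception {
        // Run headless, outside the RapidMiner GUI.
        RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.COMMAND_LINE);
        RapidMiner.init();

        // Load a process that was designed in the GUI and saved as XML,
        // then execute it end to end ("my-process.rmp" is a placeholder).
        Process process = new Process(new File("my-process.rmp"));
        process.run();
    }
}
```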
Limitations of RapidMiner: RapidMiner imposes a size limit on the number of rows, and it requires more hardware resources than ODM and SAS.
Sixth, Pentaho BI.
The Pentaho BI platform is different from traditional BI products: it is a process-centric, solution-oriented framework. Its purpose is to integrate a series of enterprise BI products, open source software, APIs and other components to facilitate the development of business intelligence applications. Its appearance allows Jfree, Quartz and a series of other independent business-intelligence-oriented products to be integrated into complex, complete business intelligence solutions.
The core architecture and foundation of the Pentaho BI platform (the Pentaho Open BI suite) is process-centric, because its central controller is a workflow engine. The workflow engine uses process definitions to define the business intelligence processes executed on the platform. Processes can easily be customized, and new processes can be added. The BI platform contains components and reports for analyzing the performance of these processes. At present, Pentaho's main components include report generation, analysis, data mining and workflow management. These components are implemented in the following ways.
The Pentaho platform integrates J2EE, WebService, SOAP, HTTP, Java, JavaScript, Portals and other technologies.
Pentaho is mainly distributed in the form of the Pentaho SDK.
The Pentaho SDK consists of five parts: the Pentaho platform, the Pentaho sample database, a standalone Pentaho platform, the Pentaho solution examples, and a preconfigured Pentaho web server. Among these, the Pentaho platform is the most important part and contains the main body of Pentaho's source code. The Pentaho database provides the data services the Pentaho platform needs for normal operation, including configuration information, solution-related information and so on; it is not strictly necessary for the platform and can be replaced by other database services through configuration. The standalone Pentaho platform is an example of the platform's independent running mode; it demonstrates how to make the Pentaho platform run independently without the support of an application server.
The Pentaho solution example is an Eclipse project that demonstrates how to develop related business intelligence solutions for the Pentaho platform.
The Pentaho BI platform is built on servers, engines and components, which provide the system's J2EE server, security, portal, workflow, rule engine, charting, collaboration, content management, data integration, and analysis and modeling functions. Most of these components are standards-based and can be replaced by other products.
Seventh, Druid.
Druid is a real-time data analysis storage system; in the Java world, the same name also belongs to what is arguably the best database connection pool (Alibaba's Druid). Druid provides powerful monitoring and extension capabilities.
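On the connection-pool side, here is a minimal sketch assuming Alibaba's Druid library (com.alibaba.druid) is on the classpath; the JDBC URL and credentials are placeholders.

```java
import java.sql.Connection;

import com.alibaba.druid.pool.DruidDataSource;

public class DruidPoolExample {
    public static void main(String[] args) throws Exception {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/test"); // placeholder database
        ds.setUsername("root");
        ds.setPassword("secret");
        ds.setInitialSize(5);  // connections opened up front
        ds.setMaxActive(20);   // upper bound on the pool
        ds.setFilters("stat"); // enable Druid's built-in SQL monitoring filter

        try (Connection conn = ds.getConnection()) {
            // Hand the connection to ordinary JDBC code here.
            System.out.println("Got pooled connection: " + conn);
        }
        ds.close();
    }
}
```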
Eighth, Ambari.
Ambari is a powerful tool for building and monitoring big data platforms; CDH is a similar offering.
1. Provisioning a Hadoop cluster
Ambari provides a step-by-step wizard to install Hadoop services on any number of hosts.
Ambari handles the configuration of clustered Hadoop services.
2. Managing a Hadoop cluster
Ambari provides centralized management for starting, stopping and reconfiguring Hadoop services for the whole cluster.
3. Monitoring a Hadoop cluster
Ambari provides a dashboard for monitoring the health and status of Hadoop clusters.
Ninth, Spark.
Spark is a large-scale data processing framework that can handle the three data processing scenarios common in enterprises: complex batch data processing; interactive queries based on historical data; and data processing based on real-time data streams. (Ceph, often mentioned alongside it, is a distributed file system for Linux.)
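A minimal sketch of the first two scenarios, batch processing and interactive SQL queries, in Spark's Java API; the events.json file and its userId field are invented for the example, and local[*] simply runs the job in-process.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkBatchAndQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("batch-and-interactive")
                .master("local[*]") // local mode, for the sketch only
                .getOrCreate();

        // Batch processing: load a (hypothetical) file of historical events.
        Dataset<Row> events = spark.read().json("events.json");
        events.createOrReplaceTempView("events");

        // Interactive query: plain SQL over the same historical data.
        Dataset<Row> topUsers = spark.sql(
                "SELECT userId, COUNT(*) AS n FROM events "
                + "GROUP BY userId ORDER BY n DESC LIMIT 10");
        topUsers.show();

        spark.stop();
    }
}
```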
Tenth, Tableau Public.
1. What is Tableau Public, a big data analysis tool?
It is a simple and intuitive tool that offers intriguing insights through data visualization. Tableau Public has a million-row limit, and it is easier to use than most other players in the data analytics market. With Tableau's visualizations, you can investigate a hypothesis, explore the data, and cross-check your findings.
2. Uses of Tableau Public
You can publish interactive data visualizations to the Web for free; no programming skills are required; visualizations published to Tableau Public can be embedded in blogs, and you can also share pages by email or social media; shared content can be made available for download. This makes it one of the best big data analysis tools.
3. Limitations of Tableau Public
All data are public, with very limited scope for restricting access; there is a data size limit; it cannot connect to R; the only way to read data is via OData sources, Excel or .txt files.
Eleventh, OpenRefine.
1. What is OpenRefine, a data analysis tool?
It is data cleaning software, formerly known as Google Refine, that helps you clean up data for analysis. It operates on rows of data whose cells sit under columns, much like relational database tables.
2. Uses of OpenRefine
Cleaning up messy data; data transformation; parsing data from websites; adding data to a dataset by fetching it from a web service. For example, OpenRefine can be used to geocode addresses into geographic coordinates.
3. Limitations of OpenRefine
OpenRefine is not suitable for large datasets; it does not work well with big data.
Twelfth, KNIME.
1. What is KNIME, a data analysis tool?
KNIME helps you manipulate, analyze and model data through visual programming. It is used to integrate various components of data mining and machine learning.
2. Uses of KNIME
Instead of writing blocks of code, you drag and drop connection points between activities; the data analysis tool supports programming languages; in fact, the analysis can be extended to run chemistry data, text mining, Python and R.
3. Limitations of KNIME
Poor data visualization.
Thirteenth, Google Fusion Tables.
1. What is Google Fusion Tables?
As data tools go, this is a cooler, bigger version of Google Spreadsheets: an incredible tool for analyzing, plotting and visualizing large datasets. Google Fusion Tables also deserves a place on any list of business analysis tools, and it is one of the best big data analysis tools.
2. Uses of Google Fusion Tables
Visualize larger tabular data online; filter and summarize across hundreds of thousands of rows; combine tables with other data on the Web; you can merge two or three tables to generate a single visualization that includes the combined dataset.
3. Limitations of Google Fusion Tables
Only the first 100,000 rows of data in a table are included in query results or mapped; the total size of data sent in a single API call cannot exceed 1MB.
Fourteenth, NodeXL.
1. What is NodeXL?
It is visualization and analysis software for relationships and networks. NodeXL provides exact calculations. It is free (in its non-professional edition), open source network analysis and visualization software, and one of the best statistical tools for data analysis. It includes advanced network metrics, along with importers for social media network data and automation.
2. Uses of NodeXL
This is a data analysis tool in Excel, which can help realize the following aspects:
Data import; graph visualization; graph analysis; data representation. The software integrates into Microsoft Excel 2007, 2010, 2013 and 2016, and opens as a workbook containing a variety of worksheets with the elements of a graph structure, such as nodes and edges. The software can import various graph formats, such as adjacency matrices, Pajek .net, UCINet .dl, GraphML, and edge lists.
3. Limitations of NodeXL
For a specific problem, you need to use multiple seed terms; data extractions run at slightly different times.
Fifteenth, Wolfram Alpha.
1. What is Wolfram Alpha?
It is a computational knowledge engine, or answer engine, created by Stephen Wolfram.
2. Uses of Wolfram Alpha
It is an add-on component for Apple's Siri; it provides detailed responses to technical searches and solves calculus problems; it helps business users access infographics and graphs; and it helps create topic overviews, product information and high-level pricing histories.
3. Limitations of Wolfram Alpha
Wolfram Alpha can only deal with publicly available figures and facts, not opinions; it limits the computation time for each query. These are the drawbacks of this statistical tool for data analysis.
Sixteenth, Google Search Operators.
1. What is a Google search operator?
It is a powerful resource to help you filter Google search results. This will immediately get the most relevant and useful information.
2. Use of Google search operators
Filter Google search results more quickly; Google's powerful data analysis tools can help discover new information.
Seventeenth, Excel Solver.
1. What is Excel Solver?
The Solver add-in is a Microsoft Office Excel add-in program that becomes available when you install Microsoft Excel or Office. It is a linear programming and optimization tool in Excel that allows you to set constraints. It is an advanced optimization tool that helps solve problems quickly; a small worked example follows.
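As a small illustration of the kind of problem Solver handles (the numbers here are invented for the example), consider a linear program in two decision variables:

\[
\begin{aligned}
\text{maximize}\quad & 3x + 5y \\
\text{subject to}\quad & x + 2y \le 14, \\
& 3x - y \ge 0, \\
& x - y \le 2, \\
& x,\ y \ge 0.
\end{aligned}
\]

In Excel, x and y would occupy two changing cells, the objective 3x + 5y would be a formula in the target cell, and each inequality would be entered as a constraint in the Solver dialog; Solver then searches for the cell values that maximize the target.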
2. Uses of Solver
The final values found by Solver are a solution to the relationships and decisions; it employs a range of methods, from nonlinear optimization and linear programming to evolutionary and genetic algorithms.
3. Limitations of Solver
Poor scalability is one aspect in which Excel Solver is lacking, and it can affect solution time and quality; Solver also affects the intrinsic solvability of your model.
Eighteenth, Dataiku DSS.
1. What is Dataiku DSS?
It is a collaborative data science software platform that helps with team building, prototyping and exploration, and lets teams deliver their own data products more efficiently.
2. Uses of Dataiku DSS
This data analysis tool provides an interactive visual interface, so users can build by pointing and clicking, or use languages such as SQL.
3. Limitations of Dataiku DSS
Limited visualization capabilities; UI obstacles when reloading code or datasets; the whole code cannot easily be compiled into a single document/notebook; still needs to be integrated with Spark.
The tools above are only some of those used in big data analysis, and this article will not list them all one by one. Finally, let's classify some of these tools by use:
1. Front-end presentation
Front-end open source tools for presentation and analysis include JasperSoft, Pentaho, Spagobi, Openi, Birt, etc.
Commercial analysis tools for presentation and analysis include Style Intelligence, RapidMiner Radoop, Cognos, BO, Microsoft Power BI, Oracle, Microstrategy, QlikView and Tableau.
Domestic (Chinese) options include BDP, Guoyun Data (Data Magic Mirror), Smartbi, FineBI and so on.
2. Data warehouse
There are Teradata Aster Data, EMC Greenplum, HP Vertica and so on.
3. Data marts
There are QlikView, Tableau, Style Intelligence and so on.