This article shows how to use Python to crawl the hot comments on NetEase Cloud Music. It may serve as a useful reference for anyone interested in web scraping.
Preface
Recently I have been studying text mining. As the saying goes, even the cleverest cook cannot prepare a meal without rice: before you can analyze text, you first need text. There are many ways to obtain it, such as downloading ready-made documents from the Internet, or fetching data through APIs provided by third parties. Sometimes, however, the data we want cannot be obtained directly, because there is no download channel and no API for it. What then? A good option is a web crawler: a program that impersonates a regular user to fetch the desired data. With the efficiency of a computer on our side, we can collect data easily and quickly.
About crawlers
So how do you write a crawler? Many languages can be used, such as Java, PHP, and Python. I personally prefer Python, because it not only ships with powerful networking libraries but also has many excellent third-party ones. Others have already built the wheel, and we can simply use it, which makes writing crawlers very convenient. It is no exaggeration to say that a small crawler can be written in fewer than 10 lines of Python, while other languages require far more code. Being concise and readable is one of Python's great advantages.
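To make the "fewer than 10 lines" claim concrete, here is a toy crawler sketch. It is only an illustration: the title-grabbing regex and the function names are my own choices, not part of the original article.

import re
import requests  # third-party; install with pip install requests

def extract_title(html):
    """Pull the <title> text out of an HTML page (crude regex approach)."""
    m = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return m.group(1).strip() if m else None

def crawl(url):
    """Download a page and return its title."""
    return extract_title(requests.get(url, timeout=10).text)

Calling crawl('http://example.com') would fetch the page and return its title; extract_title itself can be tried offline on any HTML string.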
Okay, without further ado, let's get to today's topic. NetEase Cloud Music has become very popular in recent years. I have been a user for several years, after previously using QQ Music and Kugou. In my own experience, its best features are its accurate song recommendations and its distinctive user comments (for the record, this is not a sponsored post and not an advertisement, just my personal opinion!). Under many songs there are comments that have received huge numbers of likes, and since NetEase Cloud Music put selected user comments on subway ads a few days ago, its comment section has become popular all over again. So I want to analyze NetEase Cloud Music's comments and look for patterns, especially the distinctive characteristics of the hot comments. With this goal, I started crawling them.
Network library
Python has two built-in network libraries, urllib and urllib2, but neither is particularly convenient to use, so here we use the well-regarded third-party library requests. With requests, more involved crawler tasks such as setting proxies and simulating logins can be done in just a few lines of code. If pip is already installed, just run pip install requests to install it.
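As a quick sketch of how requests carries custom headers such as Cookie and Referer, a request can be prepared locally and inspected without touching the network. The URL path and header values here are illustrative, not taken from the article:

import requests

# Illustrative headers of the kind a music-site crawler would send.
headers = {
    'Connection': "keep-alive",
    'Referer': 'http://music.163.com/',
}

# Build and prepare the request without sending it, then inspect
# exactly what would go out on the wire.
req = requests.Request('GET', 'http://music.163.com/api/comments', headers=headers)
prepared = req.prepare()
print(prepared.url)
print(dict(prepared.headers))

Actually sending it would be requests.get(url, headers=headers, proxies=proxies), where proxies maps a scheme such as 'http' to a proxy address.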
The Chinese documentation is available here.

import json
import requests

headers = {
    # Cookie value truncated in the original text
    'Cookie': "...=(organic)|utmcmd=organic; playerid=81568911; __utmb=94650624.23.10.1490672820",
    'Connection': "keep-alive",
    'Referer': '/'
}
# Set proxy server (proxy address omitted in the original text)
proxies = {
}

# Fetch the hot comments of a song
def get_hot_comments(url):
    hot_comments_list = []
    hot_comments_list.append(u"User ID User Nickname User Avatar Address Comment Time Total Likes Comment Content")
    params = get_params(1) # First page
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    hot_comments = json_dict['hotComments'] # Hot comments
    print("***There are %d hot comments!" % len(hot_comments))
    for item in hot_comments:
        comment = item['content'] # Comment content
        likedCount = item['likedCount'] # Total number of likes
        comment_time = item['time'] # Comment time (timestamp)
        userID = item['user']['userId'] # Commenter id
        nickname = item['user']['nickname'] # Nickname
        avatarUrl = item['user']['avatarUrl'] # Avatar address
        # Numeric fields must be converted to str before concatenation
        comment_info = str(userID) + " " + nickname + " " + avatarUrl + " " + str(comment_time) + " " + str(likedCount) + " " + comment
        hot_comments_list.append(comment_info)
    return hot_comments_list
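The field extraction in the hot-comment loop can be exercised offline against a hand-made sample that mimics the 'hotComments' JSON shape. All values below are invented for illustration; the field names follow the structure the article describes:

import json

# Invented sample mimicking the JSON structure returned for hot comments.
sample = json.dumps({
    "hotComments": [{
        "content": "Great song!",
        "likedCount": 12345,
        "time": 1490000000000,
        "user": {"userId": 42, "nickname": "demo",
                 "avatarUrl": "http://example.com/a.jpg"},
    }],
    "total": 1,
})

rows = []
for item in json.loads(sample)["hotComments"]:
    u = item["user"]
    rows.append(" ".join([str(u["userId"]), u["nickname"], u["avatarUrl"],
                          str(item["time"]), str(item["likedCount"]),
                          item["content"]]))
print(rows[0])

Building each row with join over stringified fields avoids the TypeError that comes from concatenating ints (likedCount, time) directly with strings.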
# Capture all comments of a certain song
def get_all_comments(url):
    all_comments_list = [] # Store all comments
    all_comments_list.append(u"User ID User Nickname User Avatar Address Comment Time Total Likes Comment Content") # Header information
    params = get_params(1)
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    comments_num = int(json_dict['total'])
    if(comments_num % 20 == 0):