![]() We can now come to define parameters involved in the problem - the dependent and independent variables.Ĭleaning the Data and Extracting Features This has the shape of a classification problem. The outcomes that we are interested in include food, ambience, service, deals, and worthiness. The words in the review are essentially the causes that affect the outcomes, which are the aspects of the restaurant being good or bad. I started to think of a way for the machine to automatically tell us if the review the is about some certain aspects of the business being good or bad. The problem is that the data is just too large, to understand it without reading. The review data potentially contain a lot of interesting information about the restaurants. I am particularly interested in the reviews of the restaurants. Among the most crowded categories, we can see that Restaurant also receives a lot of attention on the average in terms of number of reviews per business. Less that half of that number is Shopping. From the plot below, we can see that Restaurant is the most popular categories with highest number of reviews. There is a huge number of reviews: 1569264. ![]() The table below shows different kinds of data that we are provided, and the amount of each. The final dataset that we run the classifier on around 10,000 reviews of the 3 restaurants that receive the highest numbers of reviews.įirst, I have a quick look at how big is the dataset that we have. The training and test dataset is reused from the author of, which consists of around 8,000 and 2,000 data points. The final best classifier is the one that uses the Decision Tree algorithm, that gives the precision and recall of 0.59 and 0.56, respectively. I reproduced the classifier that was an ensemble of the binary classifiers, each of which was a Decision Tree or a Random Forest. I used R only for summarizing the results, and R Markdown for writing the report and plotting graphs. My implementation is mostly in Python, making use of libraries such as Scikit-learn for machine learning, NLTK for text processing, pandas for data frame manipulation. My contribution in this analysis includes implementing the framework recommended by the author of, and applying the framework on the dataset provided In the Capstone project to observe the quality trend of the restaurants over time. I approach the problem in a slightly different way, in that while they see the problem as a multi-label classification problem, I see it as a multioutput classification problem. I decided to adopt their question and follow their framework. The problem can be stated as follows.Ĭlassifying users reviews into relevant categories of food, ambience, service, deals, and worthiness, to help people quickly grasp the main ideas of the reviews without reading the whole text.īeside having the good question, the student team also laid out the framework for the analysis, from which I believed I could learn a lot. Also, the answer to the problem would be more useful to both the business owners and the users. ![]() I came across a much more interesting NLP classification problem on the user reviews defined by a student team. Running out of ideas, I went on to do some research on how similar questions were being asked and addressed by more experienced people.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |