Sentdex Analysis

Sentiment Analysis - What is it?

At the most basic level, sentiment analysis is the attempt to derive the emotion or 'feeling' of a body of text. The field of sentiment analysis and opinion mining usually also involves some form of data mining to collect the text, and it often draws on natural language processing as well.

How does sentiment analysis work?

There are many ways that people analyze bodies of text for sentiment or opinions, but it usually boils down to two methods.

1. "Bag of Words" Model:

This model focuses entirely on individual words, or sometimes short strings of words, and usually pays no attention to context. A bag of words model typically relies on a large list, better thought of as a sort of "dictionary," of words that are considered to carry sentiment. Each word has its own "value" when it is found in text. The values are typically added up, and the result is the sentiment valuation. The exact equation used to combine the values can vary, but the model stays focused on the words themselves and makes no attempt to understand language fundamentals.
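
As a rough illustration of the summing step described above, here is a minimal sketch in Python. The lexicon, its values, and the function names are made up for this example; a real system would use a far larger dictionary and possibly a different combining equation.

```python
import re

# Hypothetical sentiment "dictionary": words and values are illustrative only.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words_score(text, lexicon=LEXICON):
    """Add up the lexicon value of every token; unknown words count as 0."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))

print(bag_of_words_score("I love this phone, the screen is great"))  # 4.0
print(bag_of_words_score("Terrible battery and bad support"))        # -3.0
```

Word order plays no part in the result: shuffling the words of either sentence would give the same score.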

2. Using Natural Language Processing, and the attempt to truly "understand" the text:

This model attempts to have the machine actually understand sentence structure and context, and it is more focused on the order in which words appear. This usually requires the machine to have some understanding of grammar. To get there, Natural Language Processing (NLP) techniques are used to tag parts of speech, named entities, and more, in order to understand the "language" of the text rather than just look for target words.

Which form of sentiment analysis is better?

Both models can end up being very involved and computationally demanding. Both can also reach similar accuracy in the end, so the choice between them usually depends on area of expertise or interest.

The "Bag of Words" models usually rely on a large amount of built-in machine learning, often in the form of neural networks or support vector machines. The idea is to recognize patterns in data in order to assign values to words. Keep in mind, though, that the "Bag of Words" model only ever sees words as "objects" at best. There is never any attempt to understand the grammar or sentence structure behind the text, beyond pre-defined strings and words.

Attempting to truly understand the text, while possibly requiring less training data, is just as difficult. This approach usually relies more on "context" and word "succession," so to speak, without pre-built rules for exact structures. It can wind up using just as much machine learning as the bag of words model, but, in my opinion, the more it does so, the more it strays toward being a "bag of words" model itself.

In a pure "Bag of Words" model, the following two sentences would likely be treated as the same:

"That's true, I am not a fan."

"That's not true, I am a fan."

In the end, it is more likely that a hybrid of these two is going to be the best option.
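
To make the point concrete, the sketch below first checks that the two example sentences contain exactly the same words, then scores them with a toy hybrid: a word lexicon plus one simple negation rule. The lexicon, the negator list, and the rule that negation lasts until the next punctuation mark are all assumptions made for illustration. This is a stand-in for real NLP, not an actual grammar-aware system.

```python
import re
from collections import Counter

# Hypothetical lexicon and negator list, for illustration only.
LEXICON = {"fan": 1.0, "love": 2.0, "hate": -2.0}
NEGATORS = {"not", "never", "no"}

def bag(text):
    """Count each word, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def hybrid_score(text, lexicon=LEXICON):
    """Sum lexicon values, flipping the sign of words that follow a negator
    within the same clause (clauses are split at punctuation)."""
    score = 0.0
    for clause in re.split(r"[,.;!?]+", text.lower()):
        negated = False
        for token in re.findall(r"[a-z']+", clause):
            if token in NEGATORS:
                negated = True
                continue
            value = lexicon.get(token, 0.0)
            score += -value if negated else value
    return score

a = "That's true, I am not a fan."
b = "That's not true, I am a fan."

print(bag(a) == bag(b))   # True: a pure bag of words cannot tell them apart
print(hybrid_score(a))    # -1.0
print(hybrid_score(b))    # 1.0
```

The single negation rule is enough to separate these two sentences, but it would fail on many others, which is where fuller part-of-speech tagging and parsing come in.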

How accurate is sentiment analysis?

In its current state, sentiment analysis is already very accurate. Heavily trained machine learning systems commonly reach around 80% accuracy. Here at Sentdex, we use sentiment analysis for all sorts of things that can also be validated for accuracy. One example is comparing the sentiment about a stock's price to the actual stock price.

How is Sentiment Analysis Accuracy Measured and Trained?

It is important to mention here, however, that sentiment analysis accuracy comes in two forms:

1. Polarity - Was that bit of text "positive" or "negative"? This one is black and white.

2. Degree - HOW positive or negative was it? This is where most disagreement arises, and it is also a grey area in which people will sometimes disagree even on polarity.

There are many ways that accuracy can be measured. One popular method is to use ratings websites. Places like Amazon, IMDB, Rotten Tomatoes, and Google let users rate an item numerically (sometimes with "stars," which still amount to a number) and then write a review. This makes it relatively easy to measure the sentiment of the text and compare it to the rating the user actually gave. From here, not only can the accuracy of the sentiment analysis and natural language processing algorithms be assessed, but machine learning can also take place. This works by systematically locating the patterns that caused errors.
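
The comparison against user ratings might be sketched as follows. Everything here is an assumption made for illustration: the tiny lexicon, the 1 to 5 star scale with 3 treated as neutral, and the four made-up reviews. The function checks polarity only and returns the misclassified reviews so the patterns behind the errors can be inspected.

```python
import re

# Hypothetical lexicon; values are illustrative only.
LEXICON = {"great": 2.0, "love": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}

def score(text):
    """Simple bag-of-words score used as the system under test."""
    return sum(LEXICON.get(t, 0.0) for t in re.findall(r"[a-z']+", text.lower()))

def evaluate(reviews, score_fn=score, midpoint=3):
    """Compare predicted polarity with the polarity implied by the star rating.

    Returns (accuracy, errors), where errors lists (text, stars, score)
    for every review whose predicted sign disagrees with its rating.
    """
    correct, total, errors = 0, 0, []
    for text, stars in reviews:
        if stars == midpoint:
            continue  # a neutral rating has no clear polarity, so skip it
        expected = 1 if stars > midpoint else -1
        s = score_fn(text)
        predicted = 1 if s > 0 else -1 if s < 0 else 0
        total += 1
        if predicted == expected:
            correct += 1
        else:
            errors.append((text, stars, s))
    accuracy = correct / total if total else 0.0
    return accuracy, errors

# Made-up reviews, not real data.
REVIEWS = [
    ("I love it, great value", 5),
    ("Awful build quality", 1),
    ("Not good at all", 2),      # negation that a plain word sum misses
    ("It is fine I guess", 3),   # neutral rating, skipped
]

accuracy, errors = evaluate(REVIEWS)
print(round(accuracy, 2))  # 0.67
print(errors)              # [('Not good at all', 2, 1.0)]
```

Here the one error points straight at a missing negation rule, which is the kind of pattern the text describes locating. Measuring degree would additionally need a mapping from scores onto the star scale, which this sketch does not attempt.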

Will Sentiment Analysis ever be 100% accurate, or close?

Probably not, but that is not necessarily a bad thing. It will not be because people aren't smart enough to eventually build computers that really understand language. Rather, 100% is simply not achievable, since it is rare for even 80% of people to agree on the sentiment of a piece of text.
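
One simple way to put a number on how much people agree is average pairwise agreement between human labelers. The sketch below assumes each labeler has tagged the same items in the same order; the labels are made up for illustration and the function name is hypothetical.

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Average, over every pair of annotators, of the fraction of items
    on which the two annotators gave the same label."""
    pairs = list(combinations(labels_by_annotator, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    rates = [
        sum(x == y for x, y in zip(first, second)) / len(first)
        for first, second in pairs
    ]
    return sum(rates) / len(rates)

# Made-up labels from three hypothetical annotators for five texts.
ANNOTATIONS = [
    ["pos", "neg", "pos", "neg", "pos"],
    ["pos", "neg", "neg", "neg", "pos"],
    ["pos", "pos", "pos", "neg", "pos"],
]

print(round(pairwise_agreement(ANNOTATIONS), 2))
```

If humans only agree with each other at some rate, a system scored against any one person's labels has little room to score meaningfully higher than that rate.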