Wednesday, November 18. 2009
On Text Data Quality
Seth Grimes posted a fantastic article on Text Data Quality yesterday. A must-read for anyone in this space. The article points to some of the text quality issues I have mentioned in my last two blogs. Text is in a league of its own when it comes to data quality. And the more you have to work with social media generated data, the more you will run into non-standard text and the need for text cleansing. I presented a workshop where I talked about "The Ten Transgressions of Text" at the Text Analytics Summit in June:

1. UPPERCASE/lowercase
2. Miss-spelings
3. A.C.R.O.N.Y.M.S
4. Shrt-hnd or clipped text (e.g. hmm tink nid >2 twitter acs; els msgs all jumbled up btwn personal & thots! dilemma!)
5. Pr☺f@nity
6. !!NOISY TEXT!!
7. /*Punctuation*\
8. ♪ Voice ♫
9. Email / Attachments
10. Poor grammar

Customers ask me if we can automatically remove profanity from documents and, yes, WE CAN! My interest in the sorts of shortened/clipped texts that you get in text messages or via Twitter is huge. There is a lot that text analytics users and vendors can do to work with this data. Terms like "cul8r" (see you later), or "LOL" (laughing out loud / lots of love) could be expanded into their intended forms, mapped to other synonyms (we provide ontologies to handle this), or left as is. When a shortened term can mean several different things depending on context, that's when the linguistics can help. I see a big need for including this new 'language' in standard language dictionaries. Adhering to the standard rules of grammar looks like a thing of the past. As traditional print media loses favor, so will standard grammar in social media (blogs, micro-blogs such as Facebook, Twitter, Bebo etc.). I'm excited to see how other natural language processing technologies will change to accommodate the new breed of user.

Wednesday, November 4. 2009
Google, Bing, Twitter and Instant Web Search
I read an interesting article this morning entitled "Companies race to offer instant Web search, including Twitter" by USA TODAY reporter Jon Swartz. The recent announcements that Google and Bing will include Twitter in search results raise some intriguing questions – and challenges. Ever seen a Tweet like this?
hmm tink nid >2 twitter acs; els msgs all jumbled up btwn personal & thots! dilemma!

What search methodology is going to pick that up? Why, why not and who cares? And what do you do with it anyway? Given that Twitter information is publicly available, it makes sense that natural language search is extremely valuable for finding information, whether you are monitoring brand information or looking for answers. I bet that one large US airline would love to see the many, many YouTube hits about breaking a musical instrument go away! The fact remains that you will get a lot of search results that you still have to deal with. Say you work for Target, and you want to monitor the Twittersphere. Try searching Twitter for the word "target". Results may include target run rates for the recent NZ cricket test match, the US retail store, target marketing references or target practice. A quick search on "target" on thesaurus.com shows target used as both a noun and a verb. Target is also a proper noun. A graphical link from thesaurus.com to visual thesaurus gives a hint of the many ways the word target can be used. Including text analytics capabilities means you can group similar documents together, thus creating categories around the document meaning. This allows you to automatically weed out the information you don't need and get to the information you do need. Furthermore, the inclusion of text analytics in social media analysis means you can analyze comments, determine relevance, find promoters, detractors and much more! Stay tuned as we explore this topic more – it’s not going away!

Friday, October 30. 2009
What's in a name? YOUR BRAND!!!
With the explosion in social media sites turning us all into digital producers as opposed to digital consumers, businesses are left in the unenviable position of needing to rapidly adapt to the new generation of customers who take their opinions global, and in a heartbeat!
These new age digital producers are leaving their opinions on social media sites at an exploding rate. It pays for companies to keep track of what their customers are saying and to take the appropriate action where needed. For companies that are looking to analyze blogs, tweets and the like, having a brand/product name that is commonplace and has multiple parts of speech in a dictionary will make that analysis job much more difficult. Consider the word Jaguar for example. Is it a car? A plane? An animal? A US sports team? An international sports team? A BBC show? A Twitter search on Jaguar may pull back information you aren't so interested in. Conversely, having an unusual name (not ideal according to traditional marketing strategy) means it's easier to find information about you, your company, or your products. The Google name/brand would be a great example. Do a google on "Google" and the results (I did spend a little time wading through these) look to be all about the Google brand!!! Food for thought as you consider product naming going forward. And, who knows, there's probably someone out there who has named their son or daughter "Google", but surely so few of them that they wouldn't impact any brand analysis.

Friday, July 10. 2009
Why customer intelligence will fail without text mining (cont.)
My colleague Mark Chaves, product manager for SAS Customer Intelligence, responded to my earlier post “Why customer intelligence will fail without text mining” with some strong opinions of his own. And remember – he’s a marketing guy, too! Read on:
I agree with Manya’s comments and wanted to add that advertising as a medium through which marketers communicate is evolving, not diminishing. Coming back to my comment about marketers beware, tying promotions to the right kind of consumer reviews could be extremely valuable. Text mining can analyze consumer reviews to help identify the appropriate comments and segment(s) to go after.

Why customer intelligence will fail without text mining
I'm doing my weekly round-up of text mining/unstructured data/information management news. Having lived on several continents, I like to make sure my information hunting is equally intercontinental. Different cultures have different slants on topics.
This morning's search led me to an article posted by the New Zealand Herald entitled "How Can YouTube Survive?" A section of the article mentioned the insider's technology blog TechCrunch and a guest blog post entitled "Why Advertising Is Failing On The Internet" written by Eric Clemons, Professor of Operations and Information Management at the University of Pennsylvania. A very interesting read, and also a very provocative one (as the many comments on the blog post confirm). According to this excerpt from the NZ Herald article, Clemons "argued that the way that we're using the Internet has shattered the whole concept of advertising. We need no encouragement to share our opinions online regarding products and services and offer them star ratings; as a result, we're much more likely to look for personal recommendations from other customers than wait for a gaudy advert to beckon us wildly in the direction of a company website or online store. He claims we don't trust online advertising, we don't need online advertising, but above all we don't want online advertising." Based on my personal Internet shopping habits, I agree! I'd much rather see personal testimony about a product in addition to (or instead of) marketing collateral. This personal testimony has become a new form of marketing. It would serve marketing professionals well to pay attention. Understanding individuals' commentaries about products helps marketers better understand consumer reaction to the four P's of the marketing mix: product, price, placement and promotion. This evolution of marketing influencers is exactly what makes text mining a pivotal technology for this generation. It provides the ability to gauge those huge volumes of Web-based consumer reactions in an automated, consistent manner. And then you can actually do something about it -- or with it!

Wednesday, June 17. 2009
Text Speak
I just posted a tweet to my @ManyaMayes Twitter account. In order to get my message across, in 140 characters or less, I had to shorten my text. This is a very common practise for mobile phone users who send text messages that look a lot like a foreign language. My Mum writes messages that are so clipped that I have trouble deciphering them! As a BlackBerry user, I send email messages but I rarely send SMS messages. I've spent many years making sure I write messages that are easy for audiences to understand. It's going to take me a while to get used to writing clipped text (writing in text speak) as part of my job. It goes against much of my professional training to write like this:

u no wot u no & u don't no wot u don't
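To make the clipped example concrete, here is a minimal Python sketch of expanding text speak with a lookup table plus one context rule for the ambiguous token "no". The table, the cue words and the function name are all illustrative assumptions for this post, not SAS Text Miner functionality, and the rule is far too crude for real data.

```python
import re

# Hypothetical lookup table of unambiguous clipped terms; a real list
# would be much larger and tuned to the data at hand.
TEXT_SPEAK = {"u": "you", "wot": "what"}

# Assumed context rule: read "no" as "know" only when the previous
# token is one of these cue words; otherwise leave it as "no".
KNOW_CUES = {"u", "you", "i", "we", "they", "don't"}


def expand(message: str) -> str:
    """Expand clipped tokens, using the previous token to disambiguate 'no'."""
    # Words (keeping apostrophes) or single punctuation marks.
    tokens = re.findall(r"[\w']+|[^\w\s]", message.lower())
    out = []
    for i, tok in enumerate(tokens):
        if tok == "no":
            prev = tokens[i - 1] if i > 0 else ""
            out.append("know" if prev in KNOW_CUES else "no")
        else:
            out.append(TEXT_SPEAK.get(tok, tok))
    return " ".join(out)


print(expand("u no wot u no & u don't no wot u don't"))
print(expand("no way u no that"))
```

The first call turns the clipped message into ordinary English, while the second leaves the leading "no" alone because nothing before it suggests "know". A blanket replacement of every "no" would get that second message wrong.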
How does text mining handle this? One approach would be to specify synonyms for these clipped terms:

u = you
no = know
wot = what

But "no" and "know" are both valid dictionary entries, so this immediately causes a follow-on problem: surely not all occurrences of "no" should be replaced with "know". Deciding which occurrences of "no" should be replaced with "know" is aided by using additional context from the document. Boolean and linguistic rules can help with this. It can be difficult to solve data quality problems like this, and typically solutions are specific to both the data and the application. For example, the way you would replace R&R would depend on whether the data came from a forum for military personnel talking about upcoming "rest and relaxation", from a warranty report describing "repair and replace" for a defective part, or from somewhere else again...

Thursday, June 11. 2009
Sentiment Analysis Overview
I saw the following comment on Twitter yesterday about sentiment analysis limitations and decided it would make a good topic for a blog update:
@concannon: Can anybody explain to me why automated sentiment analysis is anything more than flaky, snake-oil BS? The technology just isn't ready yet.

I’m going to make a bold statement here: automated sentiment analysis, using the right methodology, is actually superior to human sentiment analysis. Bear with me and read through. The available approaches to analyzing sentiment/satisfaction vary based on the data provided. I would categorize the approaches based on the availability of three types of data:

1. Customer feedback (free-form text) with customer ranked satisfaction (discrete value), like Amazon product reviews.
2. Customer feedback (free-form text) with manually ranked satisfaction (discrete value), where human readers subjectively score the content.
3. Customer feedback only, no ranked satisfaction, as with blog posts and comments.

For the first data type, machine learning algorithms do a good job of measuring overall sentiment (say, +ve/neutral/-ve). Examples of data suitable for this approach are survey data and product review forums. The problem is that not a lot of text is gathered this way (with a purpose in mind). Even if it is, the machine learning algorithms struggle with distinguishing positive elements from negative. It's one thing to know if a customer is dissatisfied; it is another to know about what! For the second data type, where there is no customer ranked satisfaction, it is possible to build a statistical model using a sample of manually ranked documents, then automatically score the remaining unranked documents. Not many companies are willing to do this. It also doesn't truly represent the customer’s opinion - just the reader’s interpretation of what the customer thinks. For the third option, customer opinion with no ranking, you can derive sentiment from the context of the text using natural language processing or NLP. This data is most common and hence so are the approaches to analyzing it.
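For that third case there is no ranking to learn from, so any score has to come from the words themselves. As a rough illustration only, here is a Python sketch of the simplest such scorer, which sums per-word polarity values, next to a variant with a single negation rule. The lexicon, the negator list and the two-token window are made-up assumptions, not any vendor's method.

```python
# Hypothetical polarity lexicon; words and weights are illustrative only.
LEXICON = {"bad": -1, "sucks": -1, "small": -1, "good": 1, "best": 1, "great": 1}
NEGATORS = {"not", "never", "no"}


def tokenize(text):
    """Lowercase and strip surrounding punctuation from each word."""
    return [w.strip(".,!?\"'").lower() for w in text.split()]


def word_sum_score(text):
    """Naive approach: add up the polarity of every known word."""
    return sum(LEXICON.get(w, 0) for w in tokenize(text))


def negation_aware_score(text, window=2):
    """Same sum, but flip a word's polarity when a negator appears
    within `window` tokens before it."""
    tokens = tokenize(text)
    total = 0
    for i, w in enumerate(tokens):
        polarity = LEXICON.get(w, 0)
        if polarity and any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        total += polarity
    return total


mixed = "The keyboard is impossibly small but the display is the best I've seen."
print(word_sum_score("It's not bad"), negation_aware_score("It's not bad"))
print(word_sum_score(mixed))
```

The plain word sum reads "It's not bad" as negative, and the negation rule repairs that one case. The mixed keyboard/display comment still nets out to zero, hiding one complaint and one compliment, which is why analyzing unranked text takes more than a word list.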
It’s not easy, but it’s the sweet spot for gaining value from the massive volumes of consumer generated text. One widely available, cheap technology assigns positive or negative values to individual words, then sums them to get an overall sentiment rating. This approach fails in situations like the following:

"It's not bad" (two negatives that actually suggest a positive)
"I'm not going to say this sucks" (sarcasm or humor)
“The keyboard is impossibly small but the display is the best I’ve seen.” (combination)

The most recent advances in sentiment analysis technology use a combination of techniques: (1) statistics, (2) rule-based definitions and (3) human intervention, e.g. a final review of the machine scoring. The results are less expensive than human-only sentiment analysis, and more consistent. Why? Because the automation adds consistency, while the human verifies the result. Put in the right workflow, this combination also scales far beyond what human readers can score on their own. Teragram, a division of SAS, announced the Teragram Sentiment Analysis Manager at the Text Analytics Summit in early June. More to come on that!

The Phenomenon that is Twitter
I mentioned the buzz around Social Media Analysis (SMA) at the Text Analytics Summit. If we took all the speakers' content and produced a tag cloud, Twitter would have the biggest 'floor space'. I don't think there was a single presentation that did NOT mention Twitter.
While doing some background research for SMA, I ran across an article entitled "State of the Twittersphere" that HubSpot blogged about just this week (that's @HubSpot for the 55.5% of Twitter users that don't follow anyone). There are a lot of really great Twitter usage statistics in this report. It's amazing how many people sign up with Twitter but are very inactive (I have multiple Twitter accounts and one is definitely contributing to inactivity). I'm more interested in those users that are very active. It would be good to connect with other users who post materials similar to my own (like a document recommendation system), and Text Mining can definitely help with this. I'd also like to see something like a “users who posted materials like this also connected with these users" feature, like the recommendations you get from Amazon. Ranking the tweets of users you follow based on content would also be fabulous. Some users post both personal and business related materials. I personally prefer not to read the personal posts (sorry y'all). Having personal tweets, or topics less interesting to me, appear further down the list (if at all) would be another desirable feature... I have a bunch of other recommendations for Twitter product management - as do many other Twitter users. How about using Text Analytics/Text Mining for managing product requirements...

Wednesday, June 3. 2009
Text Analytics Summit Review
I am back in my office after a thoroughly enjoyable time at the annual Text Analytics Summit in Boston. I have to admit I was in my element rubbing shoulders with thought leaders, end users, analysts and press.
Jim Cox and I arrived Sunday afternoon to attend two preconference presentations: "Text Analytics for Dummies" by Conference Chair Seth Grimes of Alta Plana, and a vendor comparison presentation by Nick Patience of technology industry analyst company 451group. The themes dominating the conference were: sentiment analysis, social media analysis, social network analysis, voice of the customer, eDiscovery, Web search, visualization, SaaS and Cloud. We heard keynote presentations:

“Discover and Drive Brand Activity in Social Networks” by Emmanuel Roche, Teragram and Jim Cox, SAS
“A Tale of Two Search Engines – The Evolution of Search Technology and the Role of Social Networking in Marketing” by Usama Fayyad, Open Insights
“Sentiment Analysis” by Bing Liu, University of Illinois

We also saw end user case studies, analyst and end user panels, a Text Analytics Market Report by IDC, vendor presentations and a group of very active roundtable discussions.

Sentiment analysis: key capabilities focused on product and feature level sentiment extraction. Sentiment is also considered a key component of Social Media Analysis.

Social media analysis: while many vendors play in this space, not many provide all the necessary capabilities on their own. Tracking social networks, reach, promoters, detractors, key influencers/key opinion leaders (KOL) and key themes/trends were put forth as valuable.

Voice of the customer / customer feedback continued to play a key role as text input to text analytics models that look to find key issues being reported by customers.

eDiscovery was probably the top text analytics application area at this year’s summit. Several law firms were represented, and the ability to mine legal documents is crucial.

Web search in relation to advertising was shown to be very powerful due to the user indication of intent. Advertising based on Web search and user behavior improves click-through ratio (CTR) by an average of 652%!
Also mentioned was the mammoth effort required to tag massive volumes of rapidly changing Web content. There are numerous Web sites that employ their user bases to do this for them. The new look of Web search goes far beyond providing lists of documents. Document facets, snippets, images, sentiment and more can be derived from search results. Sue Feldman of IDC indicated the Text Analytics and Search market is moving in direct opposition to the current economic market. The analysts represented at the summit all agreed that visualization of huge volumes of text should be an area that all vendors pay more attention to. Other sentiments echoed by the analysts included the desirability of Software as a Service (SaaS) applications and the overwhelming need for Cloud Computing, along with amazement that Text Analytics vendors had not provided it yet. On the whole, conference goers imparted a great amount of valuable information. I will wrap up my commentary with these overheard statements:

“Search doesn’t help you discover things you are unaware of.”
“TA technology can solve problems we don’t even know about yet.”
“Text analytics puts humanity into statistics.” (Thanks to Chris Bowman for that one!)
“The most common search on Monster is: Find me a job!” (followed by another that Blog Administrator refuses to post)
"Missing a piece of a puzzle is frustrating, can anyone spot the missing piece to my wardrobe?" [shoes]

Additional conference commentary can be found on twitter.com under #textsummit. My colleague Anne Milley also summarized Day 1 and Day 2 on our sascom voices blog. Curt Monash, we missed you this year! SAS and Teragram would like to thank conference goers. It was a pleasure seeing you all!
Posted by Manya Mayes in Manya Mayes at 14:28
Defined tags for this entry: conference, sentiment analysis, teragram, text mining, text mining summit
Tuesday, May 5. 2009
Text analytics sales on the up
BI Network columnist Seth Grimes says 2008 global text analytics sales exceeded $350 million and that expected 2009 growth is at least 25 percent, with SAS one of the large players in this specific technology segment: “Market Outlook for Text Analytics”
Wednesday, April 22. 2009
The changing face and pace of text mining
For a rather long time I have been talking of the convergence of text-related technologies such as search, text mining, text analytics, machine learning, voice analysis, video mining, enterprise content management (ECM), business intelligence (BI) and business analytics (BA) etc. The industry continues to change with the merger of three text analytics companies into one this week.
To me, this merger serves to validate SAS' direction in the unstructured space, where our strategy is to take unstructured data right across the platform so organizations can have access to the full depth and breadth of SAS capabilities with a complete range of tools, products and solutions. Some day users will not consider text to be any different from standard structured database fields. Analytic applications will automatically roll up text and other unstructured information. IT departments and Business Reporting users will no longer need to be restricted to partial views on limited data. Data in the future can be gathered from Tweets, emails and dynamic web 2.0 sources, then integrated with the traditional IT data warehouses before being cleansed and analyzed rigorously -- resulting in better decisions and greater impacts. SAS is ready to assist you in this exciting journey – and we applaud those who see the necessity of integration across IT storage, analytics, reporting and line of business users.

Saturday, March 21. 2009
SAS Global Forum: ready, set, GO!
We are looking forward to interacting with those of you that make the annual pilgrimage next week to SAS Global Forum 2009, this year in Washington, DC. Personal preparations here on SAS Campus in Cary this week have included completely removing and reinstalling SAS on my laptop.
I'm more than excited to have the opportunity to interface with SAS users and highlight the additional capabilities Teragram technologies are giving our SAS analytics offerings. In addition, there are a number of SAS Text Miner/SAS Content Categorization presentations for your viewing pleasure (see SGFtextanalyticstalks.pdf). The Teragram booth will be situated right beside the SAS Text Miner booth; enter the exhibit area and turn left -- we are looking forward to seeing you! And don't forget, if you Twitter (as I do), follow me at @manyamayes. To follow a much wider conversation during the conference, use the hash tag #SGF09. See you there!
Posted by Manya Mayes in Manya Mayes at 14:18
Defined tags for this entry: teragram; sas global forum; twitter; demo;
Wednesday, March 11. 2009
Social networks eclipse email in popularity!
Based on research published by Nielsen on Monday, "Social Networking’s New Global Footprint," social networks are now more popular than email. Plenty of customer feedback can be garnered from the Web, and with a lot more ease than from less publicly available email. Yet not all of this information is tremendously useful (see my previous post on the Skittles social media experiment). But, really, who wants to wade through all of that information by hand!?! Discovering and categorizing valuable feedback is something that can be automated using SAS Text Miner and SAS Content Categorization.
Friday, March 6. 2009
Skittles Social Media Experiment meets SAS Text Miner
Early last week, Skittles created a lot of buzz around the relaunch of their web site as part of a social media marketing campaign to direct Twitter comments containing the word "skittles" to the Skittles home page. For a little Friday afternoon frivolity, I decided to download some of the Twitter comments to analyze automatically using SAS Text Miner. Given my experience with analyzing web text, and having read the reports about the Skittles campaign, I was sure I would be subjected to colorful language and other less than savory comments - Web 2.0 at its best AND worst - I was right. Additional related opinion has also been posted by Dave Thomas on his Social Media at SAS blog.
I downloaded 1400 posts about Skittles from the Twitter social networking site. It was not enough to cover all of the campaign and the buzz it created, but it is a start. Some of my initial results show topics about:

-- Vodka Skittles! [I'm hoping they'll make some Margarita ones];
-- Religion, Viagra, Rihanna, and taste the rainbow (although not all together);
-- the campaign itself.

This visual gives you a glimpse at the breadth of information contained in the postings (people really thought their postings were interesting!)... I spray painted over some bad language so as to avoid offending anyone. I plan on exploring the data a little more with Text Miner, then maybe (given time) adding SAS Content Categorization to the mix, allowing me to create a taxonomy using advanced linguistic techniques.

Thursday, December 11. 2008
Text mining and Web search - what you can't avoid!
During the many years I've been looking at customer data with SAS Text Miner, I've run into a few situations where I've wondered if I need danger money and here's why:
I had an interesting experience analysing some consumer security software Web search results of the 'less than savory' kind. Colorful language is common on Web sites aimed at the younger generation (18 yrs +/- ~6 yrs), and colorful Web sites are just plain common. An innocent search can lead you to places you never intended to go. While you are likely to come across this kind of HTML data for text mining/text analytics at some point (like I did), I am always pleased to see that Text Miner creates its own segment for this data, so I can treat it as noise and continue my analysis, focusing on more useful trends.
ABOUT THE TEAM
I'm Manya Mayes, SAS Chief Text Mining strategist. On this blog, my colleagues, friends and I discuss unstructured text and understanding the voice of the customer. Plus a few more things.
