Wednesday, November 18. 2009
On Text Data Quality

Seth Grimes posted a fantastic article on Text Data Quality yesterday. A must-read for anyone in this space. The article points to some of the text quality issues I have mentioned in my last two blogs. Text is in a league of its own when it comes to data quality. And the more you have to work with social media generated data, the more you will run into non-standard text and the need for text cleansing.

I presented a workshop where I talked about "The Ten Transgressions of Text" at the Text Analytics Summit in June:

1. UPPERCASE/lowercase
2. Miss-spelings
3. A.C.R.O.N.Y.M.S
4. Shrt-hnd or clipped text (e.g. hmm tink nid >2 twitter acs; els msgs all jumbled up btwn personal & thots! dilemma!)
5. Pr☺f@nity
6. !!NOISY TEXT!!
7. /*Punctuation*\
8. ♪ Voice ♫
9. Email / Attachments
10. Poor grammar

Customers ask me if we can automatically remove profanity from documents and, yes, WE CAN! My interest in the sorts of shortened/clipped texts that you get in text messages or via Twitter is huge. There is a lot that text analytics users and vendors can do to work with this data. Terms like "cul8r" (see you later), or "LOL" (laughing out loud / lots of love) could be expanded into their intended forms, mapped to other synonyms (we provide ontologies to handle this), or left as is. When a shortened term can mean several different things depending on context, that's when the linguistics can help. I see a big need for including this new 'language' in standard language dictionaries.

Adhering to the standard rules of grammar looks like a thing of the past. As traditional print media loses favor, so will standard grammar in social media (blogs, micro-blogs such as Facebook, Twitter, Bebo etc.). I'm excited to see how other natural language processing technologies will change to accommodate the new breed of user.

Wednesday, November 4. 2009
Google, Bing, Twitter and Instant Web Search
I read an interesting article this morning entitled "Companies race to offer instant Web search, including Twitter" by USA TODAY reporter Jon Swartz. The recent announcements that Google and Bing will include Twitter in search results raise some intriguing questions – and challenges. Ever seen a Tweet like this?

hmm tink nid >2 twitter acs; els msgs all jumbled up btwn personal & thots! dilemma!

What search methodology is going to pick that up? Why, why not and who cares? And what do you do with it anyway? Given that Twitter information is publicly available, it makes sense that natural language search is extremely valuable for finding information - whether you are monitoring brand information or looking for answers. I bet that one large US airline would love to see the many, many YouTube hits about breaking a musical instrument go away!

The fact remains that you will get a lot of search results that you still have to deal with. Say you work for Target, and you want to monitor the Twittersphere. Try searching Twitter for the word "target". Results may include target run rates for the recent NZ cricket test match, the US retail store, target marketing references or target practice. A quick search on "target" on thesaurus.com shows target used as both a noun and a verb. Target is also a proper noun. A graphical link from thesaurus.com to visual thesaurus gives a hint of the many ways the word target can be used.

Including text analytics capabilities means you can group similar documents together, thus creating categories around the document meaning. This allows you to automatically weed out the information you don't need and get to the information you do need. Furthermore, the inclusion of text analytics in social media analysis means you can analyze comments, determine relevance, find promoters, detractors and much more! Stay tuned as we explore this topic more – it’s not going away!

Friday, October 30. 2009
What's in a name? YOUR BRAND!!!
With the explosion in social media sites turning us all into digital producers as opposed to digital consumers, businesses are left in the unenviable position of needing to rapidly adapt to the new generation of customers who take their opinions global, and in a heartbeat!
These new age digital producers are leaving their opinions on social media sites at an exploding rate. It pays for companies to keep track of what their customers are saying and take the appropriate action where needed. For companies that are looking to analyze blogs, tweets, etc., having a brand/product name that is commonplace and has multiple parts of speech in a dictionary will make that analysis job much more difficult. Consider the word Jaguar, for example. Is it a car? A plane? An animal? A US sports team? An international sports team? A BBC show? A Twitter search on Jaguar may pull back information you aren't so interested in.

Conversely, having an unusual name (not ideal according to traditional marketing strategy) means it's easier to find information about you, your company, or your products. The Google name/brand would be a great example. Do a Google search on "Google" and the results (I did spend a little time wading through these) look to be all about the Google brand!!! Food for thought as you consider product naming going forward. And, who knows, there's probably someone out there who has named their son or daughter "Google", but surely so few of them that they wouldn't impact any brand analysis.

Friday, October 16. 2009
SAS is hiring - Text Analytics is Growing!
Things have been mighty busy here at SAS lately. Please accept our apology for the infrequent blog postings. The good news is that there have been some really interesting customer engagements going on that required our full attention. I just returned from working an exhibit with INFORMS in San Diego, where we had so many people ask about text analytics that we ran out of handouts. It's the buzz on social media, and the idea of mining the words and text we use in those communications, that seems to be catching on - Big Time! What was traditionally an optimization and operations research event has expanded into all sorts of analytics, with NLP and Text Analytics being cited in a dozen different sessions.

Now - for the subject line of this blog entry........ I am delighted to announce that we are expanding our Text Analytics team. We are looking for someone to perform pragmatic product management duties along with our existing Text Miner R&D experts, the Teragram employees, and our Text Analytics consulting and sales engineers. Text Analytics has grown to encompass much more than the SAS Text Miner product, which was originally launched as an "add on" piece to our Enterprise Miner offering. Today Text Analytics includes more than predictive modeling, so we are opening up new positions for experts -- thus the open position of Product Manager for SAS Text Analytics.

If you have a Bachelor's degree in computer science, applied mathematics, statistics, or a related quantitative discipline and 5 years of experience in product management, consulting, or a related function in the software industry, we welcome your application. Expertise in the application of text analytics methodologies is required, so I encourage those of you reading this blog to help spread the word to those who run one or more of the text analytics software packages now on the market. The instructions on how to apply are provided by our Human Relations department here; the job number for the TEXT ANALYTICS position is 09001816. This position is located at SAS Headquarters in Cary, NC, near the RDU airport.

I invite you to browse all our jobs as posted on the main SAS web page, which you can find by selecting CAREERS on the top horizontal menu bar and then clicking on professional opportunities. You will see that we are also seeking experienced consultants for our Advanced Analytics Lab, so those of you familiar with our SAS software may wish to apply for those jobs also. It's exciting that these technologies and the people who run them are now IN DEMAND.

Meanwhile we on the SAS team are having fun enhancing Text Miner to make the next release, version 4.2, available in early December 2009. This will be the release with many Teragram capabilities woven inside the Text Miner product. More on that topic in a future blog. Thanks for reading and ...... Thanks for carrying the message out to your boss and your clients that yes indeed, "Text Analytic Technologies are ready TODAY to be applied and make impacts in our world".

Friday, September 11. 2009
Man versus Machine ---Logical operator "&" or "V"?
Last month at the SIGIR meeting in Boston, one of the presentations given by a Teragram customer attracted notice in a Twitter post.

The NY Times automated the tagging of topics for their website by implementing software that automatically builds their indexes. However, as the tweet points out, the machine has NOT replaced Man, because the newspaper continues to rely on MANUAL entries by the people who maintain and build the New York Times Index, a more traditional index.

Stephen Arnold wondered in his blog why an organization might continue to require human labor on a task a machine can now perform. Could it be political resistance to change? Or perhaps the machine fails sometimes? Perhaps the employees without the skills to be reassigned are in fact first in line for a "pink slip" as budgets get cut? Mr. Arnold's ideas are all valid possibilities, and I've seen cases of each in my experience transferring technology from the research lab into business production environments. Those who put a stake in the ground and step forward to be the first to serve as role models for how text analytics can carry their business forward ought to pause and consider their own culture.

Since the original question was about a Teragram customer implementation, I asked Saratendu Sethi, the director of Engineering at Teragram, to share what he's observed in his consulting engagements. Here is his response.

First of all, even if automatic categorization guarantees >99% accuracy, for a news company it is absolutely critical not to portray wrong information in even 1% of cases. This can only be verified by having humans validate the categorization results. They are doing that on a subset of articles, e.g. front-page articles.

Secondly, new topics constantly emerge in the coverage of current events. Even the best text mining algorithms can’t achieve perfection in spotting emerging topics because these algorithms are usually based on processing past content. Also, the definition of emerging topics is based on human perception, which is affected by time, location and the type of entities involved in the event. Therefore, these topics have to be manually spotted and added to documents/taxonomy while they are emerging.

Having said that, the following are four benefits that Teragram categorization achieves for the New York Times:

(1) If two people are asked to suggest categories for the same document on their own, they are always going to come up with different categories. Automatic categorization enforces consistency and removes human subjectivity by automatically suggesting categories to them.

(2) Automatic categorization saves time because it is easier to ask editors to select appropriate categories from an automatically generated list than to have them think the categories up themselves. With automatic categorization, I can spend just a few seconds, but with manual categorization I have to spend a few minutes reading the content and deciding on the appropriate topics.

(3) Entity extraction (e.g. identifying persons, locations, etc.), which doesn’t require much human input, is automated.

(4) Automatic categorization enables the New York Times to process all their past archives. Currently the New York Times re-processes all of their past 25 years of content with updated taxonomies every few months.

4a. The human editors are only reviewing articles for the current day (~500-1000 articles/day), whereas the past archives might include 100K articles/year.

4b. If “swine flu” was only identified as a news topic in 2009, then automatic categorization allows NYT to find out what other news about it appeared in the past.

So what do you conclude from this post? How would YOU answer the question posed in the title of this entry? Is it cost effective to apply MAN "and" Machine together -- or has the science progressed enough to replace MAN? Is it time to choose and go with a Man "or" Machine approach when deciding how to become more efficient?

Saratendu answers with the "AND" operator -- and that's the answer I prefer too -- 'cause I'm not comfortable letting those sci fi robots and machines take over my world. How about you?
Posted by Mary Grace Crissey
Defined tags for this entry: artifical intelligence, content categorization, extraction, information retrevial, teragram, twitter
Thursday, September 3. 2009
Text Analytics for Ye of Little Faith
While at a customer site last week, I presented our text analytics capabilities (text mining, content categorization, sentiment and crawling). Before the meeting proper, one attendee admitted that he wasn’t a text analytics believer.
I guess he was warning me not to wave my hands, vaguely referring to some "higher power" of fancy math and linguistics. At 30k feet in the air (on the return flight home), I realized my missed opportunity. I wish I'd casually explained to him how the car he drives is safer today thanks to text analytics. It means a lot to me that text analytics makes such a difference in people’s lives – even if they don’t realize it or "believe". Over time the dismissive "yeah right" doubters will see the obvious. Artificial Intelligence and Text Analytics are due respect today as Science rather than Sci Fi entertainment.

All that said, I expect that our Text Frontier followers are already believers, so my blog post might be in vain. You may have felt the rush of adrenaline when discovering the treasure of a rare "Ah-Ha moment" when insights are found buried in text. Let's get creative and brainstorm how to "evangelize" others and bring them into the fold. Gradually people are taking notice -- our profession is progressing and earning trust. Future posts here will highlight customer success stories SAS and Teragram have documented. Please drop a comment on our blog and share any TM-related jokes or favorite one-liners that you use to turn heads and build curiosity about this field of analytics. Better yet -- prove it with your own innovative implementation at your office. Seeing is Believing!
Posted by Manya Mayes
Defined tags for this entry: auto, customers, manufacturing, safety, sentiment, significance, success stories, webcrawler
Tuesday, August 4. 2009
Keeping SAFE with Text Analytics
Where has the time gone? Here in Texas the hot sun is still roasting, while local retailers are promoting their back to school items on sale. For several weeks now, I've had a blog idea brewing from a talk entitled "WHY COUNT CRIME WHEN YOU CAN PREVENT IT?" You'll see why it caught my interest by noting the image in the top left corner of this slide.
Dr Colleen McCue shows how handwritten police notes and data taken from phone calls can be analyzed to predict future locations and potential criminal events. She'll be speaking live at M2009, but for those of you who want to hear her sooner, you can view the archived presentation at your leisure. Her engaging explanation illustrates how analytics are helping police departments do their job of keeping neighborhoods safer.

According to Dr McCue, "Automated text analytic software could be game changing in information intensive tasks (e.g., a major case will have thousands of tips – the DC Sniper case was compromised in some ways because people focused on the “white van” – the software won’t get tired, bring bias, or forget what it just read). It also has tremendous potential in culling through a lot of interview data (e.g., the detainee data), particularly when you have disparate sources that are geographically diverse but likely connected (through common operational goals, training, etc)."

Three cheers for the FBI, local police - and your local government -- all holding future potential customer success stories for text analytics. Meanwhile, you don't want to miss the recent white paper Text Mining for Safety, describing how the Oil and Gas industry sees Text Analytics as the answer to moving beyond simply tracking accidents (counting them) to REDUCING hazards on the job. Text Analytics is keeping us safe on the job and at home.

Tuesday, July 28. 2009
Calling all Lone Rangers
With SAS analytics, many of you are breaking ground as you strive to deliver more value from textual data. Data mining has matured into an accepted practice for Customer Relationship Management teams in Telco, Finance and Marketing. I'd go as far as to say that it's become essential to the survival of most large companies across the globe. Text Mining, as you readers are well aware, is not yet as popular, with many employers assigning just one or two of you responsibility for text mining.

It can be a burden to struggle alone in a silo without anyone to bounce ideas off or brainstorm with. To make it easier for you to connect with peers who share a passion for these technologies, we set up a discussion forum on the topic of SAS Enterprise Miner and SAS Text Miner two months ago. While SAS employees may participate in these discussions, this forum is not meant to replace the SAS Technical Support help center.

Another excellent way for you to get feedback on your work is to respond to the SGF Call for presentations 2010 (Seattle). Honestly, one of my favorite things about SAS is our innovative customers implementing software on real-world challenges. Only a few text related topics have surfaced in the discussion forum to date, so I’m writing this blog to encourage more of you to join in and set up your profile. Please accept my invitation to post questions, experiences, and thoughts on best practices.

Friday, July 10. 2009
Why customer intelligence will fail without text mining (cont.)
My colleague Mark Chaves, product manager for SAS Customer Intelligence, responded to my earlier post “Why customer intelligence will fail without text mining,” with some strong opinions of his own. And remember – he’s a marketing guy, too! Read on:

I agree with Manya’s comments and wanted to add that advertising as a medium through which marketers communicate is evolving, not diminishing. Coming back to my comment about marketers beware, tying promotions to the right kind of consumer reviews could be extremely valuable. Text mining can analyze consumer reviews to help identify the appropriate comments and segment(s) to go after.

Why customer intelligence will fail without text mining
I'm doing my weekly round-up of text mining/unstructured data/information management news. Having lived on several continents around the world, I like to make sure my information hunting is equally intercontinental. Different cultures have different slants on topics.

This morning's search led me to an article posted by the New Zealand Herald entitled "How Can YouTube Survive?" A section of the article mentioned the insider's technology blog TechCrunch and a guest blog post entitled "Why Advertising Is Failing On The Internet", written by Eric Clemons, Professor of Operations and Information Management at the University of Pennsylvania. A very interesting read, and also a very provocative one (as the many comments on the blog post confirm).

According to this excerpt from the NZ Herald article, Clemons "argued that the way that we're using the Internet has shattered the whole concept of advertising. We need no encouragement to share our opinions online regarding products and services and offer them star ratings; as a result, we're much more likely to look for personal recommendations from other customers than wait for a gaudy advert to beckon us wildly in the direction of a company website or online store. He claims we don't trust online advertising, we don't need online advertising, but above all we don't want online advertising."

Based on my personal Internet shopping habits, I agree! I'd much rather see personal testimony about a product in addition to (or instead of) marketing collateral. This personal testimony has become a new form of marketing. It would serve marketing professionals well to pay attention. Understanding individuals' commentaries about products helps marketers better understand consumer reaction to the four P's of the marketing mix: product, price, placement and promotion.

This evolution of marketing influencers is exactly what makes text mining a pivotal technology for this generation. It provides the ability to gauge those huge volumes of Web-based consumer reactions in an automated, consistent manner. And then you can actually do something about it -- or with it!

Friday, June 26. 2009
Travels to Paris and Copenhagen this week!
SAS is sending data & text mining experts (including Teragram employees) over the ocean to Europe for two different events this week.
We'll have a booth in the exhibit hall at KDD09 Sunday through Wednesday. If you are one of the lucky ones attending KDD, mark your program to attend the panel discussion and listen to Dr Wayne Thompson from SAS talk about Emerging Trends in Open Standards and Cloud Computing for Data Mining. Even if you don't make it to the KDD conference to personally pick up the new book authored by the conference chair John Elder, you can experience our Software-on-Demand version of data mining by buying his book, "Handbook of Statistical Analysis and Data Mining Applications."

The second event where you can find us is the SAS conference devoted to ANALYTICS, called A2009, in Denmark on July 1-2. The program is online. There you can read the abstract about a Swedish insurance firm that studied handwritten notes collected by police officers and security guards during 2004-2007.

At both shows, you'll be able to see live demos of our software and pick up a hard copy of the most recent fact sheet, highlighting the enhancements that are now available with the TEXT MINER 4.1 version that was made available to customers 5 weeks ago. Those of you reading this blog who haven't yet seen it may want to read the fact sheet on our SAS 9.2 release of Text Miner on the SAS website.

What does your summer hold for you? Do you have travel plans to shows or conferences with text analytics tracks or sessions included? Please add a comment to this blog and do share!

Friday, June 19. 2009
IDG asks 131 executives about their IT spend priorities for 2009
A recent survey by IDG Research Services highlights Business Process Automation as an IT priority.

Some of the findings include:

• More than 2/3 of respondents are automating most of their core business processes
• Another 21% are moving towards this goal
• 87% consider BPA to be a critical or important IT priority
• 87% see a connection between unified communications and process automation
• More than one third envision communication technology being incorporated into BPA in the future

Even though I have not spoken with Joe Staples and Brad Herrington from "Interactive Intelligence", I share their observation that many in today’s economic environment are trying to streamline operations and do more with less. As organizations seek ways to be more efficient, both in the front office and back office, we might position our technology as a tool for automating business processes, leading to improved business results. Have any of you motivated your IT department to spend $$ on Textual Analytic software, or recruited support for your research program, with this approach?

It's rare for BPA companies to include automating manual processes surrounding words or unstructured content via TEXT Technologies. After I watch the webcast on June 25 and get the white paper, I'll let you know if any mention of Natural Language Processing, Content Categorization or Sentiment Analysis is made.

Meanwhile, it's up to all of us to continue to promote awareness and implement Text Analytics in real world situations. We aren't talking about a dream of some vague, emerging, futuristic possibility; the time is now to include text communication alongside the traditional data sources of computer processing applications. When we combine text analytics with mathematical optimization and predictive analytics, we can go well beyond merely automating business processes by improving them and discovering entirely new processes leading to a sustainable future. Thanks for reading.

Wednesday, June 17. 2009
Text Speak
I just posted a tweet to my @ManyaMayes Twitter account. In order to get my message across in 140 characters or less, I had to shorten my text. This is a very common practise for mobile phone users, who send text messages that look a lot like a foreign language. My Mum writes messages that are so clipped that I have trouble deciphering them! As a BlackBerry user, I send email messages but I rarely send SMS messages. I've spent many years making sure I write messages that are easy for audiences to understand. It's going to take me a while to get used to writing clipped text (writing in text speak) as part of my job. It goes against much of my professional training to write like this:

u no wot u no & u don't no wot u don't

How does text mining handle this? One approach would be to specify synonyms for these clipped terms:

u = you
no = know
wot = what

But "no" and "know" are both valid dictionary entries, so this immediately causes a follow-on problem: surely not all occurrences of "no" should be replaced with "know". Deciding which occurrences of "no" should be replaced with "know" is aided by using the additional context of the document. Boolean and linguistic rules can help with this.

It can be difficult to solve data quality problems like this, and solutions are typically specific to both the data and the application. For example, the way you would replace R&R would depend on whether the data came from a forum for military personnel talking about upcoming "rest and relaxation", from a warranty report describing "repair and replace" for a defective part, or from somewhere else...

Thursday, June 11. 2009
Sentiment Analysis Overview
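Looking back at the "u no wot u no" example from the Text Speak entry, here is a minimal sketch of how a synonym list plus one context rule could work together. The lookup table, the trigger words and the function name are all hypothetical illustrations, not how any SAS or Teragram product does it; a real system would need far richer linguistic rules.

```python
# Hypothetical lookup of clipped terms that are safe to expand everywhere.
UNAMBIGUOUS = {"u": "you", "wot": "what"}

# Hypothetical context rule: read "no" as "know" only when the previous
# token is a subject pronoun or a negated auxiliary verb.
KNOW_TRIGGERS = {"u", "you", "i", "we", "they", "don't", "dont"}


def expand_text_speak(message: str) -> str:
    """Expand clipped terms, using the previous token to disambiguate 'no'."""
    tokens = message.split()
    expanded = []
    for i, token in enumerate(tokens):
        lower = token.lower()
        previous = tokens[i - 1].lower() if i > 0 else ""
        if lower == "no" and previous in KNOW_TRIGGERS:
            expanded.append("know")
        else:
            # Fall back to the plain synonym list, or keep the token as is.
            expanded.append(UNAMBIGUOUS.get(lower, token))
    return " ".join(expanded)


print(expand_text_speak("u no wot u no & u don't no wot u don't"))
print(expand_text_speak("no thanks"))
```

With this rule the clipped message expands to "you know what you know & you don't know what you don't", while "no thanks" is left alone. The rule is deliberately crude: it only looks one token back, which is exactly why data- and application-specific rules are usually needed.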
I saw the following comment on Twitter yesterday about sentiment analysis limitations and decided it would make a good topic for a blog update:

@concannon: Can anybody explain to me why automated sentiment analysis is anything more than flaky, snake-oil BS? The technology just isn't ready yet.

I’m going to make a bold statement here – automated sentiment analysis, using the right methodology, is actually superior to human sentiment analysis. Bear with me and read through.

The available approaches to analyzing sentiment/satisfaction vary based on the data provided. I would categorize the approaches based on the availability of three types of data:

1. Customer feedback (free-form text) with customer ranked satisfaction (discrete value), like Amazon product reviews.
2. Customer feedback (free-form text) with manually ranked satisfaction (discrete value), where human readers subjectively score the content.
3. Customer feedback only, no ranked satisfaction, as with blog posts and comments.

For the first data type, machine learning algorithms do a good job of measuring overall sentiment (say, +ve/neutral/-ve). Examples of data suitable for this approach are survey data and product review forums. The problem is that not a lot of text is gathered this way (with a purpose in mind). Even if it is, the machine learning algorithms struggle with distinguishing positive elements from negative. It's one thing to know that a customer is dissatisfied; it is another to know what about!

For the second data type, where there is no customer ranked satisfaction, it is possible to build a statistical model using a sample of manually ranked documents, then automatically score the remaining unranked documents. Not many companies are willing to do this. It also doesn't truly represent the customer’s opinion - just the reader’s interpretation of what the customer thinks.

For the third option, customer opinion with no ranking, you can derive sentiment from the context of the text using natural language processing, or NLP. This data is most common and hence so are the approaches to analyzing it. It’s not easy, but it’s the sweet spot for gaining value from the massive volumes of consumer generated text.

One widely available, cheap technology assigns positive or negative values to individual words, then sums them to get an overall sentiment rating. This approach fails in situations like the following:

"It's not bad" (two negatives that actually suggest a positive)
"I'm not going to say this sucks" (sarcasm or humor)
“The keyboard is impossibly small but the display is the best I’ve seen.” (combination)

The most recent advances in sentiment analysis technology use a combination of techniques: (1) statistics, (2) rule-based definitions and (3) human intervention, e.g. a final review of the machine scoring. The results are less expensive than human-only sentiment analysis, and more consistent. Why? Because the automation adds consistency, while the human verifies the result. Put in the right workflow, this combination also increases scalability by a substantial factor.

Teragram, a division of SAS, announced the Teragram Sentiment Analysis Manager at the Text Analytics Summit in early June. More to come on that!

The Phenomenon that is Twitter
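To make the word-summing approach from the Sentiment Analysis Overview entry concrete, here is a minimal sketch of that cheap technique, run on the three failure cases quoted there. The tiny lexicon, its scores and the function name are assumptions made up for illustration; they are not taken from any product.

```python
import re

# Hypothetical word-polarity lexicon: +1 for positive words, -1 for negative.
LEXICON = {"bad": -1, "sucks": -1, "small": -1, "best": 1, "great": 1}


def word_sum_sentiment(text: str) -> str:
    """Score each word independently and sum; context is ignored on purpose."""
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(LEXICON.get(word, 0) for word in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


examples = [
    "It's not bad",
    "I'm not going to say this sucks",
    "The keyboard is impossibly small but the display is the best I've seen.",
]
for sentence in examples:
    print(word_sum_sentiment(sentence), "<-", sentence)
```

Under these assumptions the first two sentences come out "negative", because "bad" and "sucks" are counted while the negation around them is ignored, and the third comes out "neutral", because one complaint and one compliment cancel out and the detail about which feature was liked is lost. That is the gap the rule-based and human-review steps are meant to close.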
I mentioned the buzz around Social Media Analysis (SMA) at the Text Analytics Summit. If we took all the speakers' content and produced a tag cloud, Twitter would have the biggest 'floor space'. I don't think there was a single presentation that did NOT mention Twitter.

While doing some background research for SMA, I ran across an article entitled State of the Twittersphere that HubSpot blogged about just this week (that's @HubSpot for the 55.5% of Twitter users that don't follow anyone). There are a lot of really great Twitter usage statistics in this report. It's amazing how many people sign up with Twitter but are very inactive (I have multiple Twitter accounts and one is definitely contributing to inactivity).

I'm more interested in those users that are very active. It would be good to connect with other users who post materials similar to my own (like a document recommendation system), and Text Mining can definitely help with this. I'd also like to see something like a “users who posted materials like this also connected with these users" feature - like the recommendations you get from Amazon.

Ranking the tweets of users you follow based on content would also be fabulous. Some users post both personal and business related materials. I personally prefer not to read the personal posts (sorry y'all). Having personal tweets, or topics less interesting to me, appear further down the list (if at all) would be another desirable feature...

I have a bunch of other recommendations for Twitter product management - as do many other Twitter users. How about using Text Analytics/Text Mining for managing product requirements...
ABOUT THE TEAM
I'm Manya Mayes, SAS Chief Text Mining strategist. On this blog, my colleagues, friends and I discuss unstructured text and understanding the voice of the customer. Plus a few more things.
