New services expand IBM Watson capabilities to images, speech, and more

203 points| jsstylos | 11 years ago |developer.ibm.com

106 comments

pesenti|11 years ago

Some context on the new services. They are built on technology that comes from IBM Research and was moved into the Watson group in 2014. Some, like speech, have been in development for more than 50 years. None of these technologies overlap with the Watson Jeopardy stack (except for the Watson voice). We will release that stack later this year as a series of services allowing you to build a full Q&A/dialog application.

All the Watson services are still in beta but will start going GA very soon (first one next month). If you have any questions, please fire up, the Watson team is ready to answer.

pgeorgi|11 years ago

> allowing you to build a full Q&A/dialog application.

> If you have any questions, please fire up, the Watson team is ready to answer.

So that's what you built Watson for :-)

Caligula|11 years ago

I find it terribly confusing. It does not explain what instances are. Do I need an instance to access some of the services?

I just want to access some services via API from my own servers. I think the documentation is not that good; there should be curl examples at least, for instance for the STT or TTS.

Does the STT have speaker identification or does it output text in one stream?

I tried to access: https://gateway-s.watsonplatform.net:8443/speech-to-text-bet...

I used my Bluemix login and password. It did not work. Are other API credentials needed?

jcfrei|11 years ago

Do any of the Watson services allow for feedback to train them?

frik|11 years ago

Do you plan to open source some of your stuff (voice recognition, speech synthesis, gazetteers, UIMA related code)?

Watson Jeopardy itself is built on top of an Apache open source stack (Apache UIMA and Hadoop): http://en.wikipedia.org/wiki/UIMA

ubercore|11 years ago

Are you working on any audio (non-speech) analysis services? I have no particular use case in mind, but it's an area I'm always interested in!

kastnerkyle|11 years ago

What techniques are being used for text to speech? Is it something deep learning related or more standard HMM synthesis? Any paper references?

devniel|11 years ago

Great, I'm waiting for it. I can't do much with the preloaded domain on the Q&A service.

qeorge|11 years ago

I've been uploading the easiest photos I can find to the visual recognition demo[1], and it's yet to get one right.

For example, I searched Google for "photo of girl", and found this image which seems very easy:

http://www.wagggsworld.org/shared/uploads/img/rachel-s-p-pho...

Watson says:

    Color             71%
    Human             67%
    Photo             65%
    Dog               59%
    Person            57%
    Placental_Mammal  56%
    Animal            50%
    Long_Jump         50%

Huh?

This isn't me cherry picking bad results; aside from their demos I'm not finding any photos that are accurately classified. I even tried a headshot of a person isolated on a white background, and Watson told me I uploaded a photo of "shoes".

Seriously - how is this data useful? What could I build with this level of accuracy?

Watson team - do you agree? Is this product about to get a lot better, soon, or is this considered "pretty good"?

[1] http://visual-recognition-demo.mybluemix.net/

pesenti|11 years ago

The top 3 classes in your example are actually correct - it is a color photo of a human. But we expect it to get much better over time. Only real world usage will allow us to make real improvements - and that's why we are eager to release early.

We also believe that the first applications (e.g., classifying animals or plants or landmarks in dedicated apps) will have narrower use cases that give better accuracy.

bane|11 years ago

The problem with AI systems has almost always been that they tend to be both right and wrong in ways that humans would never be.

Watson gives high confidence to it being a color photo of a human (which is a Person, and an Animal). Which is right. But the only part that a human would ever really care about is that there's another human in the picture.

It gets things wrong with a reasonable confidence for Dog, Placental_Mammal and Long_Jump...importantly, these are wrong in ways that humans would never get wrong.

Just as important are the omissions. A human would probably describe this as a picture of a girl or young woman, laughing or smiling, with curly brown hair wearing a scarf -- and maybe some other incidental information.

Of that description, Watson only got the superclass of one part correct (Human, Person) and didn't provide any of the other parts.

AI fundamentally "thinks" differently than a human, and that makes it hard for humans to use AI as a cognitive enhancement tool in the same way humans use calculators, books, writing, etc. We don't trust what an AI is doing or the answers it provides, because AIs tend to give right-and-irrelevant or weirdly wrong answers, or to omit obvious and necessary information that a human might use for informational purposes.

If humans ever encounter aliens, it's likely that their mode of thinking will be just as different. So bridging that gap, and figuring out how to make AI like this useful could be a useful endeavor.

Patrick_Devine|11 years ago

I gave it a picture of a cat (http://upload.wikimedia.org/wikipedia/commons/2/22/Turkish_V...) and got:

    Photo             75%
    Shoes             69%
    Nature_Scene      69%
    Meat_Eater        63%
    Object            63%
    Mammal            63%
    Vertebrate        63%
    Cat               63%
    Indoors           62%
    Room              60%
    Person            58%
    Color             57%
    Judo              54%
    Person_View       53%
    Human             51%
    Leisure_Activity  50%

If you give the classifier a hint (animal) it gives:

    Meat_Eater        63%
    Mammal            63%
    Vertebrate        63%
    Cat               63%

So, clearly needs work as a general classifier, but still potentially useful.
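
The narrowing effect of the hint can be approximated on the client side. This is a minimal sketch, assuming the demo output is available as whitespace-separated "Label NN%" text as pasted in the comment above. `parse_labels`, `filter_labels`, and the `ANIMAL_LABELS` set are hypothetical names chosen for illustration; they are not part of any Watson API, and the set is only a stand-in for whatever the "animal" hint selects on the server.

```python
# Hypothetical helpers for illustration; not part of any Watson API.

def parse_labels(text):
    """Parse whitespace-separated 'Label NN%' pairs into (label, score) tuples."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating label and percentage tokens")
    pairs = []
    for label, pct in zip(tokens[0::2], tokens[1::2]):
        pairs.append((label, int(pct.rstrip("%")) / 100))
    return pairs


def filter_labels(pairs, allowed=None, min_score=0.0):
    """Keep labels at or above min_score and, if given, inside the allowed set."""
    return [
        (label, score)
        for label, score in pairs
        if score >= min_score and (allowed is None or label in allowed)
    ]


# Output pasted from the comment above.
raw = ("Photo 75% Shoes 69% Nature_Scene 69% Meat_Eater 63% Object 63% "
       "Mammal 63% Vertebrate 63% Cat 63% Indoors 62% Room 60% Person 58% "
       "Color 57% Judo 54% Person_View 53% Human 51% Leisure_Activity 50%")

# Assumed stand-in for whatever the "animal" hint selects on the server side.
ANIMAL_LABELS = {"Meat_Eater", "Mammal", "Vertebrate", "Cat"}

animal_only = filter_labels(parse_labels(raw), allowed=ANIMAL_LABELS)
```

A dedicated app, such as the animal or plant classifiers mentioned elsewhere in the thread, could apply the same kind of whitelist plus a `min_score` cutoff to drop labels like "Shoes" or "Judo" before showing results.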

SlipperySlope|11 years ago

Compare the Watson text-to-speech voices with Nuance ...

Watson http://text-to-speech-demo.mybluemix.net/

Nuance http://www.nuance.com/for-business/text-to-speech/vocalizer/...

I prefer the Watson version voicing a sample paragraph. Both are good enough for an application that selects on price. For a voice-first application, maybe Watson is better for TTS.

For speech to text, Nuance has been the leader, e.g. Apple's Siri. Has anyone compared IBM speech recognition to Nuance, Microsoft & Google?

picheny|11 years ago

We know we have strong core speech technology, based on comparisons from competitive evaluations run in conjunction with various government-funded speech programs. However, our service is still very new. We could have waited for months to tune it, but our primary goal here is to solicit feedback from the community for how to make our services easier to use, especially in the context of our other platform services. We don't want to wait till the design is so mature that it is impossible to change - so any and all feedback is very welcome!

yourapostasy|11 years ago

For TTS, compare further with Vocalware and CereProc

Vocalware https://www.vocalware.com/index/demo

CereProc https://www.cereproc.com/

It is getting increasingly difficult to pick one as the clear leader for "natural sounding". The results are good enough for voicing canned text, and certainly better enunciated than many thick-accented English speakers. Improvements through training can still be made in parsing the text.

For example, IBM Watson interprets "IT" as "it" in the following sentence.

Thank you for calling the IT department.

Vocalware and CereProc correctly parse that.

I would really like to hear opinions from professional voice actors, though they would understandably be leery of lending a hand to improve TTS. Is there a standardized way of writing text that communicates the emphasis, placement of silence, and warping of phonemes these actors use in their delivery to concisely convey emotion, and that TTS products could adopt?
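
One existing standard in this direction is the W3C Speech Synthesis Markup Language (SSML), which defines elements for pauses, emphasis, and spelled-out text. The following is a minimal sketch that marks up the "IT department" sentence. Whether any of the engines compared in this thread honor these elements is not established here; `say-as` values such as `characters` are described in a W3C note and engine support varies. `spell_out` and `to_ssml` are hypothetical helper names.

```python
from xml.sax.saxutils import escape


def spell_out(token):
    """Wrap a token so an SSML-aware engine may read it letter by letter."""
    return f'<say-as interpret-as="characters">{escape(token)}</say-as>'


def to_ssml(parts, lang="en-US"):
    """Join already-marked-up fragments into a single SSML document string."""
    body = " ".join(parts)
    return (
        f'<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{body}</speak>'
    )


ssml = to_ssml([
    escape("Thank you for calling the"),
    spell_out("IT"),                                      # acronym, not the pronoun
    '<break time="200ms"/>',                              # short pause
    '<emphasis level="moderate">department</emphasis>.',  # mild stress
])
```

The markup only expresses intent; how natural the result sounds still depends on the engine's voice and prosody model.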

AustinG08|11 years ago

My evidence is anecdotal at best, but I have found Siri to be terrible and my "OK, Google" to be wonderful.

cypher543|11 years ago

The Watson voice is great, but I think CereProc voices sound the most natural. Also, I like that you can use them offline.

Kronopath|11 years ago

The text-to-speech is surprisingly good, but I'm amazed at one thing, and not in a good way: the Spanish voice can't pronounce the word "Español". It pronounces it as "Espanol" with a hard "n" sound. In fact, it seems to pronounce all "ñ"s as "n"s. How that kind of an oversight got into the system, I'll never know. Did no one think to check?

Edit: And to add insult to injury, the English voices do pronounce "Español" correctly!

bkeroack|11 years ago

Pricing page (which they don't make easy to find): https://console.ng.bluemix.net/#/pricing

When this was first announced I remember reading about their pricing model where they would take a percentage of app revenue. I'm glad to see they offer flat pay-as-you-go pricing now. Some of the Watson services are intriguing.

jsstylos|11 years ago

I'm on the Watson team and we're interested in learning from developers to make our APIs and documentation easier to use. Have feedback? We'd love to hear it. Email: jsstylos@us.ibm.com, Twitter: @jsstylos

bhuga|11 years ago

The text-to-speech is actually a little nicer than Siri or Cortana, but not groundbreaking. This was the only one of the 5 that I thought did well. The rest might have been better without demo pages.

For visual recognition, I used a picture of a snowmobile from http://www.1888goodwin.com/2013/11/14/what-do-you-need-to-do..., which it identified with 73% confidence as "Invertebrate".

Speech to text is a parody twitter account waiting to happen. Here's me asking it how it does with technical transcription:

How do you doing technical words.

If you were going to have to talk about get an jute cushion pull.

And you wanted to discuss the impact on a file server memory.

Issues that cross processes talk about home forks rivers slowed difficult.

cma|11 years ago

Maybe it overtrained on post-accident snowmobile riders.

jp8000|11 years ago

Make sure you use a headset, not your laptop's microphone.

humanfromearth|11 years ago

I tried using Watson a month ago without much success. I wanted to classify some arbitrary text, i.e. have it say which category a given text belongs to. But as far as I could understand, it only allows using their own datasets.

It's not possible to train their service with your data, unlike wit.ai for example. Seems obvious to me that people would want to train with their own data.

pesenti|11 years ago

Pretty much all the services that we are releasing will have some adaptation capabilities - allowing you to provide your own data, create your own models, etc - at some point. Stay posted.

FrankenPC|11 years ago

Text to speech is pretty good. http://text-to-speech-demo.mybluemix.net/?cm_mmc=developerWo...

I decided to test it a little. I copied phoneme challenges and nonsensical phrasing from the web. Then I added some stuff that I know has problems from past experience.

-----

Let's explore some complicated conversions, shall we? The old corn cost the blood. The wrong shot led the farm. The short arm sent the cow. How can I intimate this to my most intimate friend? Don't desert me here in the desert!. They were too close to the door to close it. The buck does funny things when does are present. Today is 1/1/2015. Today is Jan 5th, 1992. It's currently half past 12. Or 12:30PM. Twenty thousand dollars. 20,000 dollars. 20 thousand dollars. 2^5 = 32. NASA is an acronym. This ... is a pause. EmailAddress@somedomain.com.

Poiesis|11 years ago

Two things that jumped out at me:

1. No "special characters" allowed in passwords when creating an account.

2. ...where's the REST API? I've "added a service" (TTS), but I have to write a webapp to expose it over HTTP? It sure is a different experience than your typical API documentation.

jsstylos|11 years ago

1. This is good feedback, thanks.

2. The REST API docs are at https://www.ibm.com/smarterplanet/us/en/ibmwatson/developerc... You can call the service directly, though the samples show using an HTTP webapp as a proxy to avoid exposing private service credentials. We're still working on the documentation, so feedback is helpful here. What other service REST API docs do you like, just out of curiosity? What are the features that make that documentation useful?
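
The proxy pattern described above keeps service credentials on the server. Here is a minimal sketch of the server-side half, assuming the service accepts HTTP Basic authentication with per-service credentials and a JSON body with a `text` field (neither is confirmed in this thread). The URL, the environment variable names, and `build_upstream_request` are hypothetical, and the request is only constructed, never sent.

```python
import base64
import json
import os
import urllib.request

# Hypothetical endpoint; the real URL is in the service's own REST docs.
SERVICE_URL = "https://example.invalid/text-to-speech/synthesize"


def build_upstream_request(text, username, password, url=SERVICE_URL):
    """Build (but do not send) a POST request carrying Basic auth credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({"text": text}).encode("utf-8")  # assumed payload shape
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    return request


# Credentials are read on the server (hypothetical variable names), so the
# browser only talks to the proxy and never sees them.
request = build_upstream_request(
    "Thank you for calling the IT department.",
    os.environ.get("SERVICE_USERNAME", "demo-user"),
    os.environ.get("SERVICE_PASSWORD", "demo-pass"),
)
```

A webapp route would call something like `build_upstream_request` with text received from the browser, send it with `urllib.request.urlopen`, and stream the response back.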

karmacondon|11 years ago

So the gist of what I'm seeing in this thread is, "Watson's API services aren't very good yet, but they will get better as it collects and processes more data".

So basically, IBM is charging us to provide it with training data to make Watson useful for practical applications. Makes sense, but I can't help but feel that it would be a smarter move to skip charging entirely for now, or to use drastically reduced pricing tiers that exist only for the purpose of preventing abuse. The idea of releasing a product like this with less than impressive demos is a bit of a risk. It's not going to encourage people to use it if the demos aren't compelling, and the demos won't be compelling until a lot of people are using it. I'd err on the side of optimism here, it'll probably work out for the best, but it will be interesting to see how this goes and provide a good case study.

My other thought is that if IBM can't get sufficient training data on their own, what hope do the rest of us have? Performing classification on arbitrary data is a herculean task. People could throw literally anything at this API and will expect to get common sense results. It's nearly impossible, and it pushes the boundaries of what even cutting edge software can do. But if a company like IBM spends billions of dollars and their demos still end up generating mostly confusion and complaints... This kind of open ended "AI" might be more difficult than even the most conservative experts thought.

EDIT: As an afterthought, the real value here isn't so much software as it is pooled training data. Facebook has been able to identify human faces in photos for years; speech-to-text and concept modelling have been around for a long time. What's difficult is getting the labelled data necessary to distinguish between "is this a picture of a person or a picture of a cat?". Watson is great and it seems like IBM has made an investment in acquiring and collecting the data necessary to do that. But their big play here might be to build a consumer-friendly enough product that their users contribute the rest of that data for them over the next several years, building an aggregate data set that is worth as much or more than the software itself. Again, will be interesting to see how it plays out.

jsstylos|11 years ago

All of the Watson services are free in beta. (Bluemix, through which the services are accessed, requires a credit card after 30 days, but doesn't charge you for use of the beta Watson services.)

We wanted to get the services into people's hands early, even though we're still working on them, rather than wait until we had a perfect product. There's a tradeoff here, but we figure that we can improve the services faster and better with public usage and feedback than we could in private isolation.

Since they're free, hopefully people will be able to have some fun playing around with the services, also!

taliesinb|11 years ago

> What's difficult is getting the labelled data necessary to distinguish between "is this a picture of a person or a picture of a cat?". Watson is great and it seems like IBM has made an investment in acquiring and collecting the data necessary to do that.

Are they using more than ImageNet? The ImageNet dataset(s) are not hard to get.

flamedoge|11 years ago

Real value is heuristics, or learning algorithms to refine heuristics. Data is always growing.

walterbell|11 years ago

Should the training data set be open-source?

corin_|11 years ago

@IBM people: Is there any information available yet either regarding future pricing, or regarding timeline for getting pricing information?

cabirum|11 years ago

Visual recognition has some room for improvement

http://i.imgur.com/V59IeQH.png

aroopPandya|11 years ago

Hey, try changing the classifier from "All" to "Scene". It does much better. And stay tuned: we will release some more APIs on top of visual recognition to allow for image labeling.

enricobruschini|11 years ago

I've been developing a product with Watson from within the Partner Ecosystem, and some of those capabilities are pretty useful. Others are sometimes kind of confusing, creating a broad, overpopulated constellation of Watson-based APIs inside Bluemix.

ConfuciusSay|11 years ago

Now you can buy back stock algorithmically in the cloud!

jcoffland|11 years ago

Don't pay a company to do what can be done with a library.

z3phyr|11 years ago

This

>>Speech to Text : This application only works in recent versions of Chrome supporting HTML5 audio capture

picheny|11 years ago

Yeah, Chrome currently seems to have the best support for audio capture...

anonbanker|11 years ago

Can we all just drop the charade and start calling Watson SkyNet already?

taf2|11 years ago

How do I sign up and pay them money?

tparikh|11 years ago

Watson services on Bluemix are currently in beta. You can use the beta services at no charge, even after your 30-day Bluemix trial, although you will need to provide a credit card to Bluemix. You will not incur any charges unless you use any of the production services.

niels_olson|11 years ago

In other news, Watson will be RA'd at the end of the month.

johnward|11 years ago

If anyone is not going to be RA'd it's the Watson group. There is a lot riding on the success of Watson.