Technological Grail Quest and some reviews too.

Experiments in Human Language Computer Interpretation

Seeking a more natural interface with the computer, this document represents additional forays into the world of voice recognition, speech synthesis, and artificial intelligence, mostly utilizing open software built on the Linux kernel. Over the last few years I have studied this on and off, with various degrees of implementation on both Linux and Windows operating systems, and my conclusion has been that it was not quite time yet for a true HUI to exist. A true Human User Interface would require a set of software that allows the computer to interpret human input in a natural way. We as humans have learned the way computers work, and I believe the time has come for computers to learn the way humans work instead. To accomplish this, the computer must interact with humans the way that humans interact with each other: through voice, gestures, and touch. We do not just use the voice to convey messages, but also our body language, and quite often our proximity to each other or lack thereof. We may describe something as being "over there", which requires a lot of knowledge that a computer would not have, as it would have to know the relative placement of everything, and also know which direction we are indicating, sometimes only with our eyes. This level of interaction incorporates the understanding of gestures (our physical indication of posture, direction, and so on), of location information, of vocal inflection, and of other factors, as well as the ability to ask for more information in a manner that would be appropriate to human interaction.

The hardware that is available off the shelf today is just barely capable of this type of processing, at least to some degree; however, it would still require a large capacity of memory, especially long-term memory, to learn enough to be of use to people on a regular basis. One thing that has come along to facilitate this is the cloud. The cloud provides the processing power and the memory capacity to make this possible, and for the widest population of users, as the cloud is accessible from any browser.

My work with the cloud has been limited, and mostly through Google's version. I could go into this, and it would be a worthwhile waste of time, but I am not going down that avenue now; instead I will give a short overview of the role I expect Google to play in the experiment. For this experiment Google will provide the server, through Google Sites, which I have experimented with a little, but not to the extent that I will in the near future.

Moving on, since Google will provide the server, it is on the server that the applications will be built and made available to anyone with a browser. At first that will mean a regular browser, such as those found on the average desktop or laptop, but eventually it will also include mobile applications that run on any type of device, including cell phones, especially smart phones running Android. I am particularly interested in making the server applications portable, so that any person with any device can access the server and use it for whatever purposes they desire. This will be the learning curve of the server: to learn how people apply it, so that it can learn how humans work and work better with them.

Being built on open source software allows the programs to be modified to suit the applications that the users determine are appropriate. What interests me is not the cost of the software but the freedom to apply it in ways that may be novel. That the software may be free of cost is a bonus, and it effectively makes this possible, as I have no budget for this, only some extra time.

A major contributor to this project is MIT, specifically CSAIL. That is, I am using many resources from their labs to build the applications that will hopefully provide the services I wish to experiment with. This is just a nod to their work.

There are others, and I will try to indicate them and give them credit when I can. I will also build a bibliography at the end of this that will hopefully list all the sources. If I miss anyone and someone happens to notice, please send me a note with the information so that I can acknowledge them appropriately.

Well then, halfway down the first page and I have accomplished very little, so I am going to try to get into this now with more vigor.

I have been using computers for most of my adult life. As a child I grew up with the promise of computers and the potential they would have, and as yet they have not quite lived up to that promise, but they are closer. I built my first computer at about 12 or 13; I can't quite remember when, but I do remember that it came in a box and I had to put it together. It didn't do much. It really only lit a bunch of lights, and if I programmed the paper tape right I could get it to light them in a particular pattern. This was fascinating, but even then I knew it wasn't anything very useful. I built it on a breadboard, which made changing the circuits relatively easy, so it wasn't long before I was able to add the ability to output a new paper tape of processed data. This data was mathematical in nature: sums of the numbers I had input with an initial paper tape. This was something useful, but I could have made the calculations myself more easily, or done them on my adding machine more quickly, so I was impressed but not happy, yet. Along came the Sinclair. Although not the first computer available, it was one of the first that I had that actually did something, even if it was hardly usable. I graduated from there to the Commodore 64, and then to the Apple ][. Both of these were vast improvements over the computers I had been working with in the past. I kept updating my computers, sticking mostly with Apple, although I did have several versions of the IBM PC and even a PCjr, which was another useless toy, but they were what was available then. I was at the time happily programming along in different versions of BASIC, and eventually in Fortran, Pascal, and Assembly. All of this dates me a bit, but it gives some background on why I have continued to work with computers even with their apparent lack of ability to work with humans.
I felt that computers would eventually come to the point where they would learn to work with humans, or at least that is what the programmers and hardware manufacturers led me to believe. I learned to hack into the net early on, so I could communicate with others at the universities and with other hackers who were trying to do something more with their computers than add numbers, keep addresses or recipes, or make simple "if/then" types of games. The introduction of graphics in computers spurred us on, and we (the hackers) learned to manipulate the additional computing power to do even more wonderful things, including making games like tank commander. Whatever; the point is that computers have never really delivered on their promise, and to this day you, the user, must learn to use the computer, rather than the computer learning how you work or want to use it and adapting. This is the point of this experiment: is it time for the computer to learn to work with humans, or are we still learning how to work with computers? If we are still learning to work with computers, instead of computers learning to work with humans, well, then they have some way to go yet.

Speech synthesis has been around with computers for many years. It has always been a bit machine-like and very unnatural, but there just the same. My first Macintosh had this ability built in, but of course it was relatively useless, except that it was "way cool jr". It would announce the time, tell me if I had done something that the computer didn't accept, and a few other things, but as far as reading a page of text goes, any more than a paragraph would drive you nuts. It was slow and very metallic, so not much good, and certainly not natural. Well, that has come a long way, and today computers can assemble fragments of speech sounds into entire words and sentences, and read in voices that are getting much better. They can also use sampling to learn to speak from recorded samples and create almost lifelike speech, but as far as inflection and tonal qualities go, well, there is still some way to go.

Human speech is made up of different parts of sounds, and it flows in a stream, but is broken by breath and other nuances. We often hum our words, adding little bits that make the song more melodic, like ums and ahs. Although I find this irritating sometimes, it is part of speech, and even I do it sometimes, so I am not immune. Computers read the words and the punctuation, and apply rules that are not quite natural, mostly because humans are imperfect and don't follow the rules properly either. Humans are different from one another: one person might run on for a while, ignoring any breaks that other people might apply to the same set of words. Others are very staccato, breaking more often than another person would when speaking or reading the same words. This rhythmic speech has not yet been captured by the speech synthesis programs I have worked with. They are close in some ways, but detectable in that there are no inflections added. So far I have not worked with a computer speech program that sings, and humans sing their words; they don't actually speak them. We humans use time and meter, beat and rhythm, and on occasion we even use melody to speak to one another. This does not include nonverbal cues; this is just considering the verbal nuances that humans use to communicate with one another. Computers do not know words, letters, or anything like that; they only know circuits, and on and off. This limit has kept them from being able to vocalize, as they do not have the ability to decode natural speech. It is part of this experiment to teach the computer to sing.
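
To make the point about punctuation rules a little more concrete, here is a minimal sketch in Python of the kind of rule a synthesizer might apply when deciding where to pause. The function name and the pause lengths are my own assumptions for illustration; they are not taken from any particular speech program.

```python
import re

# Hypothetical pause lengths in milliseconds, chosen only for illustration.
PAUSE_MS = {",": 200, ";": 300, ":": 300, ".": 500, "?": 500, "!": 500}

def plan_pauses(text):
    """Split text into (chunk, pause_ms) pairs using punctuation alone."""
    plan = []
    # Each match is a run of words followed by an optional punctuation mark.
    for chunk, mark in re.findall(r"([^,;:.?!]+)([,;:.?!]?)", text):
        chunk = chunk.strip()
        if chunk:
            # No mark (end of text without punctuation) means no pause.
            plan.append((chunk, PAUSE_MS.get(mark, 0)))
    return plan

for chunk, pause in plan_pauses("Well, that has come a long way. Has it?"):
    print(f"{chunk!r} then pause {pause} ms")
```

Every speaker gets the same pause at the same comma under a rule like this, which is exactly the flat, unsung quality described above.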

So what if a computer can sing, if it can't hear what the humans are saying? You see, computers can take in input in many forms, and spoken words are an input, but what can a computer do with them? One thing it could do is turn them into text, as in dictation. This is relatively simple for a computer to do, but as yet it still does it only after a lot of work training the computer to listen to your voice properly. This again is because humans sing. A computer cannot compose music; it can record music, but it cannot at this time compose anything that it is not programmed to. When was the last time you heard your computer hum? I mean really out of the blue, say while it was processing something. Well, that is part of the problem: it requires so much processing that the computer cannot hum while it is working, because that would be more work. With multi-core processors you would think that a computer could do this while it was working, but that is not quite so: each core still only processes one thing at a time, whether there is one of them or a hundred. Maybe it could dedicate a single processor to humming, but that would just be programming, and not humming because it couldn't think of something else to do while it did what it was supposed to be doing. Humans don't multi-thread. We do not have multi-threaded processors, and we don't multi-task, despite what some people would like you to believe. We really only process things one at a time, but often we do so either so quickly that we can't put it out fast enough, or so slowly that we must do something else until we can fully process whatever takes too long. However, we humans are able to think about many things at once, or so it would seem. Really we are able to hold many thoughts in various states of process while we process other things, which gives the appearance of being able to do many things at once.
We can write (or type, in this case), think about the next word we are going to write, and listen to the TV, and maybe even to someone in the room who is talking, all seemingly at the same time, but do we really? The truth is we don't do it very well, and it is this distractible nature that makes it hard for us humans to complete a task on time. We cycle through thoughts and processes very quickly. Think about this: we are breathing, but do we think about breathing? Not really. It is an automatic process for the most part, yet we do attend to it, on a sub-process level. We don't think about the keys on the keyboard if we are typing the way we were taught to type. It is only this learned behavior that allows us to type fairly quickly, some more so than others. Do we think about the words we are typing? Well, yes, for the brief nanosecond that we need to; otherwise we wouldn't know the letters that make up the words, or the words that make up the sentences, and so forth. We think in concepts. I did not start out to write some words and hope that they would all come together in a cohesive document; instead I sat down with the intention of writing a document that at least attempted to understand the concept of computer-human interaction. These concepts are not some sort of plan, not a program, but a loose jumble of images, songs, words, and other parts of communication that will hopefully convey a complete thought to some other human, or at least some type of common understanding. The other human may perceive the document differently, but if the author did their job properly, the reader will get the gist of it. Computers don't get the gist of anything. Computers don't think; they process. They take in data and, using programs that have been written by humans, process that data and provide an output that is more or less anticipated.
The simulation of thinking can be programmed into a computer, but it will not think about something and produce a new result that was not more or less anticipated. Sometimes, due to the complexity of the code used to create the processing, and most often because of "dirty" programming, the output is not exactly what is expected, but that is not the same thing as thinking; it is usually an error. I guess it makes computers seem more human because they appear able to make mistakes, but they are not really making mistakes. They are processing the data as they were programmed to do; it just may be that the data provided to the program is not exactly what the program was written for, so the output it produces to match that input is wrong. That is an error, and it is not the same thing as making a mistake, or thinking. So computers do not think, they process, and that means they cannot talk; they can only repeat words that are programmed into them for the purpose of communicating with the human user, if the user inputs the proper data. Computers don't hear, they don't sing, and they certainly can't dance.
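
The earlier remark about dedicating a processor to humming can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name is made up, the "hum" is just a string, and whether the two activities truly run at the same instant depends on the hardware and the language runtime. The point it shows is that the hum happens only because somebody programmed it to.

```python
import threading

def work_while_humming(numbers):
    """Do some 'real' work while a background thread hums."""
    hums = []
    done = threading.Event()

    def hum():
        # Keep humming until the main work signals that it is finished.
        while True:
            hums.append("hmm")
            if done.wait(timeout=0.01):
                break

    hummer = threading.Thread(target=hum)
    hummer.start()
    total = sum(n * n for n in numbers)  # the work the computer was asked to do
    done.set()      # tell the hummer to stop
    hummer.join()   # wait for it to finish its last hum
    return total, hums

total, hums = work_while_humming([1, 2, 3])
print(total, len(hums))
```

The computer "hums" here, but nothing about it is spontaneous: it hums exactly when and how the program says, and stops on command.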

Well, everyone knows the limits of computers, and yet today you can talk to the computer almost as if it were human. What is up with that? Have you ever dealt with an IVR (interactive voice response) system? They provide the illusion of a computer listening to your spoken input and outputting useful information, including asking questions for clarification, quite often over and over, because people mumble and hum. So what else do you want the computer to do? I mean really, does it matter if the computer can think, if it can provide useful information when you ask for it? Well, no, not really. As long as it is able to do that, then that will do. Who cares if the computer can think? I certainly don't, and I do not expect that it will any time soon. What I am looking for is a partner, a computer that can help me think. I want a computer that can understand what I am thinking in concept and provide me with the data I am seeking without my providing the direct input for that data. I am looking for a computer to take in my spoken request and provide useful information that matches what I am thinking, and so far that has not been possible with the computers and programs I have used.
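
The "asking over and over" behavior of an IVR system can be sketched as a simple loop. This is a toy in Python under my own assumptions: the function name, the prompts, and the idea of passing in a list of already-recognized phrases are all hypothetical, and a real IVR sits on top of a speech recognizer that this sketch leaves out.

```python
def ivr_dialog(heard_inputs, known_requests, max_attempts=3):
    """Simulate an IVR exchange.

    heard_inputs: what the (imaginary) recognizer returned on each turn.
    known_requests: maps a phrase the system understands to its answer.
    Returns the list of things the system said.
    """
    said = []
    for heard in heard_inputs[:max_attempts]:
        key = heard.strip().lower()
        if key in known_requests:
            said.append(known_requests[key])
            return said
        # Anything not on the list, including mumbling and humming, is rejected.
        said.append("Sorry, I did not understand. Please repeat that.")
    said.append("Transferring you to an operator.")
    return said

answers = {"balance": "Your balance is available online."}
for line in ivr_dialog(["umm", "Balance "], answers):
    print(line)
```

The system only appears to listen. It matches what it heard against a fixed list and repeats itself when the match fails, which is why it is an illusion of understanding rather than the partner described above.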

The programs are there, thanks to the work of technical universities and colleges and of many dedicated individuals who have worked to make these types of programs available. Over the years I have worked with these programs, and many of them are very good at taking in data through speech recognition and outputting useful data through speech synthesis. It does take a bit of patience on the part of the user, and often a lot of training, but let us face the fact that even humans don't speak from the get-go; that is a learned skill.

For the longest time the hardware was really the hard part. The early hardware just could not process enough data fast enough to perform the operations needed to make it useful for human user interfaces. The hardware is almost available now, or rather it is if you are using the cloud, because the average user's own machine doesn't have the resources of the cloud. To make this possible the hardware not only has to process lots of data quickly, but also has to remember lots of data and call up that memory to connect the data to what the human is thinking. At this time, short of having a supercomputer in your basement, the only way to do this is using the cloud. This is why for this experiment I will be employing the Google cloud in the form of Google Sites and their various applications.

The software to do this is all over the place, from various open source projects, and a lot of it is built on the MIT software suite "Speech Builder", because that is what I have had exposure to, and I know a little better how it works.

However, for it to be a real human user interface, it must include a user interface, and namely a face, because that is how we humans communicate: face to face, mostly. Even when we write letters, we picture the person to whom we are writing, so it is still a face-to-face communication, sort of. Therefore the experiment incorporates an AI interface as well. What I mean by that, at least in this instance, is a bot: a face on the computer that looks like a person, so it seems that you are talking to a human, complete with inflections. Ideally it would be an entire body, but that would be weird, as we are looking at a screen and they look like they are looking back at the screen, so it should not be any more than head and shoulders, maybe.

*This is a work in progress... more to come.
