
Helping H.A.N.D.S.

We are Electrical Engineering students who built and programmed a robotic hand to translate spoken language into ASL fingerspelling.

Helping H.A.N.D.S. is a project developed by a team of six third-year electrical engineering students. This innovative robotic hand serves as an efficient sign language interpreter, accurately translating spoken words into American Sign Language (ASL) fingerspelling in real time. With its portable design and reliance on open-source technology, Helping H.A.N.D.S. allows individuals with hearing impairments to communicate effectively and inclusively, bridging the gap between sign language users and those who rely on spoken language.

Demonstration

The Helping H.A.N.D.S. project is a complete end-to-end system for sign language interpretation. User input is received through a microphone connected to an ESP32 running Robot Operating System (ROS). The ESP32 transmits the microphone data over a local network to a nearby laptop, where a .wav file is generated. Next, the .wav file is sent to a voice detection script, which generates a .txt file. That .txt file is then sent back to the ESP32. Each character in the .txt file is read one at a time and looked up in a pre-programmed dictionary of servo positions. Finally, each character is signed by the robotic hand.
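As an illustrative sketch of the laptop-side hand-off, the snippet below receives raw 16-bit mono PCM and wraps it in a .wav file for the voice detection script. The project actually moves the audio over ROS on the local network, so the plain TCP socket, port number, sample rate, and file name here are assumptions made only for illustration.

    # Laptop-side sketch: collect raw PCM from the ESP32 and save it as a .wav file.
    # Transport, port, sample rate, and file names are illustrative assumptions.
    import socket
    import wave

    SAMPLE_RATE = 16000      # assumed sample rate (Hz)
    RECORD_SECONDS = 3       # the system captures three seconds at a time
    PORT = 5005              # hypothetical port for the incoming audio stream

    expected_bytes = SAMPLE_RATE * RECORD_SECONDS * 2   # 16-bit samples -> 2 bytes each

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("0.0.0.0", PORT))
        server.listen(1)
        conn, _ = server.accept()
        audio = bytearray()
        while len(audio) < expected_bytes:
            chunk = conn.recv(4096)
            if not chunk:
                break
            audio.extend(chunk)
        conn.close()

    # Wrap the raw samples in a .wav container for the speech recognition script.
    with wave.open("utterance.wav", "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(bytes(audio))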

Overview

The Helping H.A.N.D.S. project is based on the existing open-source Robot Nano Hand project. Building on the Robot Nano Hand gave us the mechanical design and assembly instructions, leaving us free to assemble the hand and focus on hardware and software development.

The system block diagram below shows how all the major project components interact.

The ESP32 interfaces with the microphone and controls the movement of the hand, while the more computationally expensive speech recognition takes place on a connected laptop.

The hand is controlled by a script running on ROS on the ESP32. Characters are fed into the script one at a time from the .txt file generated by the speech detection script. Each character is associated with a pre-calibrated position for each servo motor. These positions are read from the dictionary of hand positions and fed into a queue. Periodically the position of each servo is updated to move closer to the end position.

To generate a PWM signal for each servo, the 16-channel PCA9685 servo driver is used. The servo driver is supplied with power from an external power supply, capable of delivering up to 3 Amperes of current. Finally, changing the position of the servos alters the position of each digit on the hand. 
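As a minimal sketch of how the servo driver is addressed, the snippet below uses the Adafruit CircuitPython ServoKit library to set angles on two PCA9685 channels. The project drives the same chip from the ESP32 under ROS, so the library choice, channel assignments, and angles here are illustrative assumptions rather than the project's calibrated values.

    # Illustrative sketch of commanding the PCA9685 via adafruit-circuitpython-servokit.
    from adafruit_servokit import ServoKit

    # The PCA9685 exposes 16 PWM channels; ServoKit maps each to a servo object.
    kit = ServoKit(channels=16)

    # Hypothetical channel assignments: 0 curls the index finger, 1 moves it side to side.
    kit.servo[0].angle = 40   # partially curl the index finger
    kit.servo[1].angle = 90   # keep the side-to-side axis centered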

Microphone

An Adafruit I2S microphone breakout board is used to collect user input. Audio is detected by the microphone and sent to the ESP32, where the signal is then forwarded to the connected laptop. 

The microphone receives power directly from the ESP32. Currently, the system can record and process three seconds of audio at a time. Audio recording is started by pressing a button on the ESP32.
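A MicroPython-style sketch of the capture step is shown below, assuming the I2S microphone breakout is wired to the pins listed and the record button sits on GPIO 0. The actual firmware runs under ROS on the ESP32, so the pin numbers, buffer sizes, and sample format are assumptions.

    # MicroPython-style sketch of the three-second capture on button press.
    # Pin assignments and buffer sizes are assumptions, not the project's wiring.
    from machine import I2S, Pin

    SAMPLE_RATE = 16000
    RECORD_SECONDS = 3
    BYTES_PER_SAMPLE = 2                      # 16-bit mono
    TOTAL_BYTES = SAMPLE_RATE * RECORD_SECONDS * BYTES_PER_SAMPLE

    button = Pin(0, Pin.IN, Pin.PULL_UP)      # record button (active low)

    audio_in = I2S(
        0,
        sck=Pin(14), ws=Pin(15), sd=Pin(32),  # assumed I2S wiring
        mode=I2S.RX,
        bits=16,
        format=I2S.MONO,
        rate=SAMPLE_RATE,
        ibuf=20000,
    )

    while button.value():                     # wait for the button press
        pass

    recording = bytearray()
    chunk = bytearray(4096)
    while len(recording) < TOTAL_BYTES:
        n = audio_in.readinto(chunk)          # blocking read of raw samples
        recording.extend(chunk[:n])
    # 'recording' now holds about three seconds of PCM, ready to stream to the laptop.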

Speech Recognition

After the audio signal is received by the laptop, it is converted into a .wav file and sent to the speech recognition script. Our speech recognition software is implemented with the open-source Vosk speech recognition library. The Vosk library is compatible with over 20 languages and dialects and has a lightweight, offline version, which means the system does not require access to the internet to operate.
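The recognition step itself is only a few lines with the Vosk Python bindings. In the sketch below, the model directory and file names are placeholders: it assumes a small English model unpacked to ./model and the recorded clip saved as utterance.wav.

    # Sketch of the offline recognition step using the Vosk Python bindings.
    # Model path and file names are placeholder assumptions.
    import json
    import wave

    from vosk import KaldiRecognizer, Model

    model = Model("model")                    # lightweight offline English model
    wf = wave.open("utterance.wav", "rb")
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)              # feed audio chunks to the recognizer

    text = json.loads(rec.FinalResult())["text"]

    # Write the transcript out as the .txt file that is sent back to the ESP32.
    with open("utterance.txt", "w") as out:
        out.write(text)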

Word Processing and Hand Control

The word processing and hand control software interfaces with the audio detection software and the robotic hand. After audio detection is complete, a .txt file is passed back to the ESP32. On the ESP32, the .txt file is broken down into individual letters, which are mapped to servo motor positions. Each letter, represented as a set of motor positions, is fed into a queue, which is periodically polled by the motor control script. The script stores both the new target position and the previous position, then gradually increments each servo from the previous position toward the target to ensure smooth operation. Once the target position has been reached, the script pulls the hand position for the next letter from the queue.
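The sketch below shows the shape of this control flow: a dictionary of per-letter servo targets, a queue of letters, and a loop that steps each servo a few degrees at a time. The servo channels, angles, step size, and update period are placeholder assumptions; the real dictionary holds calibrated positions for every fingerspelled letter, and the real write goes to the PCA9685.

    # Simplified sketch of the letter-to-position queue and smoothing loop.
    # All angles, step sizes, and timings below are illustrative assumptions.
    import time
    from queue import Queue

    # Each letter maps to one target angle per servo channel (placeholder values).
    LETTER_POSITIONS = {
        "a": [40, 120, 120, 120, 120],
        "b": [150, 10, 10, 10, 10],
        # ... remaining fingerspelled letters
    }

    STEP = 2          # degrees moved per update
    PERIOD = 0.02     # seconds between servo updates

    letter_queue = Queue()
    for ch in "ab":                           # letters produced by the .txt file
        letter_queue.put(LETTER_POSITIONS[ch])

    current = [90] * 5                        # assume the hand starts at rest

    def set_servo(channel, angle):
        """Placeholder for the PCA9685 write performed by the real firmware."""
        print(f"servo {channel} -> {angle}")

    while not letter_queue.empty():
        target = letter_queue.get()
        # Walk each servo toward its target a few degrees at a time.
        while current != target:
            for i, goal in enumerate(target):
                if current[i] < goal:
                    current[i] = min(current[i] + STEP, goal)
                elif current[i] > goal:
                    current[i] = max(current[i] - STEP, goal)
                set_servo(i, current[i])
            time.sleep(PERIOD)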

Power Management Circuit Board

The power management circuit board is the main power supply for the ESP32 and the servo motors. Using a central printed circuit board (PCB) to power the system ensures the device will be portable and eliminates the need to draw power from a laptop or a lab bench power supply. The board receives a regulated 12 Volt DC input from...


Helping_HANDS_PCB.zip

Design files for the power management PCB developed for the Helping H.A.N.D.S. project.

x-zip-compressed - 21.15 kB - 05/25/2023 at 17:12


  • Future Work

    Nick Allison, 05/25/2023 at 18:15

    As mentioned in the High Level Design Decisions log, we identified two fundamental features for a sign language interpreter:

    • Two hands and arms, with full control of all degrees of freedom, to closely emulate human arms.
    • A robust voice detection system, which would be constantly detecting audio.

    We have implemented scaled-down versions of both features:

    • One hand, with two degrees of freedom in each digit and one degree of freedom in the wrist.
    • A voice detection system capable of capturing three seconds of audio.

    For our product to be useful as a sign-language interpreter, we would need to enhance each of the features that we developed.

    On the mechanical side, we would need to improve the design of our robotic hand. Currently, the robot consists of only one hand with two degrees of freedom in each digit and no motion in the wrist. Ideally, the robot would have two separate arms with full forearm control to form more complex signs. As well, the wrists of the robot would need to have full rotation and bending motion in all directions. The fingers of our robot can only move side-to-side and bend and extend. There is no motion in the knuckles of our hand. The robot would require an additional degree of freedom in the knuckles to accurately form more signs.

    Solving the mechanical limitations mentioned above would require our team to redesign the robotic hand completely. We used open-source files to 3D print our mechanical hand instead of completing the mechanical design ourselves, which would have proved difficult, time-consuming, and outside the scope of this project. Simply designing the mechanics of a robust robotic arm configuration could easily become a semester or year-long project, especially given our lack of mechanical design experience and training.

    For the voice detection system, we would need to be constantly detecting audio while the system is operational. Currently, we record audio for 3 seconds and then devote resources to processing and recognizing the audio. Our system takes about 5-10 seconds to process and recognize 3 seconds' worth of audio, so steps would have to be taken to speed up this process. As well, our system would need to record and detect audio simultaneously, which is currently not possible.

    Further, we could improve on the audio filtering already built into the microphone by using an array of two or more microphones to reduce background noise and focus solely on the closest speaker.

    Finally, a fully robust sign language interpreter robot would require a much larger vocabulary. Our current design only implements 24 letters of the alphabet. To improve our design, we would need to make significant changes to our robot’s vocabulary to accommodate full words, phrases, and grammar in ASL.

    If all the changes specified above could be made, the design could still be further improved by adding the capacity for detecting languages other than English and translating to sign languages other than ASL. 

  • Choosing a Speech Recognition Model

    Nick Allison, 05/25/2023 at 17:54

    During the development of our natural language processing (NLP) block, we researched and tested several different speech recognition libraries. We evaluated the libraries based on the following criteria:

    • compatibility with the programming languages used in our project
    • accuracy of speech recognition
    • open-source availability
    • price
    • offline/online availability
    • size and speed of the library

    We have researched the following libraries, listing both advantages and disadvantages of each one based on the factors mentioned above.

    Kaldi

    Kaldi is an open-source speech recognition toolkit written in C++, which works on Windows, macOS, and Linux. Kaldi's main advantage over other speech recognition software is that it is extensible and modular: the community provides many third-party modules. Kaldi also supports deep neural networks and offers excellent documentation on its website. While the code is mainly written in C++, it is "wrapped" by Bash and Python scripts. Kaldi also provides a pre-built Python engine with English-trained models.

    However, Kaldi has only a few open-source models available, most of which require a lot of disk space. Kaldi also has a long installation and build process, part of which requires creating configuration files for each transcription, drastically increasing the complexity of the NLP pipeline.

    DeepSpeech

    DeepSpeech is an open-source Speech-To-Text engine using a model trained by machine learning techniques based on the TensorFlow framework.

    DeepSpeech takes a stream of audio as input and converts that stream of audio into a sequence of characters in the designated alphabet. Two basic steps make this conversion possible: first, the audio is converted into a sequence of probabilities over characters in the alphabet. Second, this sequence of probabilities is converted into a sequence of characters. DeepSpeech provides an "end-to-end" model, simplifying the speech recognition pipeline into a single model.

    Unfortunately, DeepSpeech lags behind the other competitors on this list in documentation and features.

    Whisper

    Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Using such a large and diverse dataset improves robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages and translation from those languages into English.

    Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

    Unfortunately, Whisper models have long runtimes on embedded systems.

    Athena

    Athena is an end-to-end speech recognition engine that implements ASR. It is written in Python, licensed under the Apache 2.0 license, and built on top of TensorFlow. Athena supports unsupervised pre-training and multi-GPU training, either on a single machine or across multiple machines, and has large models available for both English and Chinese.

    However, Athena does not support offline speech recognition, which is a requirement for our project.

    Vosk

    Vosk is a speech recognition toolkit that supports 20+ languages and dialects. Vosk works offline, even on lightweight devices such as the Raspberry Pi, Android, and iOS. Its portable per-language models are only about 50 MB each, and it provides a streaming API as well as bindings for several programming languages. Vosk allows quick reconfiguration of the vocabulary for better accuracy and supports speaker identification in addition to simple speech recognition.

    Vosk also offers extensive documentation, a large feature set, and technical support through GitHub.

    We decided to use the Vosk speech recognition library...


  • High Level Design Decisions

    Nick Allison, 05/25/2023 at 17:46

    As a team, we considered many high-level design choices before building our current system. Building a sign language interpreter is an immensely complex task, which could not be perfected in the time we had. Operating under such tight time constraints, we realized early on that we would have to carefully choose which aspects of the project to complete and which to leave as future work. When deciding which areas of the project to focus on, we chose functionality that would serve as a strong foundation for further development. By focusing on basic functionality now, we will be able to return to the project later and easily improve upon our existing work.

    A robust sign language interpreter robot would have two key features:

    • Two hands and arms, with full control of all degrees of freedom to closely emulate human arms.
    • A robust voice detection system, which would be constantly detecting audio.

    It is worth noting that non-verbal communication is just as important in sign language as it is in spoken languages, if not more so. Facial expressions and body language are difficult to emulate accurately in robotics, requiring much more time, research, and resources than were available to our team.

    As the foundation of our project, we chose to design and implement scaled-down versions of both those features:

    • One hand, with two degrees of freedom in each digit and one degree of freedom in the wrist.
    • A voice detection system capable of capturing three seconds of audio.

    When deciding to implement those two features, we discussed many alternatives:

    • Visually displaying detected words on a screen.
    • Displaying a simulated hand or hands on a screen, which would sign the detected audio.
    • Using a web server or the command line for the user to type in words or letters to be signed by the robotic hand.
    • Using a web server with a list of available signs for the user to select words or letters to be signed by the robotic hand.

    Each alternative design only captures one of the two key features of a sign language interpreter described above. For this reason, we split our project into two parts – the robotic hand and the audio detection.

  • Hand and Enclosure 3D Printing

    Nick Allison, 05/25/2023 at 17:39

    The robotic hand and primary enclosure for the Helping H.A.N.D.S. project are both 3D printed. There is a 3D-printed base at the bottom of the hand, which contains the servo connectors and the servo driver. Above the enclosure is the 3D-printed hand. The palm of the hand contains the servos used to control the movement of the fingers and thumb. Overlaid on top of the servo motors is a piece that routes the tendons used to actuate the fingers. Finally, the fingers are connected to the top of the palm.

    All the 3D-printed parts used in our design are from the open-source Robot Nano Hand project. We printed the parts at our university's makerspaces and on personal 3D printers using PLA filament. The inner palm piece had to be reprinted due to poor print quality on the first attempt.

    Because the parts were printed on multiple printers, there were some scaling issues. The servo motors did not initially fit inside the palm of the hand, and the inner palm piece did not align with the servos in the hand and needed to be cut. Even when pieces were printed to the correct scale, they still needed to be sanded to ensure smooth operation. Our team spent a combined 30+ hours sanding parts for this project. If we were to redo this project, we would slightly scale up all the prints so the servo motors fit inside the palm, reducing the time spent sanding.

  • PCB Design

    Nick Allison, 05/25/2023 at 17:30

    The power management PCB was designed with EasyEDA and ordered from JLCPCB. Initially, we considered using Altium to design the PCB. However, the startup and learning time required to use Altium proved too long for the timeline of our project.

    We designed, ordered, and implemented one iteration of our PCB. If we had more time, we would have completed a second iteration. However, the tight timeline of the project did not allow for this.

    The purpose of the PCB is to provide regulated 12 Volt and 5 Volt outputs at a maximum of 3 Amperes. The servo driver and the ESP32 receive a 5 Volt input. The board also supplies a 12 Volt output for a previous component that is no longer included in the project. A regulated 12 Volt input is applied to the board through a barrel jack, supplied by a 12 Volt wall adapter that steps down and converts the wall power to DC. A buck converter on the PCB steps the 12 Volts down to 5 Volts. The PCB measures 57.8 x 48.5 mm and is implemented on two layers.

    As mentioned above, the PCB was ordered from JLCPCB for $21.71, of which $12.21 was shipping costs. Manufacturing of the PCB took two days, while shipping took seven days. Assembly took one day. Components for the PCB were either ordered from DigiKey or sourced from our university's makerspaces and various student teams.

    For PCB routing, we first used the auto-route feature from JLCPCB as a starting point, which we modified to eliminate as many 90-degree bends as possible. The trace width on our PCB is 1.5 mm, and the copper thickness is 1 oz/ft².

    Power Analysis

    Prior to designing the power management PCB, a power analysis had to be done to understand the requirements of the board. The PCB is used to power the ESP32 as well as the servos, both of which require 5 Volts; the board also supplies 12 Volts. The largest current sink in our system is the servo bank. To understand the current draw of the servo driver, we tested by actuating different numbers of servos at different speeds and noting the maximum current values. During nominal operation, not all the servos will be moving at once, and they will be moving at a slower, steady pace, although the servos will be under load from the hand. The estimated nominal current draw should not exceed 1 Ampere. During testing, the maximum current draw was 2 Amperes, with all servos actuated simultaneously at maximum speed.

    According to the above analysis, the PCB was designed for a system drawing up to 2 Amperes. To ensure our system is robust, a factor of safety (FoS) of 1.5 was included, meaning all of our components were selected with a maximum current rating of at least 3 Amperes.

    PCB Trace Width and Power Dissipation

    The trace width for the power management PCB has to be able to carry 3 Amperes. A trace width calculator was used to determine the width for a 1 oz/ft² copper thickness at a 3 Ampere capacity. The calculator gave a result of 1.37 mm; we used 1.5 mm in our design, giving an FoS of about 1.1.
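    As a cross-check, the 1.37 mm figure is consistent with the IPC-2221 external-trace formula, assuming (on our part) that the calculator used a 10 °C temperature rise and 1 oz/ft² copper, which is about 1.378 mil thick:

    \[ I = k\,\Delta T^{0.44} A^{0.725} \;\Rightarrow\; A = \left(\frac{I}{k\,\Delta T^{0.44}}\right)^{1/0.725} = \left(\frac{3}{0.048 \times 10^{0.44}}\right)^{1/0.725} \approx 74\ \mathrm{mil}^2 \]
    \[ w = \frac{A}{t} \approx \frac{74\ \mathrm{mil}^2}{1.378\ \mathrm{mil}} \approx 54\ \mathrm{mil} \approx 1.37\ \mathrm{mm} \]

    Here k = 0.048 applies to external traces, ΔT is the allowed temperature rise, and A is the trace cross-sectional area.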

    Power dissipation was also a factor in designing the PCB. The buck converter steps down 12 Volts to 5 Volts at a maximum nominal current of 2 Amperes. Therefore, the maximum nominal power dissipation is given by the equation below:
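    The exact figure depends on the converter's efficiency, which we treat here as an assumption. As a hedged sketch, a conservative upper bound treats the entire 12 V to 5 V drop as heat, while an efficiency-based estimate (assuming roughly 90% efficiency) is far lower:

    \[ P_{diss} \le (V_{in} - V_{out})\,I_{max} = (12\ \mathrm{V} - 5\ \mathrm{V}) \times 2\ \mathrm{A} = 14\ \mathrm{W} \]
    \[ P_{diss} \approx \left(\tfrac{1}{\eta} - 1\right) V_{out}\,I_{max} \approx 0.11 \times 5\ \mathrm{V} \times 2\ \mathrm{A} \approx 1.1\ \mathrm{W} \quad (\eta \approx 0.9) \]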

    We chose a buck converter with thermal shutdown and current limit protection to dissipate this amount of power safely.

    As well, the PCB has one via. A via calculator was used to verify the voltage drop and power loss in the via. Our via has a diameter of 0.8 mm, with a length of 1.6 mm (the PCB's thickness). With a plating thickness of 1 oz/ft², the voltage drop is 0.000993 Volts, and the power loss is 0.00298 Watts.


Discussions

Devin Atkin wrote 05/24/2023 at 14:38:

Looking forward to seeing more of this project :)

