Tesseract for Transcribing Messy Handwriting
Over the past month studying Data Science at Lambda School, I worked with four other data scientists and a group of web development students to improve an existing web application called Story Squad. Story Squad was developed to get children away from screens and into imagination mode more often. Each week, students complete a reading, a writing and a drawing task. The game then groups users with similar ability levels into teams that compete against another team of users. The cycle repeats on a weekly basis.
Currently, the game uses the Google Cloud Vision API (GCV API) to transcribe and evaluate the text and images that users submit. The problem with the GCV API is that it becomes expensive as the application scales. Story Squad is a small startup at the moment, so it is looking to reduce costs wherever possible. To that end, the stakeholders requested that the GCV API be replaced with tesseract. Tesseract is free software for transcribing text, so it would be more cost effective for the stakeholders.
One of the team's original concerns about bringing tesseract into the Story Squad project was that it has no equivalent of the safe search features in the GCV API. Those features are a critical component, as they identify violent and adult content in the photos users submit. Because the target audience is children, typically between 8–12 years old, we do not want explicit content on the platform. Another challenge the team identified early on was that tesseract may struggle to transcribe children’s handwriting, which can be messy and poorly aligned. One last concern was misspelled words in the children's writing, which could affect a future neural network meant to determine the creativity level of a story.
Expanding Tesseract's Capabilities
The first challenge the team faced was getting set up to work with tesseract locally. We had to install many dependencies before tesseract would run and train on data. To get through this, the team met frequently on Zoom calls so that everyone worked through the setup at the same pace. We also reached out to a former member of the Story Squad team who had worked with tesseract, and she was able to help us. To assist future teams, we created a Google Doc that documents the challenges we faced and how to overcome them. This should help onboard new team members quickly.
The next problem the team tried to solve was using tesseract to transcribe stories written by children. This was a hard task, as children’s handwriting is messy and difficult even for most humans to read. To begin, the stakeholders provided us with handwritten samples from students between third and fifth grade. Each handwritten story was accompanied by its actual transcription.
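Having the actual transcription next to each handwritten story makes it possible to score any transcription attempt. The sketch below shows one common way to do that, a character error rate based on edit distance. It is an illustration only: the post does not say which metric the team used, and the function names are my own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(predicted, reference) / len(reference)
```

A score of 0.0 means the prediction matches the reference exactly. Higher values mean more characters had to be changed.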
Before analyzing the pieces provided by students, we preprocessed each submitted image to even out its lighting. This preprocessing step is critical, as it makes the text stand out more on the page. The students' handwritten work was then cut into segments using box file editing. The team worked through each segment and typed out the matching transcription. These two components, the segments and their transcriptions, would be used to train tesseract to perform better on children's handwriting.
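For readers who have not seen a box file, here is a minimal sketch of reading one entry. It assumes the classic tesseract layout, where each line holds the text, then left, bottom, right and top pixel coordinates, then a page number, with the origin at the bottom-left corner of the image. The `Box` class and helper names are hypothetical, not part of tesseract or the Story Squad codebase.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Box:
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of the form '<text> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so that text which is itself a space survives.
    text, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(text, int(left), int(bottom), int(right), int(top), int(page))


def to_crop_box(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert to (left, upper, right, lower) measured from the top-left
    corner, the convention imaging libraries such as Pillow use for cropping."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)
```

The flip in `to_crop_box` matters because the box file measures y from the bottom of the page, while most image libraries measure it from the top.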
My main contribution to the project was incorporating a spell check feature using the Jamspell library. I selected Jamspell because it works well when a spelling error has contextual data around it to help determine which word the user was trying to write. To test Jamspell, I created a notebook that compared the correct and incorrect spellings of 450 words. I then converted the Google Colab notebook to a Jupyter notebook, which proved to be more of a challenge than I expected, as Jamspell is difficult to run locally. To run Jamspell, I first had to install swig, which connects programs written in C++ with Python. I had originally installed the most recent version of swig, version 4.0.2, but it seems that Jamspell only works with swig 3.0. Once I changed the swig version, Jamspell ran properly and was able to correct the spelling of words with an error rate of about 42%. To improve on this, Jamspell should be given complete sentences so that it has more context about the correct word.
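A comparison like the one in the notebook can be sketched as a small scoring function that takes any "fix this text" callable. The scoring code and the word pairs in the usage note are my own illustration, not the original notebook. The Jamspell part uses the calls from the library's README (`TSpellCorrector`, `LoadLangModel`, `FixFragment`), and the model path is an assumption you would replace with your own downloaded model file.

```python
from typing import Callable, Iterable, Tuple


def correction_accuracy(
    pairs: Iterable[Tuple[str, str]],
    fix: Callable[[str], str],
) -> float:
    """Fraction of (misspelled, expected) pairs that fix() corrects."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (misspelled, expected) pair")
    fixed = sum(
        1
        for misspelled, expected in pairs
        if fix(misspelled).strip().lower() == expected.strip().lower()
    )
    return fixed / len(pairs)


def load_jamspell_fixer(model_path: str) -> Callable[[str], str]:
    """Return Jamspell's FixFragment bound to a loaded language model.

    Needs the jamspell package (built with swig) and a model file on disk.
    model_path is supplied by the caller; no particular file is assumed.
    """
    import jamspell  # imported lazily so the scoring code runs without it

    corrector = jamspell.TSpellCorrector()
    corrector.LoadLangModel(model_path)
    return corrector.FixFragment
```

With Jamspell installed, usage might look like `correction_accuracy(pairs, load_jamspell_fixer("en.bin"))`, where `pairs` is a list such as `[("teh", "the"), ("freind", "friend")]`. Passing whole sentences instead of single words as the first element of each pair gives the corrector the context it needs.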
What's Next?
Currently, the Story Squad application works, and students are able to read and submit creative pieces. From a data science perspective, the project includes a partially trained tesseract model that will improve as it is trained on more handwriting provided by students. The team trained the model on a dataset of handwritten stories from third through fifth graders, provided by the stakeholders. We created transcriptions for the stories and used box file editing to break each story into segments paired with their transcribed text. Training on these box file edits improves how well tesseract transcribes children’s handwriting.
The data science team worked with the web development team to take in a list of student IDs and generate clusters of students, which can then be broken into pairs. This makes it possible to pair students with similar ability levels. Once the pairs are generated, the teams are sent back to the web development team so they can be displayed on the frontend.
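The post does not describe how the clusters are formed, so the following is only a rough sketch of the idea under a stated assumption: each student ID comes with a single numeric ability score (hypothetical here). Students are ranked by score, cut into fixed-size clusters, and each cluster is split into pairs of neighbours.

```python
from typing import Dict, List, Tuple


def cluster_by_score(scores: Dict[str, float], cluster_size: int = 4) -> List[List[str]]:
    """Rank student IDs by score and cut the ranking into fixed-size clusters."""
    if cluster_size < 2:
        raise ValueError("cluster_size must be at least 2")
    # Ties are broken by ID so the result is deterministic.
    ranked = sorted(scores, key=lambda student_id: (scores[student_id], student_id))
    return [ranked[i:i + cluster_size] for i in range(0, len(ranked), cluster_size)]


def pair_within_cluster(cluster: List[str]) -> List[Tuple[str, ...]]:
    """Split a cluster into pairs of neighbours.

    If the cluster has an odd number of students, the last one is
    returned alone as a one-element tuple.
    """
    return [tuple(cluster[i:i + 2]) for i in range(0, len(cluster), 2)]
```

Because neighbours in the ranking have the closest scores, each pair ends up with students of similar ability. How to handle a student left without a partner is a product decision this sketch leaves open.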
In the future, the creative pieces users submit should be stored and then used to improve the performance of the tesseract model. This may be challenging, as each story used to train the model needs a human to provide a correct transcription of the text, which is a time-consuming process.
I enjoyed working in a team setting, as we could iterate on ideas quickly and work through technical challenges together. It was also helpful to receive feedback from my peers when submitting pull requests and bringing new ideas to daily meetings. One piece of feedback that helped me improve was to keep the goals of the stakeholder as the top priority. This led me to explore as many replacement options for the GCV API as possible for all the features we were using. One feature that seems easily replaceable by an open source option is spell correction using Jamspell, which can now be implemented to save the stakeholders money. This project furthered my career goals by teaching me how to get onboarded quickly to a large codebase for a project that is already underway. This will be critical when joining the workforce, as many companies will already have made significant progress on their projects. Learning tesseract was also useful, as it is a highly desired skill for a data scientist.