https://DevOpsCloud.io -- Cloud Monk Losang Jinpa

Natural Language Processing with Spark NLP: Preface

“Natural language processing (NLP) is a field of study concerned with processing language data. We will be focusing on text, but natural language audio data is also a part of NLP. Dealing with natural language text data is difficult. The reason it is difficult is that it relies on three fields of study: linguistics, software engineering, and machine learning. It is hard to find the expertise in all three for most NLP-based projects. Fortunately, you don’t need to be a world-class expert in all three fields to make informed decisions about your application. As long as you know some basics, you can use libraries built by experts to accomplish your goals. Consider the advances made in creating efficient algorithms for vector and matrix operations. If the common linear algebra libraries that deep learning libraries use were not available, imagine how much harder it would have been for the deep learning revolution to begin. Even though these libraries mean that we don’t need to implement cache-aware matrix multiplication for every new project, we still need to understand the basics of linear algebra and the basics of how the operations are implemented to make the best use of these libraries. I believe the situation is becoming the same for NLP and NLP libraries.” (NLPwSprk 2020)

“Applications that use natural language (text, spoken, and gestural) will always be different from other applications due to the data they use. The benefit and draw to these applications is how much data is out there. Humans are producing and churning natural language data all the time. The difficult aspects are that people are literally evolved to detect mistakes in natural language use, and the data (text, images, audio, and video) is not made with computers in mind. These difficulties can be overcome through a combination of linguistics, software engineering, and machine learning.” (NLPwSprk 2020)

“This book deals with text data. This is the easiest of the data types that natural language comes in, because our computers were designed with text in mind. That being said, we still want to consider a lot of small and large details that are not obvious.” (NLPwSprk 2020)

“A few years ago, I was working on a tutorial for O'Reilly. This tutorial was about building NLP pipelines on Apache Spark. At the time, Apache Spark 2.0 was still relatively new, but I was mainly using version 1.6. I thought it would be cool to build an annotation library using the new DataFrames and pipelines; alas, I was not able to implement this for the tutorial. However, I talked about this with my friend (and tutorial copresenter) David Talby, and we created a design doc. I didn’t have enough time to work on building the library, so I consulted Saif Addin, whom David had hired to work on the project. As the project grew and developed, David, Claudiu Branzan (another friend and colleague), and I began presenting tutorials at conferences and meetups. It seemed like there was an interest in learning more about the library and an interest in learning more about NLP in general.” (NLPwSprk 2020)

“People who know me know I am rant-prone, and few topics are as likely to get me started as NLP and how it is used and misused in the technology industry. I think this is because of my background. Growing up, I studied linguistics as a hobby — an all-consuming hobby. When I went to university, even though I focused on mathematics, I also took linguistics courses. Shortly before graduating, I decided that I also wanted to learn computer science, so I could take the theoretical concepts I had learned and create something. Once I began in the industry, I learned that I could combine these three interests into one: NLP. This gives me a rare view of NLP because I studied its components first individually and then combined.” (NLPwSprk 2020)

“I am really excited to be working on this book, and I hope this book helps you in building your next NLP application!” (NLPwSprk 2020)

“An important part of the library is the idea that people should build their own models. There is no one-size-fits-all method in NLP. If you want to build a successful NLP application, you need to understand your data as well as your product. Prebuilt models are useful for initial versions, demos, and tutorials. This means that if you want to use Spark NLP successfully, you will need to understand how NLP works. So in this book we will cover more than just the Spark NLP API. We will talk about how to use Spark NLP, but we will also talk about how NLP and deep learning work. When you combine an understanding of NLP with a library that is built with the intent of customization, you will be able to build NLP applications that achieve your goals.” (NLPwSprk 2020)
