https://DevOpsCloud.io -- Cloud Monk Losang Jinpa, Ph.D., MCSE/MCT, GitOps DevOps Engineer

Data Science at the Command Line - Obtain, Scrub, Explore, and Model Data with Unix Power Tools, 2nd Edition, by Jeroen Janssens


Fair Use Source: B09CX29SDJ (DatSciCmdLn 2022)

Book Summary

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 80 tools, useful whether you work with Windows, macOS, or Linux.

You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, and engineers; software and machine learning engineers; and system administrators.

Obtain data from websites, APIs, databases, and spreadsheets
Perform scrub operations on text, CSV, HTML, XML, and JSON files
Explore data, compute descriptive statistics, and create visualizations
Manage your data science workflow
Create reusable command-line tools from one-liners and existing Python or R code
Parallelize and distribute data-intensive pipelines
Model data with dimensionality reduction, clustering, regression, and classification algorithms

Who This Book Is For

This book makes just one assumption about you: that you work with data. It doesn’t matter which programming language or statistical computing environment you’re currently using. The book explains all the necessary concepts from the beginning.

It also doesn’t matter whether your operating system is Microsoft Windows, macOS, or some flavor of Linux. The book comes with a Docker image, which is an easy-to-install virtual environment. It allows you to run the command-line tools and follow along with the code examples in the same environment in which this book was written. You don’t have to waste time figuring out how to install all the command-line tools and their dependencies.

The book contains some code in Bash, Python, and R, so it’s helpful if you have some programming experience, but it’s by no means required to follow along with the examples.

Reviews

“Traditional computer and data science curricula all too often mistake the command line as an obsolete relic instead of teaching it as the modern and vital toolset that it is. Only well into my career did I come to grasp the elegance and power of the command line for easily exploring messy datasets and even creating reproducible data pipelines for work. The first edition of Data Science at the Command Line was one of the most comprehensive and clear references when I was a novice in the art, and now with the second edition, I'm again learning new tools and applications from it.” — Dan Nguyen, data scientist, former news application developer at ProPublica, and former Lorry I. Lokey Visiting Professor in Professional Journalism at Stanford University

“Despite what you may hear, most practical data science is still focused on interesting visualizations and insights derived from flat files. Jeroen's book leans into this reality, and helps reduce complexity for data practitioners by showing how time-tested command-line tools can be repurposed for data science.” — Paige Bailey, principal product manager code intelligence at Microsoft, GitHub

“It's amazing how fast so much data work can be performed at the command line before ever pulling the data into R, Python, or a database. Older technologies like sed and awk are still incredibly powerful and versatile. Until I read Data Science at the Command Line, I had only heard of these tools but never saw their full power. Thanks to Jeroen, it's like I now have a secret weapon for working with large data.” — Jared Lander, chief data scientist at Lander Analytics, organizer of the New York Open Statistical Programming Meetup, and author of R for Everyone

“The command line is an essential tool in every data scientist's toolbox, and knowing it well makes it easy to translate questions you have of your data to real-time insights. Jeroen not only explains the basic Unix philosophy of how to chain together single-purpose tools to arrive at simple solutions for complex problems, but also introduces new command-line tools for data cleaning, analysis, visualization, and modeling.” — Jake Hofman, senior principal researcher at Microsoft Research, and adjunct assistant professor in the department of applied mathematics at Columbia University

“The Unix philosophy of simple tools, each doing one job well, then cleverly piped together, is embodied by the command line. Jeroen expertly discusses how to bring that philosophy into your work in data science, illustrating how the command line is not only the world of file input/output, but also the world of data manipulation, exploration, and even modeling.” — Chris H. Wiggins, associate professor in the department of applied physics and applied mathematics at Columbia University, and chief data scientist at The New York Times

About the Author

Jeroen Janssens teaches data science, often through training and coaching, occasionally through speaking, and infrequently through writing. His interests include visualizing data, building machine learning models, and automating things using Python, R, or Bash. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and various startups in New York City. Currently, Jeroen is the CEO of Data Science Workshops, which organizes open enrollment workshops, in-company courses, inspiration sessions, hackathons, and meetups, all related to data science, of course. He lives with his wife and two kids in Rotterdam, the Netherlands.