Data Science

How do you keep your sanity when working on a long-term project?

Projects (and your involvement in them) come in all shapes and sizes. Some are like a Michael Mann bank heist; you’re in and out, never to be heard of again. Others are like a relationship. You know you’re going to be spending most of your time with them and even when you’re not together, they’re never far from your thoughts.

5 min read

To accuracy… and beyond

In Data Science, when you learn how to predict the future with machine learning, you are presented with an optimisation problem. The central focus of machine learning, at least the subset called “supervised learning”, is to ask “what is the best way to predict Y using X?” where “best” has a specific meaning. When you have past examples to learn from, “best” is some measure of accuracy: how well did this particular model predict Y using X, when we can check it against past examples? Whether this is a classification task, where you’re trying to put things into the right category, or regression, where you are trying to get as close to a continuous number as you can, the higher the accuracy, the better your model. There are ways to guard against complacency, so that you’re not just learning the particulars of your dataset, but ultimately you are trying to get as high an accuracy score as possible.
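
To make the idea of checking a model against past examples concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic data, the 20% label noise, the 300/100 split and the helper names `accuracy` and `predict_1nn` do not come from the post. It uses a 1-nearest-neighbour classifier, a model that simply memorises its training data, to show why accuracy is measured on examples the model has not seen.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def predict_1nn(train_x, train_y, x):
    """Predict the label of the closest training example (1-nearest-neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

rng = random.Random(0)

# Hypothetical data: Y is 1 when X > 0.5, with roughly 20% of labels flipped as noise.
xs = [rng.random() for _ in range(400)]
ys = [int(x > 0.5) ^ int(rng.random() < 0.2) for x in xs]

# Hold back a quarter of the past examples to check the model against.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

train_acc = accuracy(train_y, [predict_1nn(train_x, train_y, x) for x in train_x])
test_acc = accuracy(test_y, [predict_1nn(train_x, train_y, x) for x in test_x])
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

Because the model has memorised the training set, noise included, its training accuracy looks perfect, while the held-out score gives a more honest picture of how it would do on new data.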

3 min read

30 years of change in engines

Engine design has changed a great deal since the Model T. Forced induction in the 1920s increased engine power without increasing engine size; fuel injection in the 1980s increased fuel efficiency and made carburettors a thing of the past; and more recently, the introduction of hybrid engines has massively reduced CO2 emissions.

5 min read

From Laptop to Cluster: Our Journey to Big Data

As the world accumulates data, more and more data science teams will be faced with the problem that their data no longer fits into their conventional workflow. Creating data-driven systems with hundreds of millions of rows of data requires a different set of tools to the ones statisticians, and even software developers, have been using in recent decades. How do we at Cox Automotive envisage moving from laptop-based analyses to big data systems?

5 min read

spark

The Taming of the Skew - Part One

If you’ve done many joins in Spark, you’ve probably encountered the dreaded Data Skew at some point. The initial elation at how quickly Spark is ploughing through your tasks (“Wow, Spark is so fast!”) is later followed by dismay when you realise it’s been stuck on 199/200 tasks complete for the last 5 hours.
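
A rough way to see why a single task can lag so far behind the rest is to simulate how rows are spread across partitions by join key. The sketch below is plain Python, not Spark: the key distribution is invented for illustration, `partition_sizes` is a hypothetical helper, and a simple modulo stands in for the hash partitioning a real shuffle would use.

```python
from collections import Counter

NUM_PARTITIONS = 200  # matches the 200 tasks in the example above

def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Count rows per partition when each row is sent to key mod num_partitions."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical join keys: 1,000 ordinary keys with 10 rows each,
# plus one "hot" key (7) that appears a further 100,000 times.
keys = [k for k in range(1000) for _ in range(10)] + [7] * 100_000

sizes = partition_sizes(keys)
typical = sorted(sizes)[len(sizes) // 2]  # median partition size
print(f"typical partition: {typical} rows, largest partition: {max(sizes)} rows")
```

Every row with the same key has to land in the same partition, so the one holding the hot key ends up thousands of times larger than the others. That is the task still running long after the other 199 have finished.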

6 min read

data skew

The Taming of the Skew - Part One

If you’ve done many joins in Spark, you’ve probably encountered the dreaded Data Skew at some point. The initial elation at how quickly Spark is ploughing through your tasks (“Wow, Spark is so fast!”) is later followed by dismay when you realise it’s been stuck on 199/200 tasks complete for the last 5 hours.

6 min read

joins

The Taming of the Skew - Part One

If you’ve done many joins in Spark, you’ve probably encountered the dreaded Data Skew at some point. The initial elation at how quickly Spark is ploughing through your tasks (“Wow, Spark is so fast!”) is later followed by dismay when you realise it’s been stuck on 199/200 tasks complete for the last 5 hours.

6 min read