Welcome to Data Science: An Introduction.

I'm Barton Poulson, and what we are going to do in this course is have

a brief, accessible and non-technical overview of the field of Data Science. Now, some people

when they hear "Data Science," start thinking things like this: "Data," and they think about piles of

equations and numbers; and then throw "Science" on top of that, and they think about people working

in their lab. And they start to say, "Eh, that's not for me. I'm not really a technical person,

and that just seems much too techy." Well, here's the important thing to know. While

a lot of people get really fired up about the technical aspects of Data Science the

important thing is that Data Science is not so much a technical discipline as a creative one.

And, really, that's true. The reason I say that is because in Data Science you use tools

that come from coding and statistics and from math. But you use those to work creatively

with data.

The idea is there's always more than one way to solve a problem or answer

a question, but most importantly, to get insight. Because the goal, no matter how you go about

it, is to get insight from your data. And what makes Data Science unique, compared to

so many other things, is that you try to listen to all of your data, even when it doesn't

fit in easily with your standard approaches and paradigms. You're trying to be much more

inclusive in your analysis, and the reason you want to do that is because everything

signifies. Everything carries meaning, and everything can give you additional understanding

and insight into what's going on around you. And so in this course, what we are trying to

do is give you a map to the field of Data Science and how you can use it. And so now

you have the map in your hands, and you can get ready to get going with Data Science.

Welcome back to Data Science: An Introduction.

And we're going to begin this course by defining

data science. That makes sense. But, we are going to be doing it in kind of a funny way.

The first thing I am going to talk about is the demand for data science. So, let's take

a quick look. Now, data science can be defined in a few ways. I am going to give you some

short definitions. Take one on a definition is that data science is coding, math, and

statistics in applied settings. That's a reasonable working definition. But, if you want to be

a little more concise, I've got take two on a definition: that data science is the analysis

of diverse data, or data that you didn't think would fit into standard analytic approaches.

A third way to think about it is that data science is inclusive analysis.

It includes

all of the data, all of the information that you have, in order to get the most insightful

and compelling answer to your research questions. Now, you may say to yourself, "Wait… that's

it?" Well, if you're not impressed, let me show you a few things. First off, let's take

a look at this article. It says, "Data Scientist: the Sexiest Job of the 21st Century." And

please note, this is coming from Harvard Business Review. So, this is an authoritative source

and it is the official source of this saying: that data science is sexy! Now, again, you

may be saying to yourself, "Sexy? I hardly think so." Oh yeah, it's sexy. And the reason

data science is sexy is because first, it has rare qualities, and second it has high

demand. Let me say a little more about those.

The rare qualities are that data science takes

unstructured data, then finds order, meaning, and value in the data. Those are important,

but they're not easy to come across. Second, high demand. Well, the reason it's in high

demand is because data science provides insight into what's going on around you and critically,

it provides competitive advantage, which is a huge thing in business settings. Now, let

me go back and say a little more about demand.

Let's take a look at a few other sources.

So, for instance the McKinsey Global Institute published a very well-known paper, and you

can get it with this URL. And if you go to that webpage, this is what's going to come

up. And we're going to take a quick look at this one, the executive summary. It's a PDF

that you can download. And if you open that up, you will find this page. And let's take

a look at the bottom right corner. Two numbers here, I'm going to zoom in on those. The first

one is that they are projecting a need in the next few years for somewhere between 140,000 and

190,000 deep analytical talent positions. So, this means actual practicing data scientists.

That's a huge number, but almost ten times as high is the 1.5 million more data-savvy managers who

will be needed to take full advantage of big data in the United States. Now, that's people

who aren't necessarily doing the analysis but have to understand it, who have to speak

data. And that's one of the main purposes of this particular course, is to help people

who may or may not be the practicing data scientists learn to understand what they can

get out of data, and some of the methods used to get there.

Let's take a look at another

article from LinkedIn. Here is a shortcut URL for it and that will bring you to this

webpage: "The 25 hottest job skills that got people hired in 2014." And take a look at

number one here: statistical analysis and data mining, very closely related to data

science. And just to be clear, this was number one in Australia, and Brazil, and Canada,

and France, and India, and the Netherlands, and South Africa, and the United Arab Emirates,

and the United Kingdom.

Everywhere. And if you need a little more, let's take a look

at Glassdoor, which published an article this year, 2016, and it's about the "25 Best Jobs

in America." And look at number one right here, it's data scientist. And we can zoom

on this information. It says there are going to be 1,700 job openings, with a median base

salary of over $116,000, and fabulous career opportunities and job scores. So, if you want

to take all of this together, the conclusion you can reach is that data science pays. And

I can show you a little more about that. So for instance, here's a list of the top ten

highest paying salaries that I got from US News. We have physicians (or doctors), dentists,

and lawyers, and so on. Now, if we add data scientist to this list, using data from O'Reilly.com,

we have to push things around. And it goes in third with an average total salary (not the

base we had in the other one, but the total compensation) of about $144,000 a year.

That's

extraordinary. So in sum, what do we get from all this? First off, we learn that there is

a very high demand for data science. Second, we learn that there is a critical need for

both specialists (those are the sort of practicing data scientists) and generalists, the

people who speak the language and know what can be done. And of course, excellent pay.

And all together, this makes Data Science a compelling career alternative and a way

of making you better at whatever you are doing.

Back here in data science, we're going to

continue our attempt to define data science by looking at something that's really well

known in the field: the Data Science Venn Diagram.

Now if you want to, you can think

of this in terms of, "What are the ingredients of data science?" Well, we're going to first

say thanks to Drew Conway, the guy who came up with this. And if you want to see the original

article, you can go to this address. But, what Drew said is that data science is made

of three things. And we can put them as overlapping circles because it is the intersection that's

important. Here on the top left is coding or computer programming, or as he calls it:

hacking. On the top right is stats, or mathematics, or quantitative abilities

in general. And on the bottom is domain expertise, or intimate familiarity with a particular

field of practice: business, or health, or education, or science, or something like that.

And the intersection here in the middle, that is data science.

So it's the combination of

coding and statistics and math and domain knowledge. Now, let's say a little more about

coding. The reason coding is important is because it helps you gather and prepare the

data, because a lot of the data comes from novel sources, is not necessarily ready

for you to gather, and can be in very unusual formats. And so coding is important because

it can require some real creativity to get the data from the sources and put it into your

analysis. Now, a few kinds of coding are important; for instance, there is statistical

coding. A couple of major languages in this are R and Python, two free, open-source programming

languages. R is specifically for data. Python is general-purpose, but well adapted to data.

The ability to work with databases is important too, because that's where the data is. The most

common language there is SQL, usually pronounced "Sequel," which stands for Structured Query

Language. Also, there is the command line interface, or if you are on a Mac, people

just call it "the terminal." The most common language there is Bash, which actually stands for Bourne-again

shell.
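To make the database point a little more concrete, here is a minimal sketch of pulling summarized data out of a database with SQL, run from Python using the standard-library sqlite3 module. The table name, column names, and numbers are made up for illustration; a real project would connect to whatever database actually holds the data.

```python
import sqlite3

# Hypothetical example: an in-memory database with a made-up "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 50.0)],
)

# The SQL itself: total amount per region, largest total first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
conn.close()

print(rows)
```

The query is the part that carries over to other database systems; the surrounding Python is only there to set up something to query.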

And then searching is important, and that means regex, or regular expressions. While there

is not a huge amount to learn there (it's a small little field), it's sort of like super-powered

wildcard searching that makes it possible for you to both find the data and reformat

it in ways that are going to be helpful for your analysis. Now, let's say a few things

about the math. You're going to need things like a little bit of probability, some algebra,

of course, and regression (a very common statistical procedure). Those things are important. And

the reason you need the math is that it is going to help you choose the appropriate

procedures to answer the question with the data that you have. And probably even more

importantly, it is going to help you diagnose problems when things don't go as expected.
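As a small illustration of why the mechanics matter, here is a sketch of a one-predictor regression worked out by hand in Python, followed by the residuals you would look at to diagnose a poor fit. The data values are invented for the example, and ordinary least squares is just one assumed choice of procedure.

```python
from statistics import mean

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Ordinary least squares with one predictor:
# slope = sum of cross-deviations / sum of squared x-deviations.
slope = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
intercept = y_bar - slope * x_bar

# Residuals: what the line misses. Large or patterned residuals
# are a sign that the model is not describing the data well.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(slope, intercept)
print(residuals)
```

Knowing that the residuals of a least-squares line should sum to roughly zero is the kind of mechanical knowledge that helps you notice when something has gone wrong.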

And given that you are trying to do new things with new data in new ways, you are probably

going to come across problems.

So the ability to understand the mechanics of what is going

on is going to give you a big advantage. And the third element of the data science Venn

Diagram is some sort of domain expertise. Think of it as expertise in the field that

you're in. Business settings are common. You need to know about the goals of that field,

the methods that are used, and the constraints that people come across. And it's important

because whatever your results are, you need to be able to implement them well. Data science

is very practical and is designed to accomplish something. And your familiarity with a particular

field of practice is going to make it that much easier and more impactful when you implement

the results of your analysis. Now, let's go back to our Venn Diagram here just for a moment.

Because this is a Venn, we also have these intersections of two circles at a time.

At

the top is machine learning. At the bottom right is traditional research. And on the

bottom left hand is what Drew Conway called, "the danger zone." Let me talk about each

of these. First off, machine learning, or ML. Now, you think about machine learning

and the idea here is that it represents coding, or statistical programming and mathematics,

without any real domain expertise. Sometimes these are referred to as "black box" models.

They kind of throw data in and you don't even necessarily have to know what it means or

what language it is in, and it will just kind of crunch through it all and it will give

you some regularities. That can be very helpful, but machine learning is considered slightly

different from data science because it doesn't involve the particular applications in a specific

domain.

Also, there's traditional research. This is where you have math or statistics

and you have domain knowledge; often very intensive domain knowledge but without the

coding or programming. Now, you can get away with that because the data that you use in

traditional research is highly structured. It comes in rows and columns, and is typically

complete and is typically ready for analysis. Doesn't mean your life is easy, because now

you have to expend an enormous amount of effort in the methods and the designing of the project

and the interpretation of the data. So, still very heavy intellectual cognitive work, but

it comes from a different place. And then finally, there is what Conway called, "the

danger zone." And that's the intersection of coding and domain knowledge, but without

math or statistics. Now he says it is unlikely to happen, and that is probably true. On the

other hand, I can think of some common examples, what are called "word counts," where you take

a large document or a series of documents, and you count how many times a word appears

in there.
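A word count of that kind can be sketched in a few lines of Python. This version assumes a very simple notion of a "word" (runs of lowercase letters and apostrophes, found with a regular expression), and the sample text is made up; real documents would need more careful handling.

```python
import re
from collections import Counter

# Made-up sample text; in practice this would be read from a file.
text = "The cat sat on the mat. The mat was flat."

# Lowercase everything, then use a regular expression to pull out
# runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", text.lower())

# Count how many times each word appears.
counts = Counter(words)

print(counts.most_common(3))
```

Notice that there is coding here, and you need to know what the documents are about to interpret the counts, but there is no statistics involved.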

That can actually tell you some very important things. And also, drawing maps

and showing how things change across place and maybe even across time. You don't necessarily

have to have the math, but it can be very insightful and helpful. So, let's think about

a couple of backgrounds that people come from here. First is coding. You can have

people who are coders, who can do math, stats, and business. So, you get the three things

(and this is probably the most common); most of the people come from a programming background.

On the other hand, there is also stats, or statistics.

And you can get statisticians

who can code and who also can do business. That's less common, but it does happen. And

finally, there are people who come into data science from a particular domain. And these

are, for instance, business people who can code and do numbers. And they are the least

common. But, all of these are important to data science. And so in sum, here is what

we can take away. First, several fields make up Data Science. Second, diverse skills and

backgrounds are important and they are needed in data science. And third, there are many

roles involved because there are a lot of different things that need to happen. We'll

say more about that in our next movie.

The next step in our data science introduction

and our definition of data science is to talk about the Data Science Pathway. So I like

to think of this as, when you are working on a major project, you have got to do one

step at a time to get it from here to there.

In data science, you can take the various

steps and you can put them into a couple of general categories. First, there are the steps

that involve planning. Second, there's the data prep. Third, there's the actual modeling

of the data. And fourth, there's the follow-up. And there are several steps within each of

these; I'll explain each of them briefly. First, let's talk about planning. The first

thing that you need to do, is you need to define the goals of your project so you know

how to use your resources well, and also so you know when you are done.

Second, you need

to organize your resources. So you might have data from several different sources; you might

have different software packages, you might have different people. Which gets us to the

third one: you need to coordinate the people so they can work together productively. If

you are doing a hand-off, it needs to be clear who is going to do what and how their work

is going to go together. And then, really to state the obvious, you need to schedule

the project so things can move along smoothly and you can finish in a reasonable amount

of time. Next is the data prep, which is like food prep: you are getting the

raw ingredients ready.

First, of course, you need to get the data. And it can come from

many different sources and be in many different formats. You need to clean the data and, the

sad thing is, this tends to be a very large part of any data science project. And that

is because you are bringing in unusual data from a lot of different places. You also want

to explore the data; that is, really see what it looks like, how many people are in each

group, what the shape of the distributions are like, what is associated with what.
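Those three exploration questions can be sketched with nothing but the Python standard library. The records below are invented for illustration, and the Pearson correlation is just one assumed way of asking "what is associated with what."

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records: (group, age, income in thousands).
records = [
    ("a", 23, 31.0), ("a", 35, 48.0), ("b", 41, 52.0),
    ("b", 29, 39.0), ("b", 52, 70.0), ("a", 47, 61.0),
]

# How many people are in each group?
group_sizes = Counter(group for group, _, _ in records)

# What does the distribution look like? (center and spread of income)
ages = [age for _, age, _ in records]
incomes = [income for _, _, income in records]
summary = {
    "mean": mean(incomes),
    "median": median(incomes),
    "stdev": stdev(incomes),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

# What is associated with what? (age versus income)
r = pearson(ages, incomes)

print(group_sizes, summary, r)
```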

And

you may need to refine the data. And that means choosing variables to include, choosing

cases to include or exclude, making any transformations to the data you need to do. And of course

these steps kind of can bounce back and forth from one to the other. The third group is

modeling or statistical modeling. This is where you actually want to create the statistical

model. So for instance, you might do a regression analysis or you might do a neural network.

But, whatever you do, once you create your model, you have to validate the model.
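One common way to validate is to hold some of the data back, fit the model on the rest, and then see how well it predicts the rows it never saw. Here is a minimal sketch of that idea. The data is synthetic, and the 80/20 split, the straight-line model, and the mean absolute error are all assumptions chosen to keep the example short.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data for illustration: y is roughly 2x + 1 plus noise.
data = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in range(50)]
random.shuffle(data)

# Hold out 20% of the rows; the model never sees them while fitting.
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

def fit_line(points):
    """Ordinary least squares for a single predictor."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in points)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(train)

# Validation: average absolute prediction error on the held-out rows.
holdout_mae = mean(abs(y - (intercept + slope * x)) for x, y in holdout)

print(slope, intercept, holdout_mae)
```

If the error on the held-out rows is much worse than on the training rows, that is a warning that the model will not generalize.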

You

might do that with a holdout validation. You might even do it with a very small replication

if you can. You also need to evaluate the model. So, once you know that the model is

accurate, what does it actually mean and how much does it tell you? And then finally, you

need to refine the model. So, for instance, there may be variables you want to throw out;

maybe additional ones you want to include. You may want to, again, transform some of

the data. You may want to get it so it is easier to interpret and apply. And that gets

us to the last part of the data science pathway. And that's follow up. And once you have created

your model, you need to present the model. Because it is usually work that is being done

for a client, could be in house, could be a third party. But you need to take the insights

that you got and share them in a meaningful way with other people.

You also need to deploy

the model; it is usually being done in order to accomplish something. So, for instance,

if you are working with an e-commerce site, you may be developing a recommendation engine

that says, "people who bought this and this might buy this." You need to actually stick

it on the website and see if it works the way that you expected it to. Then you need

to revisit the model because a lot of the time, the data that you worked on is not

necessarily all of the data, and things can change when you get out in the real world

or things just change over time.

So, you have to see how well your model is working. And

then, just to be thorough, you need to archive the assets, document what you have, and make

it possible for you or for others to repeat the analysis or develop off of it in the future.

So, those are the general steps of what I consider the data science pathway. And in

sum, what we get from this is three things. First, data science isn't just a technical

field, it is not just coding. Things like, planning and presenting and implementing are

just as important. Also, contextual skills, knowing how it works in a particular field,

knowing how it will be implemented, those skills matter as well. And then, as you got

from this whole thing, there are a lot of things to do. And if you go one step at a

time, there will be less backtracking and you will ultimately be more productive in

your data science projects.

We'll continue our definition of data science by looking

at the roles that are involved in data science. The way that different people can contribute

to it. That's because it tends to be a collaborative thing, and it's nice to be able to say that

we are all together, working together towards a single goal. So, let's talk about some of

the roles involved in data science and how they contribute to the projects. First off,

let's take a look at engineers. These are people who focus on the back end hardware.

For instance, the servers and the software that runs them. This is what makes data science

possible, and it includes people like developers, software developers, or database administrators.

And they provide the foundation for the rest of the work.

Next, you can also have people

who are Big Data specialists. These are people who focus on computer science and mathematics,

and they may do machine learning algorithms as a way of processing very large amounts

of data. And they often create what are called data products. So, a thing that tells you

what restaurant to go to, or that says, "you might know these friends," or provides ways

of linking up photos. Those are data products, and those often involve a huge amount of very

technical work behind them. There are also researchers; these are people who focus on

domain-specific research. So, for instance, physics, or genetics, or whatever. And these

people tend to have very strong statistics, and they can use some of the procedures and

some of the data that comes from the other people like the big data researchers, but

they focus on the specific questions.

Also in the data science realm, you will find analysts.

These are people who focus on the day-to-day tasks of running a business. So for instance,

they might do web analytics (like Google analytics), or they might pull data from a SQL database.

And this information is very important and good for business. So, analysts are key to

the day-to-day function of business, but they may not exactly be data science proper,

because most of the data they are working with is going to be pretty structured. Nevertheless,

they play a critical role in business in general. And then, speaking of business. You have the

actual business people; the men and women who organize and run businesses. These people

need to be able to frame business-relevant questions that can be answered with the data.

Also, the business person manages the project and the efforts and the resources of others.

And while they may not actually be doing the coding, they must speak data; they must know

how the data works, what it can answer, and how to implement it. You can also have entrepreneurs.

So, you might have a data startup; they are starting their own little social network,

their own little web search platform.

An entrepreneur needs data and business skills. And truthfully,

they have to be creative at every step along the way. Usually because they are doing it

all themselves at a smaller scale. Then we have in data science something known as "the

full stack unicorn." And this is a person who can do everything at an expert level.

They are called a unicorn because truthfully, they may not actually exist. I will have more

to say about that later. But for right now, we can sum up what we got out of this video

by three things. Number one, data science is diverse. There's a lot of different people

who go into it, and they have different goals for their work, and they bring in different

skills and different experiences and different approaches. Also, they tend to work in very

different contexts. An entrepreneur works in a very different place from a business

manager, who works in a very different place from an academic researcher. But, all of them

are connected in some way to data science and make it a richer field.

The last thing

I want to say in "Data Science: An Introduction" where I am trying to define data science,

is to talk about teams in data science. The idea here is that data science has many different

tools, and different people are going to be experts in each one of them. Now, you have,

for instance, coding and you have statistics. Also, you have what feels like design, or

business and management that are involved. And the question, of course, is: "who can

do all of it? Who's able to do all of these things at the level that we need?" Well, that's

where we get this saying (I have mentioned it before), it's the unicorn.

And just like

in ancient history, the unicorn is a mythical creature with magical abilities. In data science,

it works a little differently. It is a mythical Data Scientist with universal abilities. The

trouble is, as we know from the real world, there are really no unicorns (animals), and

there are really not very many unicorns in data science. Really, there are just people.

And so we have to find out how we can do the projects even though we don't have this one

person who can do everything for everybody. So let's take a hypothetical case, just for

a moment. I am going to give you some fictional people.

Here is my fictional person Otto,

who has strong visualization skills, who has good coding, but has limited analytic or statistical

ability. And if we graph his stuff out, his abilities… So, here we have five things

that we need to have happen. And for the project to work, they all have to happen at, at least,

a level of eight on a zero-to-ten scale. If we take his coding ability, he is almost there.

Statistics, not quite halfway. Graphics, yes he can do that. And then, business, eh, alright.

And project, pretty good. So, what you can see here is, in only one of these five areas

is Otto sufficient on his own. On the other hand, let's pair him up with somebody else.

Let's take a look at Lucy. And Lucy has strong business training, has good tech skills, but

has limited graphics. And if we get her profile on the same thing that we saw, there is coding,

pretty good. Statistics, pretty good. Graphics, not so much.

Business, good. And projects,

OK. Now, the important thing here is that we can make a team. So let's take our two

fictional people, Otto and Lucy, and we can put together their abilities. Now, I actually

have to change the scale here a little bit to accommodate the both of them. But our criterion

still is at eight; we need a level of eight in order to do the project competently. And

if we combine them: oh look, coding is now past eight. Statistics is past eight. Graphics

is way past. Business way past. And then the projects, they are too. So when we combine

their skills, we are able to get the level that we need for everything. Or to put it

another way, we have now created a unicorn by team, and that makes it possible to do

the data science project.
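The arithmetic behind that team unicorn can be sketched as follows. The exact scores are not given here, so the numbers below are invented to roughly match the descriptions of Otto and Lucy, and adding the two profiles together is an assumption that mirrors stacking their bars on one chart.

```python
THRESHOLD = 8  # level needed in every area for the project to work

# Hypothetical scores on a zero-to-ten scale, chosen only to match
# the rough descriptions (e.g. Otto is strong in graphics only).
otto = {"coding": 7, "statistics": 4, "graphics": 9, "business": 5, "project": 7}
lucy = {"coding": 6, "statistics": 6, "graphics": 3, "business": 8, "project": 5}

def covered(profile):
    """Return the areas where a profile meets the threshold."""
    return [skill for skill, level in profile.items() if level >= THRESHOLD]

# Assumption: abilities combine by simple addition.
team = {skill: otto[skill] + lucy[skill] for skill in otto}

print(covered(otto))  # areas Otto can handle alone
print(covered(lucy))  # areas Lucy can handle alone
print(covered(team))  # areas the pair can handle together
```

Real skills probably do not add up this neatly, but the sketch shows the point: neither person clears the bar everywhere alone, and the pair does.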

So, in sum: you usually can't do data science on your own.

That's a very rare individual. Or more specifically: people need people, and in data science you

have the opportunity to take several people and make collective unicorns, so you can get

the insight that you need in your project and you can get the things done that you want.

In order to get a better understanding of data science, it can be helpful to look at

contrasts between data science and other fields. Probably the most informative is with Big

Data because these two terms are actually often confused. It makes me think of situations

where you have two things that are very similar, but not the same. Like we have here in the

Piazza San Carlo in Italy. Part of the problem stems from the fact that data science

and big data both have Venn Diagrams associated with them. So, for instance, Venn number one

for data science is something we have seen already. We have three circles and we have

coding and we have math and we have some domain expertise that, put together, give you data science.

On the other hand, Venn Diagram number two is for Big Data.

It also has three circles.

And we have the high volume of data, the rapid velocity of data, and the extreme variety

of data. Take those three v's together and you get Big Data. Now, we can also combine

these two, if we want, in a third Venn Diagram that we call Big Data and Data Science. This time

it is just two circles. With Big Data on the left and Data Science on the right. And the

intersection in the middle, there is Big Data Science, which actually is a real term. But,

if you want to do a compare and contrast, it kind of helps to look at how you can have

one without the other. So, let's start by looking at Big Data without Data Science.

So, these are situations where you may have the volume or velocity or variety of data

but don't need all the tools of data science.

So, we are just looking at the left side of

the equation right now. Now, truthfully, this only works if you have Big Data without all

three V's. Some say you have to have the volume, velocity, and variety for it to count as Big

Data. I basically say anything that doesn't fit into a standard machine is probably Big

Data. I can think of a couple of examples here of things that might count as Big Data,

but maybe don't count as Data Science. Machine learning, where you can have very large data

sets, and probably very complex ones, doesn't require very much domain expertise, so that may not

be data science.

Word counts, where you have an enormous amount of data and it's actually

a pretty simple analysis, again doesn't require much sophistication in terms of quantitative

skills or even domain expertise. So, maybe/maybe not data science. On the other hand, to do

any of these you are going to need to have at least two skills. You are going to need

to have the coding and you will probably have to have some sort of quantitative skills as

well. So, how about data science without Big Data? That's the right side of this diagram.

Well, to make that happen you are probably talking about data with just one of the three

V's from Big Data.

So, either volume or velocity or variety, but singly. So for instance, genetics

data. You have a huge amount of data, and it comes in a very set structure, and it tends to

come in all at once. So, you have got a lot of volume and it is a very challenging thing

to work with. You have to use data science, but it may or may not count as Big Data.

Similarly,

streaming sensor data, where you have data coming in very quickly, but you are not necessarily

saving it; you are just looking at these windows in it. That is a lot of velocity, and it is

difficult to deal with, and it takes Data Science, the full skill set, but it may not

require Big Data, per se. Or facial recognition, where you have enormous variety in the data

because you are getting photos or videos that are coming in. Again, very difficult to deal

with, requires a lot of ingenuity and creativity, and may or may not count as Big Data, depending

on how much of a stickler you are about definitions.

Now, if you want to combine the two, we can

talk about Big Data Science. In that case, we are looking right here at the middle. This

is a situation where you have volume, and velocity, and variety in your data and truthfully,

if you have the three of those, you are going to need the full Data Science skill set. You

are going to need coding, and statistics, and math, and you are going to have to have

domain expertise. Primarily because of the variety you are dealing with, but taken all

together you do have to have all of it. So in sum, here is what we get.

Big Data is not

equal to, it is not identical to data science. Now, there is common ground, and a lot of

people who are good at Big Data are good at data science and vice versa, but they are

conceptually distinct. On the other hand, there is the shared middle ground of Big Data

Science that unifies the two separate fields.

Another important contrast you can make in

trying to understand data science is to compare it with coding or computer programming. Now,

this is where you are trying to work with machines and you are trying to talk to that

machine, to get it to do things. In one sense you can think of coding as just giving task

instructions; how to do something. It is a lot like a recipe when you're cooking. You

get some sort of user input or other input, and then maybe you have if/then logic, and

you get output from it. To take an extremely simple example, if you are programming in

Python version 2, writing print and then, in quotes, "Hello, world!" will put the words

"Hello, world!" on the screen.
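Here is roughly what that looks like on the page. One caveat: in Python 2 the line is written as print "Hello, world!", while current Python 3 needs the parentheses shown below. The second part is a made-up illustration of the recipe idea (input, if/then logic, output); the function name and the temperature cutoffs are hypothetical.

```python
# Python 3 spelling of the classic first program.
print("Hello, world!")

def describe(temperature_c):
    """A tiny 'recipe': take an input, apply if/then logic, return an output."""
    if temperature_c >= 30:
        return "hot"
    elif temperature_c <= 10:
        return "cold"
    return "mild"

# Give it an input and print the output.
print(describe(35))
```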

So, you gave it some instructions and it gave you some

output. Very simple programming. Now, coding with data gets a little more complicated. So,

for instance, there are word counts, where you take a book or a whole collection of books,

you take the words and you count how many there are in there. Now, this is a conceptually

simple task, and domain expertise and really math and statistics are not vital. But to

make valid inferences and generalizations in the face of variability and uncertainty

in the data you need statistics, and by extension, you need data science. It might help to compare

the two by looking at the tools of the respective trades. So for instance, there are tools for

coding or generic computer programming, and there are tools that are specific for data

science.
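
Going back to the word-count example for a moment, here is a minimal sketch of what that task might look like in Python. The sample sentence and the function name `word_counts` are invented for illustration; a real project would read the text from a book file instead.

```python
from collections import Counter
import re


def word_counts(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# A made-up sentence standing in for a whole book.
sample = "The cat saw the dog. The dog saw the cat!"
counts = word_counts(sample)
print(counts.most_common(3))
```

Counting is the easy part. Deciding what those counts let you conclude about an author or a whole collection of books is where statistics comes in.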

So, what I have right here is a list from the IEEE of the top ten programming languages

of 2015. And it starts at Java and C and goes down to Shell. And some of these are also

used for data science. So for instance, Python and R and SQL are used for data science, but

the other ones aren't major ones in data science. So, let's, in fact, take a look at a different

list of most popular tools for data science and you see that things move around a little

bit. Now, R is at the top, SQL is there, Python is there, but for me what is the most interesting

on the list is that Excel is number five, which would never be considered programming,

per se, but it is, in fact, a very important tool for data science. And that is one of

the ways that we can compare and contrast computer programming with data science. In

sum, we can say this: data science is not equal to coding.

They are different things.

Still, they share some of the tools and they share some practices, specifically

when coding for data. On the other hand, there is one very big difference in that statistics,

statistical ability is one of the major separators between general purpose programming and data

science programming. When we talk about data science and contrast it with other fields,

another pairing that a lot of people get confused about, thinking they are the same thing, is data

science and statistics. Now, I will tell you there is a lot in common, but we can talk

a little bit about the different focuses of each. And we also get into the issue of definitionalism:

that data science is different because we define it differently, even when there is

an awful lot in common between the two.

It helps to take a look at some of the things

that go on in each field. So, let's start here with statistics. Put a little circle

here and we will put data science. And, to borrow a term from Stephen Jay Gould, we can

call these non-overlapping magisteria; NOMA. So, you think of them as separate fields that

are sovereign unto themselves with nothing to do with each other. But, you know, that

doesn't seem right; and part of that is that if we go back to the Data Science Venn Diagram,

statistics is one part of it. There it is in the top corner. So, now what do we do?

What's the relationship? So, it doesn't make sense to say these are totally separate areas.

Maybe, because data science and statistics share procedures, data science

is a subset or specialty of statistics, more like this.

But, if data science were just

a subset or specialty within statistics then it would follow that all data scientists would

first be statisticians. And interestingly that's just not so. Say, for instance, we

take a look at the data science stars, the superstars in the field. We go to a rather

intimidating article; it's called "The World's 7 Most Powerful Data Scientists" from Forbes.com.

You can see the article if you go to this URL. There's actually more than seven people,

because sometimes the author brings them up in pairs. Let's check their degrees, see what their

academic training is in. If we take all the people on this list, we have five degrees

in computer science, three in math, two in engineering, and one each in biology, economics,

law, speech pathology, and one in statistics. And so that tells us, of course, these major

people in data science are not trained as statisticians. Only one of them has formal

training in that.
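
To make that tally concrete, here is a small sketch that adds up the degree counts just read out and works out the share that is in statistics. The counts come straight from the list above; the variable names are only for illustration, and note that these are degrees, not people, since one person can hold more than one.

```python
# Degree counts as read out above for the people in the Forbes article.
degrees = {
    "computer science": 5,
    "math": 3,
    "engineering": 2,
    "biology": 1,
    "economics": 1,
    "law": 1,
    "speech pathology": 1,
    "statistics": 1,
}

total = sum(degrees.values())
stats_share = degrees["statistics"] / total
print(f"{total} degrees in total; statistics is {stats_share:.1%} of them")
```

That works out to 15 degrees, with statistics accounting for one of them, or a little under 7 percent.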

So, that gets us to the next question. Where do these two fields,

statistics and data science, diverge? Because they seem like they should have a lot in common,

but they don't have a lot in common in training. Specifically, we can look at the training. Most data scientists

are not trained, formally, as statisticians. Also, in practice, things like machine learning

and big data, which are central to data science, are not shared, generally, with most of statistics.

So, they have separate domains there.

And then there is the important issue of context.

Data scientists tend to work in different settings than statisticians. Specifically,

data scientists very often work in commercial settings where they are trying to get recommendation

engines or ways of developing a product that will make them money. So, maybe instead of

having data science a subset of statistics, we can think of it more as these two fields

have different niches. They both analyze data, but they do different things in different

ways. So, maybe it is fair to say they share, they overlap, they have analysis in common

of data, but otherwise, they are ecologically distinct. So, in sum: what we can say here

is that data science and statistics both use data and they analyze it. But the people in

each tend to come from different backgrounds, and they tend to function with different goals

and contexts. And that, in a way, renders them conceptually distinct fields despite

the apparent overlap.

As we work to get a grasp on data science, there is one more contrast

I want to make explicitly, and that is between data science and business intelligence, or

BI. The idea here is that business intelligence is data in real life; it's very, very applied

stuff. The purpose of BI is to get data on internal operations, on market competitors,

and so on, and make justifiable decisions as opposed to just sitting in the bar and

doing whatever comes to your mind. Now, data science is involved with this, except, you

know, really there is no coding in BI. There's using apps that already exist. And the statistics

in business intelligence tend to be very simple, they tend to be counts and percentages and

ratios.
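
Just to show how simple those statistics can be, here is a rough sketch of the counts, percentages, and ratios a BI report might contain. The order records, region names, and amounts are all invented for illustration.

```python
# Toy order records; the regions and amounts are invented for illustration.
orders = [
    {"region": "East", "amount": 120.0},
    {"region": "West", "amount": 80.0},
    {"region": "East", "amount": 200.0},
    {"region": "West", "amount": 100.0},
    {"region": "East", "amount": 100.0},
]

count_by_region = {}
revenue_by_region = {}
for order in orders:
    region = order["region"]
    count_by_region[region] = count_by_region.get(region, 0) + 1
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + order["amount"]

total_revenue = sum(revenue_by_region.values())
for region in sorted(count_by_region):
    share = revenue_by_region[region] / total_revenue                 # percentage
    per_order = revenue_by_region[region] / count_by_region[region]   # ratio
    print(f"{region}: {count_by_region[region]} orders, "
          f"{share:.0%} of revenue, {per_order:.2f} per order")
```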

And so, it's simple. The light bulb is simple; it just does its one job, and there

is nothing super sophisticated there. Instead the focus in business intelligence is on domain

expertise and on really useful direct utility. It's simple, it's effective and it provides

insight. Now, one of the main associations with business intelligence is what are called

dashboards, or data dashboards. They look like this; it is a collection of charts and

tables that go together to give you a very quick overview of what is going on in your

business. And while a lot of data scientists may, let's say, look down their nose upon

dashboards, I'll say this, most of them are very well designed and you can learn a huge

amount about user interaction and the accessibility of information from dashboards. So really, where

does data science come into this? What is the connection between data science and business

intelligence? Well, data science can be useful to BI in terms of setting it up.

Identifying

data sources and creating or setting up the framework for something like a dashboard or

a business intelligence system. Also, data science can be used to extend it. Data science

can be used to get past the easy questions and the easy data, to get to the questions that

are actually most useful to you, even if they sometimes require data that is hard

to wrangle and work with. And also, there is an interesting interaction here that goes

the other way. Data science practitioners can learn a lot about design from good business

intelligence applications. So, I strongly encourage anybody in data science to look

at them carefully and see what they can learn.

In sum: business intelligence, or BI, is very

goal oriented. Data science perhaps prepares the data and sets up the form for business

intelligence, but also data science can learn a lot about usability and accessibility from

business intelligence. And so, it is always worth taking a close look. Data science has

a lot of real wonderful things about it, but it is important to consider some ethical issues,

and I will specifically call this "do no harm" in your data science projects. And for that

we can say thanks to Hippocrates, the guy who gave us the Hippocratic Oath of Do No

Harm. Let's specifically talk about some of the important ethical issues, very briefly,

that come up in data science. Number one is privacy. That data tells you a lot about people

and you need to be concerned about the confidentiality. If you have private information about people,

their names, their social security numbers, their addresses, their credit scores, their

health, that's private, that's confidential, and you shouldn't share that information unless

they specifically gave you permission.

Now, one of the reasons this presents a special

challenge in data science is that, as we will see later, a lot of the sources that are used

in data science were not intended for sharing. If you scrape data from a website or from

PDFs, you need to make sure that it is ok to do that. But it was originally created

without the intention of sharing, so privacy is something that really falls upon the analyst

to make sure they are doing it properly. Next, is anonymity. One of the interesting things

we find is that it is really not hard to identify people in data. If you have a little bit of

GPS data and you know where a person was at four different points in time, you have about

a 95% chance of knowing exactly who they are. You look at things like HIPAA, that's the

Health Insurance Portability and Accountability Act. Before HIPAA, it was really easy to identify

people from medical records. Since then, it has become much more difficult to identify

people uniquely.

That's an important thing for really people's well-being. And then also,

proprietary data; if you are working for a client, a company, and they give you their

own data, that data may have identifiers. You may know who the people are, they are

not anonymous anymore. So, anonymity may or may not be there, but there are major efforts to make

data anonymous. But really, the primary thing is even if you do know who they are, that

you still maintain the privacy and confidentiality of the data. Next, there is the issue about

copyright, where people try to lock down information. Now, just because something is on the web,

doesn't mean that you are allowed to use it. Scraping data from websites is a very common

and useful way of getting data for projects. You can get data from web pages, from PDFs,

from images, from audio, from really a huge number of things. But, again the assumption

that because it is on the web, it's ok to use it is not true.

You always need to check

copyright and make sure that it is acceptable for you to access that particular data. Next,

and our very ominous picture, is data security and the idea here is that when you go through

all the effort to gather data, to clean it up and prepare it for analysis, you have created

something that is very valuable to a lot of people and you have to be concerned about

hackers trying to come in and steal the data, especially if the data is not anonymous and

it has identifiers in it. And so, there is an additional burden placed on the analyst

to ensure to the best of their ability that the data is safe and cannot be broken into

and stolen. And that can include very simple things like a person who was on the project

but no longer is, and who took the data on a flash drive.

You have to find ways to make sure

that that can't happen as well. There's a lot of possibilities, it's tricky, but it

is something that you have to consider thoroughly. Now, two other things that come up in terms

of ethics, but usually don't get addressed in these conversations. Number one is potential

bias. The idea here is that the algorithms or the formulas that are used in data science

are only as neutral or bias-free as the rules and the data that they get. And so, the idea

here is that if you have rules that address something that is associated with, for instance,

gender or age or race or economic standing, you might unintentionally be building in those

factors. Which, say for instance, under Title IX, you are not supposed to do. You might

be building those into the system without being aware of it, and an algorithm has this

sheen of objectivity, and people say they can place confidence in it without realizing

that it is replicating some of the prejudices that may happen in real life.

Another issue

is overconfidence. And the idea here is that analyses are limited simplifications. They

have to be, that is just what they are. And because of this, you still need humans in

the loop to help interpret and apply this. The problem is when people run an algorithm

to get a number, say to ten decimal places, and they say, "this must be true," and treat

it as written-in-stone absolutely unshakeable truth, when in fact, if the data were biased

going in; if the algorithms were incomplete, if the sampling was not representative, you

can have enormous problems and go down the wrong path with too much confidence in your

own analyses.
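
Here is a small sketch of that overconfidence problem: a number printed to ten decimal places that is still badly wrong, because the sample behind it was not representative. The "population" values are invented toy numbers, not real data.

```python
from statistics import mean

# Invented "population" of 10 incomes (in thousands); not real data.
population = [20, 22, 25, 27, 30, 32, 35, 40, 90, 150]

# A non-representative sample: only the three highest earners responded.
biased_sample = sorted(population)[-3:]

true_mean = mean(population)
biased_mean = mean(biased_sample)

# Ten decimal places look precise, but the estimate is far from the truth.
print(f"biased estimate: {biased_mean:.10f}")
print(f"population mean: {true_mean:.10f}")
```

The extra decimal places say nothing about whether the sample was any good; that judgment still has to come from a person.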

So, once again humility is in order when doing data science work. In sum:

data science has enormous potential, but it also has significant risks involved in the

projects. Part of the problem is that analyses can't be neutral, that you have to look at

how the algorithms are associated with the preferences, prejudices, and biases of the

people who made them. And what that means is that no matter what, good judgment is always

vital to the quality and success of a data science project. Data Science is a field that is strongly

associated with its methods or procedures.

In this section of videos, we're going to

provide a brief overview of the methods that are used in data science. Now, just as a quick

warning, in this section things can get kind of technical and that can cause some people

to sort of freak out. But, this course is a non-technical overview. The technical hands

on stuff is in the other courses. And it is really important to remember that tech is

simply the means to doing data science. Insight or the ability to find meaning in your data,

that's the goal. Tech only helps you get there. And so, we want to focus primarily on insight

and the tools and the tech as they serve to further that goal. Now, there's a few general

categories we are going to talk about, again, with an overview for each of these. The first

one is sourcing or data sourcing.

That is how to get the data that goes into data science,

the raw materials that you need. Second is coding. That again is computer programming

that can be used to obtain and manipulate and analyze the data. After that, a tiny bit

of math and that is the mathematics behind data science methods that really form the

foundations of the procedures.

And then stats, the statistical methods that are frequently

used to summarize and analyze data, especially as applied to data science. And then there

is machine learning, ML, this is a collection of methods for finding clusters in the data,

for predicting categories or scores on interesting outcomes. And even across these five things,

even then, the presentations aren't too techie-crunchy, they are basically still friendly. Really,

that's the way it is. So, that is the overview of the overviews. In sum: we need to remember

that data science includes tech, but data science is greater than tech, it is more than

those procedures.

And above all, tech, while important to data science, is still simply

a means to insight in data. The first step in discussing data science methods is to look

at the methods of sourcing, or getting data that is used in data science. You can think

of this as getting the raw materials that go into your analyses. Now, you have got a

few different choices when it comes to this in data science. You can use existing data,

you can use something called data APIs, you can scrape web data, or you can make data.

We'll talk about each of those very briefly in a non-technical manner. For right now,

let me say something about existing data. This is data that already is at hand and it

might be in-house data. So if you work for a company, it might be your company records.

Or, you might have open data; for instance, many governments and many scientific organizations

make their data available to the public.

And then there is also third party data. This

is usually data that you buy from a vendor, but it exists and it is very easy to plug

it in and go. You can also use APIs. Now, that stands for Application Programming Interface,

and this is something that allows various computer applications to communicate directly

with each other. It's like phones for your computer programs. It is the most common way

of getting web data, and the beautiful thing about it is it allows you to import that data

directly into whatever program or application you are using to analyze the data. Next is

scraping data. And this is where you want to use data that is on the web, but they don't

have an existing API.

And what that means, is usually data that's in HTML web tables

and pages, maybe PDFs. And you can do this either with using specialized applications

for scraping data or you can do it in a programming language, like R or Python, and write the

code to do the data scraping. Or another option is to make data. And this lets you get exactly

what you need; you can be very specific and you can get what you need. You can do something

like interviews, or you can do surveys, or you can do experiments. There are a lot of

approaches, and most of them require some specialized training in terms of how to gather quality

data. And that is actually important to remember, because no matter what method you use for

getting or making new data, you need to remember this one little aphorism you may have heard

from computer science.

It goes by the name of GIGO: that actually stands for "Garbage

In, Garbage Out," and it means if you have bad data that you are feeding into your system,

you are not going to get anything worthwhile, any real insights out of it. Consequently,

it is important to pay attention to metrics, or methods for measuring, and their meaning – exactly

what it is that they tell you. There's a few ways you can do this. For instance, you can

talk about business metrics, you can talk about KPIs, which means Key Performance Indicators,

also used in business settings. Or SMART goals, which is a way of describing the goals that

are actionable and timely and so on. You can also talk about, in a measurement sense, classification

accuracy. And I will discuss each of those in a little more detail in a later movie.
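
As a quick preview of the last item, classification accuracy in its simplest form is just the share of cases a method labels correctly. Here is a minimal sketch; the labels and the function name `classification_accuracy` are made up for illustration.

```python
def classification_accuracy(actual, predicted):
    """Share of cases where the predicted label matches the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels for eight cases.
actual    = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

accuracy = classification_accuracy(actual, predicted)
print(f"accuracy = {accuracy:.2f}")
```

Six of the eight made-up cases match, so the accuracy here is 0.75.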

But for right now, in sum, we can say this: data sourcing is important because you need

to get the raw materials for your analysis. The nice thing is there are many possible methods,

many ways that you can use to get the data for data science.
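
As one illustration of those methods, here is a rough sketch of scraping an HTML table using only Python's standard library. The page text is a tiny invented example standing in for a page you are actually permitted to scrape, and the class name `TableScraper` is hypothetical; dedicated scraping tools would handle messier pages better.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []                      # start a new row
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")                # start a new cell

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()       # text inside the current cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# A tiny invented page standing in for HTML you are permitted to scrape.
page = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
header, *records = scraper.rows
print(header, records)
```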

But no matter what you do,

it is important to check the quality and the meaning of the data so you can get the most

insight possible out of your project. The next step we need to talk about in data science

methods is coding, and I am going to give you a very brief non-technical overview of

coding in data science. The idea here is that you are going to get in there and you are

going to be King of the Jungle, master of your domain, and make the data jump when you need

it to jump. Now, if you remember when we talked about the Data Science Venn Diagram at the

beginning, coding is up here on the top left. And while we often think about sort of people

typing lines of code (which is very frequent), it is more important to remember when we talk

about coding (or just computers in general), what we are really talking about here is any

technology that lets you manipulate the data in the ways that you need to perform the procedures

you need to get the insight that you want out of your data.

Now, there are three very

general categories that we will be discussing here on datalab. The first is apps; these

are specialized applications or programs for working with data. The second is data; or

specifically, data formats. There's special formats for web data, I will mention those

in a moment. And then, code; there are programming languages that give you full control over

what the computer does and how you interact with the data.

Let's take a look at each one

very briefly. In terms of apps, there are spreadsheets, like Excel or Google Sheets.

These are the fundamental data tools of probably a majority of the world. There are specialized

applications, like Tableau for data visualization, or SPSS, which is a very common statistical package

in the social sciences and in businesses, and one of my favorites, JASP, which is a

free open source analog of SPSS, which actually I think is a lot easier to use and replicate

research with. And, there are tons of other choices. Now, in terms of web data, it is

helpful to be familiar with things like HTML, and XML, and JSON, and other formats that

are used to encapsulate data on the web, because those are the things that you are going to

have to program around and interact with when you get your data. And then there are

actual coding languages. R is probably the most common, along with Python, a general purpose

language that has been well adapted for data use. There's SQL, the structured query

language for databases, and very basic languages like C, C++, and Java, which are used more

in the back-end of data science.

And then there is Bash, the most common command line

interface, and regular expressions. And we will talk about all of these in other courses

here at datalab. But, remember this: tools are just tools. They are only one part of

the data science process. They are a means to the end, and the end, the goal is insight.

You need to know where you are trying to go and then simply choose the tools that help

you reach that particular goal. That's the most important thing. So, in sum, here's a

few things: number one, use your tools wisely. Remember your questions need to drive the

process, not the tools themselves. Also, I will just mention that a few tools are usually

enough. You can do an awful lot with Excel and R. And then, the most important thing

is: focus on your goal and choose your tools and even your data to match the goal, so you

can get the most useful insights from your data.
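
As a small illustration of how a few of the pieces from this section fit together, here is a sketch that takes data in JSON (one of the web formats mentioned above), loads it with Python, and then asks a question of it in SQL. The JSON text, the table name `books`, and its columns are all invented for illustration; a real API would supply its own structure.

```python
import json
import sqlite3

# Invented JSON text, standing in for what a web API might return.
payload = (
    '[{"title": "Book A", "pages": 320},'
    ' {"title": "Book B", "pages": 150},'
    ' {"title": "Book C", "pages": 410}]'
)
books = json.loads(payload)          # JSON text -> list of Python dicts

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE books (title TEXT, pages INTEGER)")
conn.executemany(
    "INSERT INTO books (title, pages) VALUES (:title, :pages)", books
)

# SQL does the filtering and sorting for us.
long_books = conn.execute(
    "SELECT title, pages FROM books WHERE pages > 200 ORDER BY pages DESC"
).fetchall()
conn.close()
print(long_books)
```

Notice that the question (which books are long?) came first, and the tools were picked to answer it.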

The next step in our discussion of data

science methods is mathematics, and I am going to give a very brief overview of the math

involved in data science. Now, the important thing to remember is that math really forms

the foundation of what we're going to do. If you go back to the Data Science Venn Diagram,

we've got stats up here in the right corner, but really it's math and stats, or quantitative

ability in general, but we'll focus on the math part right here.

And probably the most

important question is how much math is enough to do what you need to do? Or to put it another

way, why do you need math at all, because you have got a computer to do it? Well, I

can think of three reasons why you don't want to rely on just the computer, and why it is helpful

to have some sound mathematical understanding. Here they are: number one, you need to know

which procedures to use and why. So you have your question, you have your data and you

need to have enough of an understanding to make an informed choice.

That's not terribly

difficult. Two, you need to know what to do when things don't work right. Sometimes you

get impossible results. I know that in statistics you can get a negative adjusted R²; that's

not supposed to happen. And it is good to know the mathematics that go into calculating

that so you can understand how something apparently impossible can work. Or, you are trying to

do a factor analysis or principal component and you get a rotation that won't converge.

It helps to understand what it is about the algorithm that's happening, and why that won't

work in that situation.
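
To see how the negative adjusted R-squared mentioned above can happen, it helps to look at the usual formula: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of cases and p the number of predictors. The sketch below plugs in hypothetical numbers; the function name `adjusted_r2` is just for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical weak model: R^2 = 0.05 with 10 cases and 3 predictors.
value = adjusted_r2(0.05, n=10, p=3)
print(round(value, 3))
```

With these made-up numbers the result is -0.425. The penalty for using three predictors on only ten cases is bigger than the small amount of variance the model explains, so the "impossible" negative value is really just the formula saying the model is not helping.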

And number three, interestingly, some procedures, some math

is easier and quicker to do by hand than by firing up the computer. And I'll show you

a couple of examples in later videos, where that can be the case. Now, fundamentally there

is a nice sort of analogy here. Math is to data science as, for instance, chemistry is

to cooking, kinesiology is to dancing, and grammar is to writing. The idea here is that

you can be a wonderful cook without knowing any chemistry, but if you know some chemistry

it is going to help. You can be a wonderful dancer without knowing kinesiology, but it is

going to help. And you can probably be a good writer without having an explicit knowledge

of grammar, but it is going to make a big difference.

The same thing is true of data

science; you will do it better if you have some of the foundational information. So,

the next question is: what kinds of math do you need for data science? Well, there's a

few answers to that. Number one is algebra; you need some elementary algebra. That is,

the basically simple stuff. You may have to do some linear or matrix algebra because that

is the foundation of a lot of the calculations. And you can also have systems of linear equations

where you are trying to solve several equations all at once. It is a tricky thing to do, in

theory, but this is one of the things that is actually easier to do by hand sometimes.
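
For a sense of what a small system of linear equations looks like, here is a minimal sketch that solves two equations in two unknowns using Cramer's rule. The equations are made up, and Cramer's rule is just one convenient method for a system this small; the function name `solve_2x2` is hypothetical.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("no unique solution")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# Made-up system:  x + y = 10  and  2x - y = 2
x, y = solve_2x2(1, 1, 10, 2, -1, 2)
print(x, y)
```

By hand, you would just add the two equations to get 3x = 12, so x = 4 and y = 6, which is arguably quicker than opening an editor.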

Now, there's more math. You can get some Calculus. You can get some big O, which has to do with

the order of a function, which has to do with sort of how fast it works.
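
To give the big O idea some shape, here is a rough sketch that counts how many comparisons two search strategies need on the same sorted list. The function names and the list size are assumptions chosen for illustration.

```python
def linear_search_steps(items, target):
    """Count comparisons when checking items one by one: O(n)."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Count comparisons when halving a sorted list each time: O(log n)."""
    lo, hi, steps = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


data = list(range(1_000_000))
linear_steps = linear_search_steps(data, 999_999)
binary_steps = binary_search_steps(data, 999_999)
print(linear_steps, binary_steps)
```

Looking for the last item takes a million checks one way and no more than twenty the other, and that gap is what the order of a function describes.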

Probability theory

can be important, and Bayes' theorem, which is a way of getting what is called a posterior

probability can also be a really helpful tool for answering some fundamental questions in

data science. So in sum: a little bit of math can help you make informed choices when planning

your analyses. Very significantly, it can help you find the problems and fix them when

things aren't going right. It is the ability to look under the hood that makes a difference.

And then truthfully, some mathematical procedures, like systems of linear equations, can

even be done by hand, sometimes faster than you can do with a computer.

So, you can save

yourself some time and some effort and move ahead more quickly toward your goal of insight.
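
Since Bayes' theorem came up above, here is a minimal sketch of a posterior probability calculation. The scenario (a test for some condition) and all three input numbers are hypothetical, chosen only to show the arithmetic.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers: 1% base rate, 90% sensitivity, 5% false positives.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"{p:.3f}")
```

With these made-up inputs the posterior is only about 0.15, even though the test sounds accurate, because the condition is rare to begin with. That is the kind of thing a little math lets you catch.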

Now, data science wouldn't be data science and its methods without a little bit of statistics.

So, I am going to give you a brief statistics overview here of how things work in data science.

Now, you can think of statistics as really an attempt to find order in chaos, find patterns

in an overwhelming mess. Sort of like trying to see the forest and the trees. Now, let's

go back to our little Venn Diagram here. We recently had math and stats here in the top

corner. We are going to go back to talking about stats, in particular. One thing you are trying

to do here is to explore your data. You can have exploratory graphics, because

we are visual people and it is usually easiest to see things.

You can have exploratory statistics,

a numerical exploration of the data. And you can have descriptive statistics, which are

the things that most people would have talked about when they took a statistics class in

college (if they did that). Next, there is inference. I've got smoke here because you

can infer things about the wind and the air movement by looking at patterns in smoke.

The idea here is that you are trying to take information from samples and infer something

about a population. You are trying to go from one source to another. One common version

of this is hypothesis testing. Another common version is estimation, sometimes called Confidence

Intervals. There are other ways to do it, but all of these let you go beyond the data

at hand to making larger conclusions.
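
As a small sketch of both ideas, describing a sample and then inferring something beyond it, here is a mean with a rough 95% confidence interval. The measurements are invented, and the 1.96 multiplier is the normal approximation, which is an assumption that is a little optimistic for a sample this small.

```python
from math import sqrt
from statistics import mean, stdev

# Invented sample of 10 measurements.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

m = mean(sample)                  # descriptive: the sample mean
s = stdev(sample)                 # descriptive: sample standard deviation
se = s / sqrt(len(sample))        # standard error of the mean

# Rough 95% interval using the normal value 1.96; with only 10 cases a
# t-based multiplier (about 2.26) would give a somewhat wider interval.
low, high = m - 1.96 * se, m + 1.96 * se
print(f"mean = {m:.2f}, 95% CI roughly ({low:.2f}, {high:.2f})")
```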

Now, one interesting thing about statistics is

you're going to have to be concerned with some of the details and arranging things just

so. For instance, you get to do something like feature selection, and that's picking the

variables or combinations of variables that should be included, and there are problems that come up

frequently, and I will address some of those in later videos.

There's also the

matter of validation. When you create a statistical model you have to see if it is actually accurate.

Hopefully, you have enough data that you can have a holdout sample and do that, or you

can replicate the study. Then, there is the choice of estimators that you use; how you

actually get the coefficients or the combinations in your model. And then there's ways of assessing

how well your model fits the data. All of these are issues that I'll address briefly

when we talk about statistical analysis at greater length. Now, I do want to mention

one thing in particular here, and I just call this "beware the trolls." There are people

out there who will tell you that if you don't do things exactly the way they say to do it,

that your analysis is meaningless, that your data is junk and you've lost all your time.

You know what? They're trolls.

So, the idea here is don't listen to that. You can make

enough of an informed decision on your own to go ahead and do an analysis that is still

useful. Probably one of the most important things to think about in this is this wonderful

quote from a very famous statistician and it says, "All models or all statistical models

are wrong, but some are useful." And so the question isn't whether you're technically

right, or you have some sort of level of intellectual purity, but whether you have something that

is useful. That, by the way, comes from George Box. And I like to think of it basically as

this: wave your flag, wave your "do it yourself" flag, and just take pride in what

you're able to accomplish even when there are people who may be criticizing it.

Go ahead,

you're doing something, go ahead and do it. So, in sum: statistics allows you to explore

and describe your data. It allows you to infer things about the population. There are a lot

of choices available, a lot of procedures. But no matter what you do, the goal is useful

insight. Keep your eyes on that goal and you will find something meaningful and useful

in your data to help you in your own research and projects. Let's finish our data science

methods overview by getting a brief overview of Machine Learning. Now, I've got to admit

when you say the term "machine learning," people start thinking something like, "the

robot overlords are going to take over the world." That's not what it is. Instead, let's

go back to our Venn Diagram one more time, and in the intersection at the top between

coding and stats is Machine Learning or as it's commonly called, just ML.

The goal of

Machine Learning is to go and work in data space so that, for instance, you can take

a whole lot of data (we've got tons of books here), and then you can reduce the dimensionality.

That is, take a very large, scattered, data set and try to find the most essential parts

of that data. Then you can use these methods to find clusters within the data; like goes

with like. You can use methods like k-means. You can also look for anomalies or unusual

cases that show up in the data space. Or, if we go back to categories again, I talked

about like for like. You can use things like logistic regression or k-nearest neighbors,

KNN. You can use Naive Bayes for classification or Decision Trees or SVM, which is Support

Vector Machines, or artificial neural nets.
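
To give a feel for one of the classification methods just named, here is a minimal sketch of k-nearest neighbors written in plain Python. The training points, the labels, and the function name `knn_predict` are invented for illustration; a real project would normally use an established library instead.

```python
from collections import Counter
from math import dist


def knn_predict(train, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda row: dist(row[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented training data: (height_cm, weight_kg) -> label.
train = [
    ((150, 50), "small"), ((155, 55), "small"), ((160, 58), "small"),
    ((180, 85), "large"), ((185, 90), "large"), ((190, 95), "large"),
]

print(knn_predict(train, (158, 57)), knn_predict(train, (183, 88)))
```

The idea is literally "like goes with like": a new case gets the label that most of its closest neighbors already have.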

Any of those will help you find the patterns

and the clumping in the data so you can get similar cases next to each other, and get

the cohesion that you need to make conclusions about these groups. Also, a major element

of machine learning is predictions. You're going to point your way down the road. The

most common approach here, and the most basic, is linear regression or multiple regression.

There is also Poisson regression, which is used for modeling count or frequency data.

And then there is the issue of Ensemble models, where you create several models and you take

the predictions from each of those and you put them together to get an overall more reliable

prediction. Now, I will talk about each of these in a little more detail in later courses,

but for right now I mostly just want you to know that these things exist, and that's what

we mean when we refer to Machine Learning. So, in sum: machine learning can be used to

categorize cases and to predict scores on outcomes. And there are a lot of choices, many

choices and procedures available. But, again, as I said with statistics, and I'll also say

again many times after this, no matter what, the goal is not that "I'm going to do an artificial

neural network or an SVM," the goal is to get useful insight into your data.

Machine learning

is a tool; use it to the extent that it helps you get the insight that you need.
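
As a rough illustration of the prediction and ensemble ideas mentioned above, here's a sketch that fits several simple linear regressions, each on a random resample of the same data, and averages their predictions. This particular recipe is usually called bagging, and it is only one of several ways to build an ensemble. The data points, the function names fit_line and bagged_predict, and the choice of 25 models are all assumptions made up for the example.

```python
import random

def fit_line(points):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def bagged_predict(points, x_new, n_models=25, seed=0):
    """Average the predictions of lines fitted to bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    predictions = []
    while len(predictions) < n_models:
        # Resample the data with replacement, same size as the original.
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) < 2:
            continue  # a line needs at least two distinct x values
        intercept, slope = fit_line(sample)
        predictions.append(intercept + slope * x_new)
    return sum(predictions) / len(predictions)

# Made-up data that roughly follows y = 2x + 1.
points = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8),
          (5, 11.1), (6, 13.0), (7, 14.8), (8, 17.1)]

intercept, slope = fit_line(points)
print("single model:", intercept + slope * 10)
print("ensemble:    ", bagged_predict(points, 10))
```

With data this tidy the single model and the ensemble land close together; the averaging tends to matter more when the individual models are noisier.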

In the last several videos I've talked about the role in data science of technical things.

On the other hand, communicating is essential to the practice, and the first thing I want

to talk about there is interpretability. The idea here is that you want to be able to lead

people through a path on your data.

You want to tell a data-driven story, and that's the

entire goal of what you are doing with data science. Now, another way to think about this

is: when you are doing your analysis, what you're trying to do is solve for value. You're

making an equation. You take the data, you're trying to solve for value. The trouble is

this: a lot of people get hung up on analysis, but they need to remember that analysis is

not the same thing as value.

Instead, I like to think of it this way: that analysis times

story is equal to value. Now, please note that's multiplicative, not additive, and so

one consequence of that is when you go back to, analysis times story equals value. Well,

if you have zero story you're going to have zero value because, as you recall, anything

times zero is zero. So, instead of that let's go back to this and say what we really want

to do is, we want to maximize the story so that we can maximize the value that results

from our analysis. Again, maximum value is the overall goal here. The analysis, the tools,

the tech, are simply methods for getting to that goal. So, let's talk about goals. For

instance, an analysis is goal-driven. You are trying to accomplish something that's

specific, so the story, or the narrative, or the explanation you give about your project

should match those goals. If you are working for a client that has a specific question

that they want you to answer, then you have a professional responsibility to answer those

questions clearly and unambiguously, so they know whether you said yes or no and they know

why you said yes or no.

Now, part of the problem here is the fact that the client isn't you and

they don't see what you do. And as I show here, simply covering your face doesn't make

things disappear. You have to worry about a few psychological abstractions. You have

to worry about egocentrism. And I'm not talking about being vain, I'm talking about the idea

that you think other people see and know and understand what you know. That's not true;

otherwise, they wouldn't have hired you in the first place. And so you have to put it

in terms that the client works with, and that they understand, and you're going to have

to get out of your own center in order to do that. Also, there's the idea of false consensus;

the idea that, "well everybody knows that." And again, that's not true, otherwise, they

wouldn't have hired you. You need to understand that they are going to come from a different

background with a different range of experience and interpretation.

You're going to have to

compensate for that. A funny little thing is the idea about anchoring. When you give

somebody an initial impression, they use that as an anchor, and then they adjust away from

it. So if you are going to try to flip things over on their heads, watch out for giving

a false impression at the beginning unless you absolutely need to. But most importantly,

in order to bridge the gap between the client and you, you need to have clarity and explain

yourself at each step. You can also think about the answers. When you are explaining

the project to the client, you might want to start with a very simple procedure: state

the question that you are answering.

Give your answer to that question, and if you need

to, qualify as needed. And then, go in order top to bottom, so you're trying to make it

as clear as possible what you're saying, what the answer is, and make it really easy to

follow. Now, in terms of discussing your process, how you did all this: most of the time it

is probably the case that they don't care, they just want to know what the answer is and that

you used a good method to get that. So, discuss the process or the technical

details only when absolutely necessary. That's something to keep in mind. The thing to remember

here is that analysis means breaking something apart. This, by the way,

is a mechanical typewriter broken into its individual components.

Analysis means to take

something apart, and analysis of data is an exercise in simplification. You're taking

the overall complexity, sort of the overwhelmingness of the data, and you're boiling it down and

finding the patterns that make sense and serve the needs of your client. Now, let's go to

a wonderful quote from our friend Albert Einstein here, who said, "Everything should be made

as simple as possible, but not simpler." That's true in presenting your analysis. Or, if you

want to go see the architect and designer Ludwig Mies van der Rohe, who said, "Less

is more." It is actually Robert Browning who originally said that, but Mies van der Rohe

popularized it. Or, if you want another way of putting a principle that comes from my

field, I'm actually a psychological researcher; they talk about being minimally sufficient.

Just enough to adequately answer the question. If you're in commerce you know about a minimum

viable product; it is sort of the same idea within analysis here, the minimal viable analysis.

So, here's a few tips: when you're giving a presentation, more charts, less text, great.

And then, simplify the charts; remove everything that doesn't need to be in there.

Generally,

you want to avoid tables of data because those are hard to read. And then, one more time

because I want to emphasize it, less text again. Charts, tables can usually carry the

message. And so, let me give you an example here. I'm going to use a very famous dataset

on Berkeley admissions. Now, these are not stairs at Berkeley, but it gives the idea

of trying to get into something that is far off and distant. Here's the data; this is

graduate school admissions in 1973, so it's over 40 years ago. The idea is that men and

women were both applying for graduate school at the University of California Berkeley.

And what we found is that 44 percent of the men who applied were admitted, that's their

part in green.

And of the women, only 35 percent of women were admitted when they applied.

So, really at first glance this is bias, and it actually led to a lawsuit, it was a major

issue. So, what Berkeley then tried to do was find out, "well which programs are responsible

for this bias?" And they got a very curious set of results. If you break the applications

down by program (and here we are calling them A through F), six different programs.

What

you find, actually, is that in each of these, male applicants are on the left and female applicants

are on the right. If you look at program A, women actually got accepted at a higher rate,

and the same is true for B, and the same is true for D, and the same is true for F. And

so, this is a very curious set of responses and it is something that requires explanation.

Now in statistics, this is something that is known as Simpson's Paradox. But here is

the paradox: bias may be negligible at the department level. And in fact, as we saw in

four of the departments, there was a possible bias in favor of women. And the problem is

that women applied to more selective programs, programs with lower acceptance rates. Now,

some people stop right here and say therefore, "nothing is going on, nothing to complain

about." But you know, that's still ending the story a little bit early.
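
If you want to see the paradox in numbers, here is a small sketch in Python. It assumes the admitted and rejected counts as they are distributed in R's built-in UCBAdmissions table, which covers only the six largest departments, so the totals it yields are not identical to the 44 and 35 percent quoted above. The variable and function names are made up for the example.

```python
# (admitted, rejected) counts by department and gender, taken from R's
# UCBAdmissions table (six largest departments only; an assumed data source).
admissions = {
    "A": {"men": (512, 313), "women": (89, 19)},
    "B": {"men": (353, 207), "women": (17, 8)},
    "C": {"men": (120, 205), "women": (202, 391)},
    "D": {"men": (138, 279), "women": (131, 244)},
    "E": {"men": (53, 138), "women": (94, 299)},
    "F": {"men": (22, 351), "women": (24, 317)},
}

def rate(admitted, rejected):
    """Share of applicants who were admitted."""
    return admitted / (admitted + rejected)

def overall(gender):
    """Admission rate for one gender with all departments pooled together."""
    admitted = sum(groups[gender][0] for groups in admissions.values())
    rejected = sum(groups[gender][1] for groups in admissions.values())
    return rate(admitted, rejected)

# Departments where women were admitted at a higher rate than men.
women_higher = [
    dept for dept, groups in admissions.items()
    if rate(*groups["women"]) > rate(*groups["men"])
]

for dept, groups in admissions.items():
    print(dept, f"men {rate(*groups['men']):.0%}", f"women {rate(*groups['women']):.0%}")

print("women admitted at a higher rate in:", women_higher)
print(f"pooled: men {overall('men'):.1%}, women {overall('women'):.1%}")
```

Department by department, women come out ahead in A, B, D, and F, yet the pooled rate is clearly higher for men, because far more women applied to the departments that admit a small share of everyone.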

There are other

questions that you can ask, and in producing a data-driven story, this is stuff that you

would want to do. So, for instance, you may want to ask, "why do the programs vary in

overall class size? Why do the acceptance rates differ from one program to the other?

Why do men and women apply to different programs?" And you might want to look at things like

the admissions criteria for each of the programs, the promotional strategies, how they advertise

themselves to students. You might want to look at the kinds of prior education the students

have in the programs, and you really want to look at funding level for each of the programs.

And so, really, you get one answer, at least more questions, maybe some more answers, and

more questions, and you need to address enough of this to provide a comprehensive overview

and solution to your client.

In sum, let's say this: stories give value to data analysis.

And when you tell the story, you need to make sure that you are addressing your client's

goals in a clear, unambiguous way. The overall principle here is be minimally sufficient.

Get to the point, make it clear. Say what you need to, but otherwise be concise and

make your message clear. The next step in discussing data science and communicating

is to talk about actionable insights, or information that can be used productively to accomplish

something. Now, to give sort of a bizarre segue here, you look at a game controller.

It may be a pretty thing, it may be a nice object, but remember: game controllers exist

to do something. They exist to help you play the game and to do it as effectively as possible.

They have a function, they have a purpose.

In the same way, data is for doing. Now, that's a

paraphrase of one of my favorite historical figures. This is William James, the father

of American Psychology, and of pragmatism in philosophy. And he has this wonderful quote,

he said, "My thinking is first and last and always for the sake of my doing." And the

idea applies to analysis. Your analysis and your data is for the sake of your doing. So,

you're trying to get some sort of specific insight in how you should proceed.

What you

want to avoid is the opposite of this from one of my other favorite cultural heroes,

the famous Yankees catcher Yogi Berra, who said, "We're lost, but we're making good time."

The idea here is that frantic activity does not make up for lack of direction. You need

to understand what you are doing so you can reach the particular goal. And your analysis

is supposed to do that. So, when you're giving your analysis, you're going to try to point

the way.

Remember, why was the project conducted? The goal is usually to direct some kind of

action, reach some kind of goal for your client. And the analysis should be able to guide

that action in an informed way. One thing you want to do is, you want to be able to

give the next steps to your client. Give the next steps; tell them what they need to do

now. You want to be able to justify each of those recommendations with the data and your

analysis. As much as possible be specific, tell them exactly what they need to do. Make

sure it's doable by the client, that it's within their range of capability. And that

each step should build on the previous step. Now, that being said, there is one really

fundamental sort of philosophical problem here, and that's the difference between correlation

and causation. Basically, it goes this way: your data gives you correlation; you know

that this is associated with that.

But your client doesn't simply want to know what's

associated; they want to know what causes something. Because if they are going to do

something, that's an intervention designed to produce a particular result. So, really,

how do you get from the correlation, which is what you have in the data, to the causation,

which is what your client wants? Well, there's a few ways to do that.
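
Before getting to those, here's a small simulation of why correlation by itself can mislead, and why random assignment helps. Everything in it is invented for illustration: a hidden factor drives both an "exposure" and an "outcome," the exposure has no effect at all on the outcome, and the names, sample size, and random seed are arbitrary assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000

# A hidden factor that nobody recorded in the data.
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Observational data: the exposure tracks the hidden factor.
exposure_observed = [h + rng.gauss(0, 1) for h in hidden]

# The outcome depends ONLY on the hidden factor, never on the exposure.
outcome = [h + rng.gauss(0, 1) for h in hidden]

# Experimental data: the exposure is assigned by coin flip instead.
exposure_randomized = [rng.choice([0, 1]) for _ in range(n)]

r_observed = pearson(exposure_observed, outcome)
r_randomized = pearson(exposure_randomized, outcome)

print(f"observational correlation: {r_observed:.2f}")
print(f"randomized correlation:    {r_randomized:.2f}")
```

In the observational version the exposure and the outcome are clearly correlated (around 0.5 in theory) even though one does nothing to the other. Once the exposure is assigned at random, it can no longer track the hidden factor, and the correlation falls to roughly zero.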

One is experimental

studies; these are randomized, controlled trials. Now, that's theoretically the simplest

path to causality, but it can be really tricky in the real world. There are quasi-experiments,

and these are methods, a whole collection of methods. They use non-randomized data,

usually observational data, adjusted in particular ways to get an estimate of causal inference.

Or, there's the theory and experience. And this is research-based theory and domain-specific

experience. And this is where you actually get to rely on your client's information.

They can help you interpret the information, especially if they have greater domain expertise

than you do. Another thing to think about is the social factors that affect your data.

Now, you remember the data science Venn Diagram. We've looked at it lots of times. It has got

these three elements. Some have proposed adding a fourth circle to this Venn diagram, and

we'll kind of put that in there and say that social understanding is also important, critical

really, to valid data science.

Now, I love that idea, and I do think that it's important

to understand how things are going to play out. There are a few kinds of social understanding.

You want to be aware of your client's mission. You want to make sure that your recommendations

are consistent with your client's mission. Also, that your recommendations are consistent

with your client's identity; not just, "This is what we do," but, "This is really who we

are." You need to be aware of the business context, sort of the competitive environment

and the regulatory environment that they're working in.

As well as the social context;

and that can be outside of the organization, but even more often within the organization.

Your recommendations will affect relationships within the client's organization. And you

are going to try to be aware of those as much as you can to make it so that your recommendations

can be realized the way they need to be. So, in sum: data science is goal focused, and

when you're focusing on that goal for your client you need to give specific next steps

that are based on your analysis and justifiable from the data. And in doing so, be aware of

the social, political, and economic context that gives you the best opportunity of getting

something really useful out of your analysis. When you're working in data science and trying

to communicate your results, presentation graphics can be an enormously helpful tool.

Think of it this way: you are trying to paint a picture for the benefit of your client.

Now, when you're working with graphics there can be a couple of different goals; it depends

on what kind of graphics you're working with.

There's the general category of exploratory

graphics. These are ones that you are using as the analyst. And for exploratory graphics,

you need speed and responsiveness, and so you get very simple graphics. This is a base

histogram in R. And they can get a little more sophisticated and this is done in ggplot2.

And you can break it down into a couple other histograms, or you can make it a different

way, or make it see-through, or split them apart into small multiples.

But in each case,

this is done for the benefit of you as the analyst understanding the data. These are

quick, they're effective. Now, they are not very well-labeled, and they are usually for

your insight, and then you do other things as a result of that. On the other hand, presentation

graphics which are for the benefit of your client, those need clarity and they need a

narrative flow. Now, let me talk about each of those characteristics very briefly. Clarity

versus distraction. There are things that can go wrong in graphics.

Number one is color.

Colors can actually be a problem. Also, three-dimensional or false dimensions are nearly always a distraction.

One that gets a little touchy for some people is interaction. We think of interactive graphics

as really cool, great things to have, but you run the risk of people getting distracted

by the interaction and starting to play around with it. Going, like, "Ooh, I press here it

does that." And that distracts from the message. So actually, it may be important to not have

interaction. And then the same thing is true of animation. Flat, static graphics can often

be more informative because they have fewer distractions in them. Let me give you a quick

example of how not to do things.

Now, this is a chart that I made. I made it in Excel,

and I did it based on some of the mistakes I've seen in graphics submitted to me when

I teach. And I guarantee you, everything in here I have seen in real life, just not necessarily

combined all at once. Let's zoom in on this a little bit, so we can see the full badness

of this graphic. And let's see what's going on here. We've got a scale here that starts

at 8%, goes to 28%, and is tiny; it doesn't even cover the range of the data. We've got this

bizarre picture on the wall. We've got no axis lines on the walls. We come down here;

the labels for educational levels are in alphabetical order, instead of the more logical higher

levels of education. Then we've got the data represented as cones, which are difficult

to read and compare, and it's only made worse by the colors and the textures. You know,

if you want to take an extreme, this one for grad degrees doesn't even make it to the floor

value of 8% and this one for high school grad is cut off at the top at 28%.

This, by the

way, is a picture of a sheep, and people do this kind of stuff and it drives me crazy.

If you want to see a better chart with the exact same data, this is it right here. It

is a straight bar chart. It's flat, it's simple, it's as clean as possible. And this is better

in many ways. Most effective here is that it communicates clearly. There's no distractions,

it's a logical flow. This is going to get the point across so much faster.

And I can

give you another example of it; here's a chart from earlier about salaries or incomes. I have

a list here, I've got data scientist in it. If I want to draw attention to it, I have

the option of putting a circle around it and I can put a number next to it to explain it.

That's one way to make it easy to see what's going on. We don't even have to get fancy.

You know, I just got out a pen and a post-it note and I drew a bar chart of some real data

about life expectancy. This tells the story as well, that there is something terribly

amiss in Sierra Leone. But, now let's talk about creating narrative flow in your presentation

graphics. To do this, I am going to pull some charts from my most cited academic paper,

which is called, A Third Choice: A Review of Empirical Research on the Psychological

Outcomes of Restorative Justice. Think of it as mediation for juvenile crimes, mostly

juvenile. And this paper is interesting because really it's about fourteen bar charts with

just enough text to hold them together.

And you can see there's a flow. The charts are

very simple; this is judgments about whether the criminal justice system was fair. The

two bars on the left are victims; the two bars on the right are offenders. And for each

group on the left are people who participated in restorative justice, so more victim-offender

mediation for crimes. And for each set on the right are people who went through standard

criminal procedures. It says court, but it usually means plea bargaining. Anyhow, it's

really easy to see that in both cases the restorative justice bar is higher; people

were more likely to say it was fair. They also felt that they had an opportunity to

tell their story; that's one reason why they might think it's fair. They also felt the

offender was held accountable more often. In fact, if you go to court on the offenders,

that one's below fifty percent and that's the offenders themselves making the judgment.

Then you can go to forgiveness and apologies.

And again, this is actually a simple thing

to code and you can see there's an enormous difference. In fact, one of the reasons there

is such a big difference is that, in court proceedings, the offender very rarely

meets the victim. It also turns out I need to qualify this a little bit because a bunch

of the studies included drunk driving with no injuries or accidents. Well, when we take

them out, we see a huge change. And then we can go to whether a person is satisfied with

the outcome. Again, we see an advantage for restorative justice. Whether the victim is

still upset about the crime, now the bars are a little bit different. And whether they

are afraid of revictimization and that is over a two to one difference.

And then finally

recidivism for offenders or reoffending; and you see a big difference there. And so what

I have here is a bunch of charts that are very very simple to read, and they kind of

flow in how they're giving the overall impression and then detailing it a little bit more. There's

nothing fancy here, there's nothing interactive, there's nothing animated, there's nothing

kind of flowing in seventeen different directions. It's easy, but it follows a story and it tells

a narrative about the data and that should be your major goal with the presentation graphics.

In sum: presenting, or the graphics you use for presenting, are not the same as the graphics

you use for exploring. They have different needs and they have different goals. But no

matter what you are doing, be clear in your graphics and be focused in what you're trying

to tell. And above all create a strong narrative that gives different levels of perspective

and answers questions as you go to anticipate a client's questions and to give them the

most reliable solid information and the greatest confidence in your analysis.

The final element

of data science and communicating that I wanted to talk about is reproducible research. And

you can think of it as this idea; you want to be able to play that song again. And the

reason for that is data science projects are rarely "one and done;" rather they tend to

be incremental, they tend to be cumulative, and they tend to adapt to the circumstances

that they're working in. So, one of the important things here, probably, if you want to summarize

it very briefly, is this: show your work. There's a few reasons for this. You may have

to revise your research at a later date, your own analyses. You may be doing another project

and you want to borrow something from previous studies.

More likely you'll have to hand it

off to somebody else at a future point and they're going to have to be able to understand

what you did. And then there's the very significant issue in both scientific and economic research

of accountability. You have to be able to show that you did things in a responsible

way and that your conclusions are justified; that's for clients, funding agencies, regulators,

academic reviewers, any number of people.

Now, you may be familiar with the concept

of open data, but you may be less familiar with the concept of open data science; that's

more than open data. So, for instance, I'll just let you know there is something called

the Open Data Science Conference, at ODSC.com. And it meets three times a year in different

places. And this is entirely, of course, devoted to open data science: using open data,

but also making the methods transparent to people around them.

One thing that can make this

really simple is something called the Open Science Framework, which is at OSF.io. It's

a way of sharing your data and your research with an annotation on how you got through

the whole thing with other people. It makes the research transparent, which is what we

need. One of my professional organizations, the Association for Psychological Science

has a major initiative on this called open practices, where they are strongly encouraging

people to share their data as much as is ethically permissible and to absolutely share their

methods before they even conduct a study as a way of getting rigorous intellectual honesty

and accountability.

Now, another step in all of this is to archive your data, make that

information available, put it on the shelf. And what you want to do here is, you want

to archive all of your datasets; both the totally raw before you did anything with it

dataset, and every step in the process until your final clean dataset. Along with that,

you want to archive all of the code that you used to process and analyze the data.

If you used a programming language like R or Python, that's really simple. If you used

a program like SPSS you need to save the syntax files, and then that can be done that way.

And again, no matter what, make sure to comment liberally and explain yourself. Now, part

of that is you have to explain the process, because you are not just this lone person

sitting on the sofa working by yourself, you're with other people and you need to explain

why you did it the way that you did.

You need to explain the choices, the consequences of

those choices, the times that you had to backtrack and try it over again. This also works into

the principle of future-proofing your work. You want to do a few things here. Number one;

the data. You want to store the data in non-proprietary formats like a CSV or Comma Separated Values

file because anything can read CSV files. If you stored it in the proprietary SPSS .sav

format, you might be in a lot of trouble when somebody tries to use it later and they can't

open it.
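
As a small illustration of how little it takes to work with CSV, here's a sketch that writes a tiny dataset to a CSV file and reads it back using only Python's standard library. The file name clean_data.csv, the column names, and the two rows are made up for the example.

```python
import csv
import tempfile
from pathlib import Path

def save_csv(path, rows, fieldnames):
    """Write a list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def load_csv(path):
    """Read a CSV file back into a list of dictionaries (all values are text)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

# Made-up example rows.
rows = [
    {"participant": "p01", "score": 12},
    {"participant": "p02", "score": 15},
]

# A temporary folder keeps the sketch from leaving files behind.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "clean_data.csv"
    save_csv(path, rows, fieldnames=["participant", "score"])
    loaded = load_csv(path)

print(loaded)
```

One thing to keep in mind is that CSV stores everything as text, so the scores come back as strings and need converting to numbers again. That is part of the trade-off for a format that nearly any tool can open.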

Also, there's storage; you want to place all of your files in a secure, accessible

location; GitHub is probably one of the best choices. And then the code: you may want

to use a dependency management package like Packrat for R or a virtual environment

for Python as a way of making sure that there are always versions of the packages you use

that work, because sometimes things get updated and something breaks. This is a

way of making sure that the system that you have will always work.
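
A related and much smaller habit is simply to record which package versions you used. The sketch below does that with Python's standard importlib.metadata module; it is not a substitute for Packrat or for a virtual environment (which you can create with "python -m venv"), just a lightweight complement. The function name snapshot_versions is made up for the example.

```python
import sys
from importlib import metadata

def snapshot_versions():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip any broken install that lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)

if __name__ == "__main__":
    # Record the interpreter version alongside the package versions.
    print("# python " + ".".join(str(part) for part in sys.version_info[:3]))
    for line in snapshot_versions():
        print(line)
```

Saving that output next to your code and data gives whoever picks up the project later a record of the versions your analysis actually ran with.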

Overall, you can think

of this too: you want to explain yourself and a neat way to do that is to put your narrative

in a notebook. Now, you can have a physical lab book or you can also do digital books.

A really common one, especially if you're using Python, is Jupyter with a "y" there

in the middle. Jupyter notebooks are interactive notebooks. So, here's a screenshot of one

that I made in Python, and you have titles, you have text, you have the graphics. If you

are working in R, you can do this through something called RMarkdown, which works in

the same way: you do it in RStudio, you use Markdown, and you can annotate it. You can

get more information about that at rmarkdown.rstudio.com. And so for instance, here's an R analysis

I did, and you can see the code on the left and you see the markdown version on the

right.

What's neat about this is that this little bit of code here, this title and this

text and this little bit of R code, then is displayed as this formatted heading, as this

formatted text, and this turns into the entire R output right there. It's a great way to

do things. And if you do RMarkdown, you actually have the option of uploading the document

into something called RPubs; and that's an online document that can be made accessible

to anybody. Here's a sample document. And if you want to go see it, you can go to this

address. It's kind of long, so I am going to let you write that one down yourself. But,

in sum: here's what we have. You want to do your work and archive the information in a

way that supports collaboration. Explain your choices, say what you did, show how you did

it. This allows you to future-proof your work, so it will work in other situations for other

people.

And as much as possible, no matter how you do it, make sure you share your narrative

so people understand your process and they can see that your conclusions are justifiable,

strong and reliable. Now, something I've mentioned several times when talking about data science,

and I'll do it again in this conclusion, is that it's important to give people next steps.

And I'm going to do that for you right now. If you're wondering what to do after having

watched this very general overview course, I can give you a few ideas. Number one, maybe

you want to start trying to do some coding in R or Python; we have courses for those.

You might want to try doing some data visualization, one of the most important things that you

can do. You may want to brush up on statistics and maybe some math that goes along with it.

And you may want to try your hand at machine learning.

All of these will get you up and

rolling in the practice of data science. You can also try looking at data sourcing, finding

the information that you are going to work with. But, no matter what happens try to keep it in context.

So, for instance, data science can be applied to marketing, and sports, and health, and

education, and the arts, and really a huge number of other things. And we will have courses

here at datalab.cc that talk about all of those. You may also want to start getting

involved in the community of data science. One of the best conferences that you can go

to is O'Reilly Strata, which meets several times a year around the globe. There's also

Predictive Analytics World, again several times a year around the world. Then there are

much smaller conferences; I love Tapestry, at tapestryconference.com, which is about

storytelling in data science.

And Extract, a one-day conference about data stories that

is put on by import.io, one of the great data sourcing applications that's available for

scraping web data. If you want to start working with actual data, a great choice is to go

to Kaggle.com and they sponsor data science competitions, which actually have cash rewards.

There's also wonderful data sets you can work with there to find out how they work and compare

your results to those of other people. And once you are feeling comfortable with that,

you may actually try turning around and doing some service; datakind.org is the premier

organization for data science as humanitarian service.

They do major projects around the

world. I love their examples. There are other things you can do; there's an annual event

called Do Good Data, and then datalab.cc will be sponsoring twice-a-year data charrettes,

which are opportunities for people in the Utah area to work with the local nonprofits

on their data. But above all of this, I want you to remember this one thing: data science

is fundamentally democratic. It's something that everybody needs to learn to do in some

way, shape or form. The ability to work with data is a fundamental ability and everybody

would be better off by learning to work with data intelligently and sensitively. Or, to

put it another way: data science needs you. Thanks so much for joining me in this introductory

course and I hope it has been good and I look forward to seeing you in the other courses

here at datalab.cc. Welcome to "Data Sourcing". I'm Barton Poulson and in this course, we're

going to talk about Data Opus, which is Latin for Data Needed.

The idea here is that no

data, no data science; and that is a sad thing. So, instead of leaving it at that we're going

to use this course to talk about methods for measuring and evaluating data and methods

for accessing existing data and even methods for creating new, custom data. Take those

together and it's a happy situation. At the same time, we'll do all of this still at an

accessible, conceptual and non-technical level because the technical hands-on stuff will

happen in other, later courses. But for now, let's talk data. For data sourcing, the first

thing we want to talk about is measurement. And within that category, we're going to talk

about metrics. The idea here is that you actually need to know what your target is if you want

to have a chance to hit it.

There's a few particular reasons for this. First off, data

science is action-oriented; the goal is to do something as opposed to simply understand

something, which is something I say as an academic practitioner. Also, your goal needs

to be explicit and that's important because the goals can guide your effort. So, you want

to say exactly what you are trying to accomplish, so you know when you get there.

Also, goals

exist for the benefit of the client, and they can prevent frustration; they know what you're

working on, they know what you have to do to get there. And finally, the goals and the

metrics exist for the benefit of the analyst because they help you use your time well.

You know when you're done, you know when you can move ahead with something, and that makes

everything a little more efficient and a little more productive. And when we talk about this

the first thing you want to do is try to define success in your particular project or domain.

Depending on where you are, in commerce that can include things like sales, or click-through

rates, or new customers. In education it can include scores on tests; it can include graduation

rates or retention. In government, it can include things like housing and jobs. In research,

it can include the ability to serve the people that you're trying to better understand. So, whatever

domain you're in there will be different standards for success and you're going to need to know

what applies in your domain.

Next, are specific metrics or ways of measuring. Now again, there

are a few different categories here. There are business metrics, there are key performance

indicators or KPIs, there are SMART goals (that's an acronym), and there's also the

issue of having multiple goals. I'll talk about each of those for just a second now.

First off, let's talk about business metrics. If you're in the commercial world there are

some common ways of measuring success. A very obvious one is sales revenue; are you making

more money, are you moving the merchandise, are you getting sales. Also, there's the issue

of leads generated, new customers, or new potential customers because that, then, in

turn, is associated with future sales.

There's also the issue of customer value or lifetime

customer value, so you may have a small number of customers, but they all have a lot of revenue

and you can use that to really predict the overall profitability of your current system.

And then there's churn rate, which has to do with, you know, losing and gaining new

customers and having a lot of turnover. So, any of these are potential ways for defining

success and measuring it. These are potential metrics, there are others, but these are some

really common ones. Now, I mentioned earlier something called a key performance indicator

or KPI. KPIs come from David Parmenter and he's got a few ways of describing them, he

says a key performance indicator for business, number one, should be nonfinancial: not just

the bottom line, but something else that might be associated with it or that measures the

overall productivity of the organization.

They should be timely, for instance, weekly, daily,

or even constantly gathered information. They should have a CEO focus, so the senior management

teams are the ones who generally make the decisions that affect how the organization

acts on the KPIs. They should be simple, so everybody in the organization, everybody knows

what they are and knows what to do about them. They should be team-based, so teams can take

joint responsibility for meeting each one of the KPIs. They should have significant

impact, what that really means is that they should affect more than one important outcome,

so you can do profitability and market reach or improved manufacturing time and fewer defects.

And finally, an ideal KPI has a limited dark side, that means there are fewer possibilities

for reinforcing the wrong behaviors and rewarding people for sort of exploiting the system.

Next, there are SMART goals, where SMART stands for Specific, Measurable, Assignable to a

particular person, Realistic (meaning you can actually do it with the resources you

have at hand), and Time-bound, (so you know when it can get done). So, whenever you form

a goal you should try to assess it on each of these criteria and that's a way of saying

that this is a good goal to be used as a metric for the success of our organization.

Now,

the trick, however, is when you have multiple goals, multiple possible endpoints. And the

reason that's difficult is because, well, it's easy to focus on one goal if you're just

trying to maximize revenue or if you're just trying to maximize graduation rate. There's

a lot of things you can do. It becomes more difficult when you have to focus on many things

simultaneously, especially because some of these goals may conflict. The things that

you do to maximize one may impair the other. And so when that happens, you actually need

to start engaging in a deliberate process of optimization, you need to optimize. And

there are ways that you can do this if you have enough data; you can do mathematical

optimization to find the ideal balance of efforts to pursue one goal and the other goal

at the same time. Now, this is a very general summary and let me finish with this. In sum,

metrics or methods for measuring can help you stay aware of how well your organization is

functioning and how well you're reaching your goals. There are many different methods available

for defining success and measuring progress towards those things.
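
Here's a minimal sketch of what balancing two goals can look like, in Python. Everything in it is an assumption made for illustration: the two payoff curves, their weights, and the 40-hour budget are hypothetical, and a simple grid search stands in for whatever optimization method you would actually use.

```python
import math

# Hypothetical payoff curves with diminishing returns (assumptions for illustration).
def revenue_payoff(hours):
    return 10 * math.sqrt(hours)

def retention_payoff(hours):
    return 6 * math.sqrt(hours)

def best_split(total_hours):
    """Try every whole-hour split and return (hours_on_revenue, combined_payoff)."""
    best = (0, float("-inf"))
    for h in range(total_hours + 1):
        payoff = revenue_payoff(h) + retention_payoff(total_hours - h)
        if payoff > best[1]:
            best = (h, payoff)
    return best

hours_on_revenue, combined = best_split(40)
print(hours_on_revenue, 40 - hours_on_revenue, round(combined, 2))
```

The point of the sketch is that putting all the effort into one goal is not the best answer here; the best split gives some hours to each.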

The trick, however,

comes when you have to balance efforts to reach multiple goals simultaneously, which

can bring in the need for things like optimization. When talking about data sourcing and measurement,

one very important issue has to do with the accuracy of your measurements. The idea here

is that you don't want to have to throw away all your ideas; you don't want to waste effort.

One way of doing this in a very quantitative fashion is to make a classification table.

So, what that looks like is this, you talk about, for instance, positive results, negative

results… and in fact let's start by looking at the top here.

The middle two columns here

talk about whether an event is present, whether your house is on fire, or whether a sale occurs,

or whether you have got a tax evader, whatever. So, that's whether a particular thing is actually

happening or not. On the left here, is whether the test or the indicator suggests that the

thing is or is not happening.

And then you have these combinations of true positives;

where the test says it's happening and it really is, and false positives; where the

test says it's happening, but it is not, and then below that true negatives, where the

test says it isn't happening and that's correct and then false negatives, where the test says

there's nothing going on, but there is in fact the event occurring. And then you start

to get the column totals, the total number of events present or absent, then the row

totals about the test results. Now, from this table what you get is four kinds of accuracy,

or really four different ways of quantifying accuracy using different standards. And they

go by these names: sensitivity, specificity, positive predictive value, and negative predictive

value. I'll show you very briefly how each of them works. Sensitivity can be expressed

this way, if there's a fire does the alarm ring? You want that to happen.

And so, that's

a matter of looking at the true positives and dividing that by the total number of fires.

So, the test positive means there's an alarm and the event present means there's a fire;

you want it to always have an alarm when there's a fire. Specificity, on the other hand, is

sort of the flip side of this. If there isn't a fire, does the alarm stay quiet? This is

where you're looking at the ratio of true negatives to total absent events, where there's

no fire, and the alarms aren't ringing, and that's what you want.

Now, those are looking

at columns; you can also go sideways across rows. So, the first one there is positive

predictive value, often abbreviated as PPV, and we flip around the order a little bit.

This one says, if the alarm rings, was there a fire? So, now you're looking at the true

positives and dividing it by the total number of positives. Total number of positives is

any time the alarm rings. True positives are because there was a fire. And negative predictive

value, or NPV, says, if the alarm doesn't ring, does that in fact mean that there is no fire?

Well, here you are looking at true negatives and dividing it by total negatives, the time

that it doesn't ring.

And again, you want to maximize that so the true negatives account

for all of the negatives, the same way you want the true positives to account for all

of the positives and so on. Now, you can put numbers on all of these going from zero percent

to 100 percent and the idea is to maximize each one as much as you can. So, in sum, from these

tables we get four kinds of accuracy and there's a different focus for each one. But, the same

overall goal, you want to identify the true positives and true negatives and avoid the

false positives and the false negatives. And this is one way of putting numbers on,

an index really, on the accuracy of your measurement.
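
The four measures can be sketched in a few lines of Python. The counts below are made up for the fire alarm example, and the function name is just an illustrative choice: sensitivity and specificity divide by the column totals (events present or absent), while the two predictive values divide by the row totals (test positive or negative).

```python
def accuracy_measures(tp, fp, fn, tn):
    """Return the four accuracy measures from the counts in a classification table."""
    return {
        "sensitivity": tp / (tp + fn),   # of all fires, share where the alarm rang
        "specificity": tn / (tn + fp),   # of all non-fires, share where it stayed quiet
        "ppv": tp / (tp + fp),           # of all alarms, share with a real fire
        "npv": tn / (tn + fn),           # of all quiet periods, share with no fire
    }

# Hypothetical counts for a fire alarm (made up for illustration).
measures = accuracy_measures(tp=40, fp=10, fn=5, tn=945)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```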

Now data sourcing may seem like a very quantitative

topic, especially when we're talking about measurement. But, I want to mention one important

thing here, and that is the social context of measurement. The idea here really, is that

people are people, and they all have their own goals, and they're going their own ways.

And we all have our own thoughts and feelings that don't always coincide with each other,

and this can affect measurement. And so, for instance, when you're trying to define your

goals and you're trying to maximize them you want to look at things like, for instance,

the business model. An organization's business model, the way they conduct their business,

the way they make their money, is tied to its identity and its reason to be.

And if

you make a recommendation and it's contrary to their business model, that can actually

be perceived as a threat to their core identity, and people tend to get freaked out in that

situation. Also, restrictions, so for instance, there may be laws, policies, and common practices,

both organizationally and culturally, that may limit the ways the goals can be met. Now,

most of these make a lot of sense, so the idea is you can't just do anything you want,

you need to have these constraints. And when you make your recommendations, maybe you'll

work creatively in them as long as you're still behaving legally and ethically, but

you do need to be aware of these constraints. Next, is the environment. And the idea here

is that competition occurs both between organizations, company A here is trying to reach a goal,

but they're competing with company B over there, but probably even more significantly

there is competition within the organization.

This is really a recognition of office politics.

And when you, as a consultant, make a recommendation based on your analysis, you need to understand

you're kind of dropping a little football into the office and things are going to further

one person's career, maybe to the detriment of another. And in order for your recommendations

to have maximum effectiveness they need to play out well in the office. That's something

that you need to be aware of as you're making your recommendations. Finally, there's the

issue of manipulation. And a sad truism about people is that any reward system, any reward

system at all, will be exploited and people will generally game the system.

This happens

especially when you have a strong cut off; you need to get at least 80 percent, or you

get fired and people will do anything to make their numbers appear to be eighty percent.

This happens an awful lot when you look at executive compensation systems, it happens a

lot when you have very high-stakes school testing, it happens in an enormous number of situations,

and so, you need to be aware of the risk of exploitation and gaming. Now, don't think,

then, that all is lost. Don't give up, you can still do really wonderful assessment,

you can get good metrics, just be aware of these particular issues and be sensitive to

them as you both conduct your research and as you make your recommendations. So, in sum,

social factors affect goals and they affect the way you meet those goals. There are limits

and consequences, both on how you reach the goals and, really, on what the goals should

be. And when you're giving advice on how to reach those goals, please be sensitive to

how things play out with metrics and how people will adapt their behavior to meet the goals.

That way you can make something that's more likely to be implemented the way you meant

and more likely to predict accurately what can happen with your goals.

When it comes

to data sourcing, obviously the most important thing is to get data. But the easiest way

to do that, at least in theory, is to use existing data. Think of it as going to the

bookshelf and getting the data that you have right there at hand. Now, there's a few different

ways to do this: you can get in-house data, you can get open data, and you can get third-party

data. Another nice way to think of that is proprietary, public, and purchased data; the

three Ps I've heard it called. Let's talk about each of these a little bit more.

So,

in-house data, that's stuff that's already in your organization. What's nice about that,

it can be really fast and easy, it's right there and the format may be appropriate for

the kind of software in the computer that you are using. If you're fortunate, there's

good documentation, although sometimes when it's in-house people just kind of throw it

together, so you have to watch out for that. And there's the issue of quality control.

Now, this is true with any kind of data, but you need to pay attention with in-house, because

you don't know the circumstances necessarily under which people gathered the data and how

much attention they were paying to something.

There's also an issue of restrictions; there

may be some data that, while it is in-house, you may not be allowed to use, or you may

not be able to publish the results or share the results with other people. So, these are

things that you need to think about when you're going to use in-house data, in terms of how

can you use it to facilitate your data science projects. Specifically, there are a few pros

and cons. In-house data is potentially quick, easy, and free. Hopefully it's standardized;

maybe even the original team that conducted this study is still there. And you might have

identifiers in the data which make it easier for you to do an individual level analysis.

On the con side however, the in-house data simply may not exist, maybe it's just not

there.

Or the documentation may be inadequate and of course, the quality may be uncertain.

Always true, but may be something you have to pay more attention to when you're using

in-house data. Now, another choice is open data like going to the library and getting

something. This is prepared data that's freely available, consists of things like government

data and corporate data and scientific data from a number of sources. Let me show you

some of my favorite open data sources just so you know where they are and that they exist.

Probably, the best one is data.gov here in the US.

That is the, as it says right here,

the home of the US government's open data. Or, you may have a state level one. For instance,

I'm in Utah and we have data.utah.gov, also a great source of more regional information.

If you're in Europe, you have open-data.europa.eu, the European Union open data portal. And then

there are major non-profit organizations, so the UN has unicef.org/statistics for their

statistical and monitoring data. The World Health Organization has the global health

observatory at who.int/gho. And then there are private organizations that work in the

public interest, such as the Pew Research Center, which shares a lot of its data sets

and the New York Times, which makes it possible to use APIs to access a huge amount of the

data of things they've published over a huge time span.

And then two of the mother lodes,

there's Google, which at google.com has public data which is a wonderful thing. And then

Amazon at aws.amazon.com/datasets has gargantuan datasets. So, if you needed a data set that

was like five terabytes in size, this is the place that you would go to get it. Now, there's

some pros and cons to using this kind of open data. First, is that you can get very valuable

datasets that maybe cost millions of dollars to gather and to process.

And you can get

a very wide range of topics and times and groups of people and so on. And often, the

data is very well formatted and well documented. There are, however, a few cons. Sometimes

there's biased samples. Say, for instance, you only get people who have internet access,

and that can mean, not everybody. Sometimes the meaning of the data is not clear or it

may not mean exactly what you want it to. A potential problem is that sometimes you

may need to share your analyses and if you are doing proprietary research, well, it's

going to have to be open instead, so that can create a crimp with some of your clients.

And then finally there are issues with privacy and confidentiality and in public data that

usually means that the identifiers are not there and you are going to have to work at

a larger aggregate level of measurement. Another option is to use data from a third-party,

these go by the name Data as a Service or DaaS.

You can also call them data brokers.

And the thing about data brokers is they can give you an enormous amount of data on many

different topics, plus they can save you some time and effort, by actually doing some of

the processing for you. And that can include things like consumer behaviors and preferences,

they can get contact information, they can do marketing identity and finances, there's

a lot of things. There's a number of data brokers around, here's a few of them.

Acxiom

is probably the biggest one in terms of marketing data. There's also Nielsen which provides

data primarily for media consumption. And there's another organization Datasift, that's

a smaller newer one. And there's a pretty wide range of choices, but these are some

of the big ones. Now, the thing about using data brokers, there's some pros and there's

some cons. The pros are first, that it can save you a lot of time and effort. It can

also give you individual level data which can be hard to get from open data. Open data

is usually at the community level; they can give you information about specific consumers.

They can even give you summaries and inferences about things like credit scores and marital

status.

Possibly even whether a person gambles or smokes. Now, the con is this, number 1

it can be really expensive, I mean this is a huge service; it provides a lot of benefit

and is priced accordingly. Also, you still need to validate it, you still need to double

check that it means what you think it means and that it works in with what you want. And

probably the real sticking point here is the use of third-party data is distasteful to

many people, and so you have to be aware that as you're making your choices. So, in sum,

as far as sourcing existing data goes, obviously data science needs data and there's

the three Ps of data sources, Proprietary and Public and Purchased.

But no matter what

source you use, you need to pay attention to quality and to the meaning and the usability

of the data to help you along in your own projects. When it comes to data sourcing,

a really good way of getting data is to use what are called APIs. Now, I like to think

of these as the digital version of Prufrock's mermaids. If you're familiar with "The Love

Song of J. Alfred Prufrock" by T. S. Eliot, he says, "I have heard the mermaids singing,

each to each," that's T. S. Eliot.

And I like to adapt that to say, "APIs have heard apps

singing, each to each," and that's by me. Now, more specifically when we talk about

an API, what we're talking about is something called Application Programming Interface,

and this is something that allows programs to talk to each other. Its most important

use, in terms of data science, is it allows you to get web data. It allows your program

to directly go to the web, on its own, grab the data, bring it back in almost as though

it were local data, and that's a really wonderful thing. Now, the most common version of APIs

for data science are called REST APIs; that stands for Representational State Transfer.

That's the software architectural style of the world wide web and it allows you to access

data on web pages via HTTP, that's the hypertext transfer protocol.

They, you know, run the

web as we know it. And when you download the data, you usually get it in JSON format,

that stands for JavaScript Object Notation. The nice thing about that is that's human

readable, but it's even better for machines. Then you can take that information and you

can send it directly to other programs. And the nice thing about REST APIs is that they're

what is called language agnostic, meaning any programming language can call a REST API,

can get data from the web, and can do whatever it needs to with it. Now, there are a few

kinds of APIs that are really common. The first is what are called Social APIs; these

are ways of interfacing with social networks.

So, for instance, the most common one is Facebook;

there's also Twitter. Google Talk has been a big one and FourSquare as well and then

SoundCloud. These are on lists of the most popular ones. And then there are also what

are called Visual APIs, which are for getting visual data, so for instance, Google Maps

is the most common, but YouTube is something that accesses YouTube on a particular website

or AccuWeather which is for getting weather information.

Pinterest for photos, and Flickr

for photos as well. So, these are some really common APIs and you can program your computer

to pull in data from any of these services and sites and integrate it into your own website

or here into your own data analysis. Now, there's a few different ways you can do this.

You can program it in R, the statistical programming language, you can do it in Python, also you

can even use it in the very basic BASH command line interface, and there's a ton of other

applications.

Basically, anything can access an API one way or another. Now, I'd like to

show you how this works in R. So, I'm going to open up a script in RStudio and then I'm

going to use it to get some very basic information from a webpage. Let me go to RStudio and show

you how this works. Let me open up a script in RStudio that allows me to do some data

sourcing here. Now, I'm just going to use a package called jsonlite; I'm going to load

that one up, and then I'm going to go to a couple of websites. I'm going to be getting historical

data from Formula One car races and I'm going to be getting it from Ergast.com. Now, if

we go to this page right here, I can go straight to my browser right now.

And this is what

it looks like; it gives you the API documentation, so what you're doing for an API, is you're

just entering a web address and in that web address it includes the information you want.

I'll go back to R here just for a second. And if I want to get information about 1957

races in JSON format, I go to this address. I can skip over to that for a second, and

what you see is it's kind of a big long mess here, but it is all labeled and it is clear

to the computer what's going on here. Let's go back to R. And so what I'm going to do

is, I am going to save that URL into an object here, in R, and then I'm going to use the

command fromJSON to read that URL and save it into R, which it has now done. And

I'm going to zoom in on that so you can see what's happened. I've got this sort of mess

of text, this is actually a list object in R.

And then I'm going to get just the structure

of that object, so I'm going to do this one right here and you can see that it's a list

and it gives you the names of all the variables within each one of the lists. And what I'm

going to do is, I'm going to convert that list to a data frame.

I went through the list

and found where the information I wanted was located, you have to use this big long statement

here, that will give me the names of the drivers. Let me zoom in on that again. There they are.

And then I'm going to get just the column names for that bit of the data frame. So,

what I have here is six different variables. And then what I'm going to do is, I'm going

to pick just the first five cases and I'm going to select some variables and put them

in a different order.

And when I do that, this is what I get. I will zoom in on that

again. And the first five people listed in this data set that I pulled from 1957, are

Juan Fangio, makes sense one of the greatest drivers ever, and other people who competed

in that year. And so what I've done is by using this API call in R, a very simple thing

to do, I was able to pull data off that webpage in a structured format, and do a very simple

analysis with it.
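
The same steps can be sketched in Python using only the standard library. This is not the R script from the demonstration, just a rough equivalent under stated assumptions: the URL follows the Ergast address pattern as I understand it and may have changed or no longer be available, the nested key names are assumed, and the sample payload is hand-written in that assumed shape so the example runs without a network connection.

```python
import json
from urllib.request import urlopen

# Assumed endpoint, modeled on the Ergast URL pattern; it may have changed or be offline.
URL = "http://ergast.com/api/f1/1957/drivers.json"

def fetch_json(url):
    """Download a URL and parse the body as JSON (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)

def drivers_table(payload, columns, limit=5):
    """Dig the driver records out of the nested JSON and keep chosen columns."""
    drivers = payload["MRData"]["DriverTable"]["Drivers"]
    return [{col: d[col] for col in columns} for d in drivers[:limit]]

# Hand-written sample in the assumed shape of the response, so this runs offline.
SAMPLE = json.loads("""
{"MRData": {"DriverTable": {"season": "1957", "Drivers": [
  {"driverId": "fangio", "givenName": "Juan", "familyName": "Fangio",
   "nationality": "Argentine"},
  {"driverId": "moss", "givenName": "Stirling", "familyName": "Moss",
   "nationality": "British"}
]}}}
""")

rows = drivers_table(SAMPLE, ["familyName", "givenName", "nationality"])
for row in rows:
    print(row)
```

To try it against a live service, you would pass the result of fetch_json(URL) to drivers_table in place of SAMPLE.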

And let's sum up what we've learned from all this. First off, APIs make

it really easy to work with web data, they structure, they call it for you, and then

they feed it straight into the program for you to analyze. And they are one of the best

ways of getting data and getting started in data science. When you're looking for data,

another great way of getting data is through scraping. And what that means is pulling information

from webpages. I like to think of it as when data is hiding in the open; it's there, you

can see it, but there's not an easy, immediate way to get that data. Now, when you're dealing

with scraping, you can get data in several different formats. You can get HTML text from

webpages, you can get HTML tables from the rows and columns that appear on webpages.

You can scrape data from PDFs, and you can scrape all sorts of data from images

and video and audio.

Now, we will make one very important qualification before we say

anything else: pay attention to copyright and privacy. Just because something is on

the web, doesn't mean you're allowed to pull it out. Information gets copyrighted, and

so when I use examples here, I make sure that this is stuff that's publicly available, and

you should do the same when you are doing your own analyses. Now, if you want to scrape

data there's a couple of ways to do it.

Number one, is to use apps that are developed for

this. So, for instance, import.io is one of my favorites. It is both a webpage, that's

its address, and it's a downloadable app. There's also ScraperWiki. There's an application

called Tabula, and you can even do scraping in Google Sheets, which I will demonstrate

in a second, and Excel. Or, if you don't want to use an app or if you want to do something

that apps don't really let you do, you can code your scraper. You can do it directly

in R, or Python, or Bash, or even Java or PHP.

Now, what you're going to do is you're

going to be looking for information on the webpage. If you're looking for HTML text,

what you're going to do is pull structured text from webpages, similar to how a reader

view works in a browser. It uses HTML tags on the webpage to identify what's the important

information. So, there's things like body, and h1 for header one, and p for paragraph,

and the angle brackets. You can also get information from HTML tables, although this is a physical

table of rows and columns I am showing you. This also uses HTML table tags, that is like

table, and tr for table row, and td for table data, that's the cell. The trick is when you're

doing this, you need the table number and sometimes you just have to find that through

trial and error.
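
To make the table tags and the table number concrete, here's a small Python sketch that uses only the standard library. The class name, the helper function, and the sample page are all made up for illustration, and the sketch assumes simple, non-nested tables; real pages are often messier.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []      # all tables found so far
        self._cell = None     # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

def get_table(html, number):
    """Return table `number` (counting from 1) from an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[number - 1]

# Made-up page with two tables; the second holds the data we want.
PAGE = """
<html><body><h1>Results</h1>
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><th>Chef</th><th>Wins</th></tr>
  <tr><td>Chef A</td><td>12</td></tr>
  <tr><td>Chef B</td><td>9</td></tr>
</table>
</body></html>
"""

print(get_table(PAGE, 2))
```

Asking for table 1 here would return the navigation table instead, which is why finding the right table number can take some trial and error.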

Let me give you an example of how this works. Let's take a look at this

Wikipedia page on the Iron Chef America Competition. I'm going to go to the web right now and show

you that one. So, here we are in Wikipedia, Iron Chef America. And if you scroll down

a little bit, you see we have got a whole bunch of text here, we have got our table

of contents, and then we come down here, we have a table that lists the winners, the statistics

for the winners. And let's say we want to pull that from this webpage into another program

for us to analyze. Well, there is an extremely easy way to do this with Google Sheets. All

we need to do is open up the Google Sheet and in cell A1 of that Google Sheet, we paste

in this formula. It's IMPORTHTML, then you give the webpage and then you say that you

are importing a table, you have to put that stuff in quotes, and the index number for

the table.

I had to poke around a little bit to figure out this was table number 2. So,

let me go to Google Sheets and show you how this works. Here I have a Google Sheet and

right now it's got nothing in it. But watch this; if I come here to this cell, and I simply

paste in that information, all the stuff just sort of magically propagates into the sheet,

makes it extremely easy to deal with, and now I can, for instance, save this as a CSV

file, put it in another program. Lots of options. And so this is one way that I'm scraping the

data from a webpage because I didn't use an API, but I just used a very simple, one-line

command to get the information. Now, that was a HTML table. You can also scrape data

from PDFs. You have to be aware of whether it's a native PDF, I call that a text PDF, or a

scanned or imaged PDF. And what it does with native PDFs, it looks for text elements; again

those are like code that indicates this is text. And you can deal with Raster images,

that's pixel images, or vector, which draws the lines, and that's what makes them infinitely

scalable in many situations.

And then in PDFs, you can deal with tabular data, but you probably

have to use a specialized program like ScraperWiki or Tabula in order to get that. And

then finally media, like images and video and audio. Getting images is easy; you can

download them in a lot of different ways. And then if you want to read data from them,

say for instance, you have a heat map of a country, you can go through it, but you will

probably have to write a program that loops through the image pixel-by-pixel to read the

data and then encode it numerically into your statistical program.
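
As a rough sketch of that pixel-by-pixel loop, here's a Python example on a tiny made-up image. The grid of color values and the rule for turning a color into a number are both assumptions for illustration; with a real heat map you would load the pixels with an imaging library and match the colors against the map's legend.

```python
# A tiny made-up "heat map": a grid of (red, green, blue) pixels.
# In practice an imaging library would supply this grid; that step is omitted here.
IMAGE = [
    [(255, 0, 0), (255, 128, 128), (255, 255, 255)],
    [(255, 128, 128), (255, 255, 255), (255, 255, 255)],
]

def pixel_to_value(pixel):
    """Map a pixel to 0..1, assuming paler red means a lower value (an assumption)."""
    red, green, blue = pixel
    return round(1 - green / 255, 2)

def image_to_matrix(image):
    """Loop through the image pixel by pixel and encode each one numerically."""
    matrix = []
    for row in image:
        matrix.append([pixel_to_value(pixel) for pixel in row])
    return matrix

values = image_to_matrix(IMAGE)
print(values)
```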

Now, that's my very brief

summary and let's summarize that. First off, if the data you are trying to get at doesn't

have an existing API, you can try scraping and you can write code in a language like

R or Python. But, no matter what you do, be sensitive to issues of copyright and privacy,

so you don't get yourself in hot water, but instead, you make an analysis that can be

of use to you or to your client. The next step in data sourcing is making data.

And

specifically, we're talking about getting new data. I like to think of this as, you're

getting your hands on and you're getting "data de novo," new data. So, can't find the data

that you need for your analysis? Well, one simple solution is, do it yourself. And we're

going to talk about a few general strategies used for doing that. Now, these strategies

vary on a few dimensions. First off is the role.

Are you passive and simply observing

stuff that's happening already, or are you active where you play a role in creating the

situation to get the data? And then there's the "Q/Q question," and that is, are you going

to get quantitative, or numerical, data, or are you going to get qualitative data, which

usually means text, paragraphs, sentences as well as things like photos and videos and

audio? And also, how are you going to get the data? Do you want to get it online, or

do you want to get it in person? Now, there's other choices than these, but these are some

of the big delineators of the methods. When you look at those, you get a few possible

options.

Number one is interviews, and I'll say more about those. Another one is surveys.

A third one is card sorting. And a fourth one is experiments, although I actually want

to split experiments into two kinds of categories. The first one is laboratory experiments, and

that's in-person projects where you shape the information or an experience for the participants

as a way of seeing how that involvement changes their reactions. It doesn't necessarily mean

that you're a participant, but you create the situation. And then there's also A/B testing.

This is automated, online testing of two or more variations on a webpage. It's a very,

very simple kind of experimentation that's actually very useful for optimizing websites.

So, in sum, from this very short introduction make sure you can get exactly what you need.

Get the data you need to answer your question. And if you can't find it somewhere, then make

it. And, as always, you have many possible methods, each of which has its own strengths

and its own compromises. And we'll talk about each of those in the following sections.

The first method of data sourcing where you're making new data that I want to talk about

is interviews.

And that's not because it's the most common, but because it's the one

you would do for the most basic problem. Now, basically an interview is nothing more than

a conversation with another person or a group of people. And, the fundamental question is,

why do interviews as opposed to doing a survey or something else? Well, there's a few good

reasons to do that. Number one: you're working with a new topic and you don't know what people's

responses will be, how they'll react. And so you need something very open-ended. Number

two: you're working with a new audience and you don't know how they will react in particular

to what it is you're trying to do. And number three: something's going on with the current

situation, it's not working anymore, and you need to find what's going on, and you need

to find ways to improve.

The open-ended information where you get past your existing categories

and boundaries can be one of the most useful methods for getting that data. If you want

to put it another way, you want to do interviews when you don't want to constrain responses.

Now, when it comes to interviews, you have one very basic choice, and that's whether

you do a structured interview. And with a structured interview, you have a predetermined

set of questions, and everyone gets the same questions in the same order.

It gives a lot

of consistency even though the responses are open-ended. And then you can also have what's

called an unstructured interview. And this is a whole lot more like a conversation where

you as the interviewer and the person you're talking to – your questions arise in response

to their answers. Consequently, an unstructured interview can be different for each person

that you talk to. Also, interviews are usually done in person, but not surprisingly, they

can be done over the phone, or often online. Now, a couple of things to keep in mind about

interviews.

Number one is time. Interviews can range from just a few minutes to several

hours per person. Second is training. Interviewing's a special skill that usually requires specific

training. Now, asking the questions is not necessarily the hard part. The really tricky

part is the analysis. The hardest part of interviews by far is analyzing the answers

for themes, and finding ways of extracting the new categories and the dimensions that you need

for your further research. The beautiful thing about interviews is that they allow you to

learn things that you never expected. So, in sum: interviews are best for new situations

or new audiences. On the other hand, they can be time-consuming, and they also require

special training; both to conduct the interview, but also to analyze the highly qualitative

data that you get from them. The next logical step in data sourcing and making data is surveys.

Now, think of this: if you want to know something just ask.

That's the easy way. And you want

to do a survey under certain situations. The real question is, do you know your topic and

your audience well enough to anticipate their answers? To know what the range of their answers

and the dimensions and the categories that are going to be important. If you do, then

a survey might be a good approach. Now, just as there were a few dimensions for interviews,

there are a few dimensions for surveys. You can do what is called a closed-ended survey;

that is also called a forced choice. It is where you give people just particular options,

like a multiple choice. You can have an open-ended survey, where you have the same questions

for everybody, but you allow them to write in a free-form response. You can do surveys

in person and you can also do them online or over the mail or phone or however.

And

now, it is very common to use software when doing surveys. Some really common applications

for online surveys are SurveyMonkey, and Qualtrics, or at the very simple end there is Google

Forms, and at the simple and pretty end there is Typeform. There are a lot more choices,

but these are some of the major players and how you can get data from online participants

in survey format. Now, the nice thing about surveys is, they are really easy to do, they

are very easy to set up and they are really easy to send out to large groups of people.

You can get tons of data really fast.

On the other hand, the same way that they are easy

to do, they are also really easy to do badly. The problem is that the questions you ask,

they can be ambiguous, they can be double-barreled, they can be loaded and the response scales

can be confusing. So, if you say, "I never think this particular way" and the person

puts strongly disagree, they may not know exactly what you are trying to get at. So,

you have to take special effort to make sure that the meaning is clear, unambiguous, and

that the rating scale, the way that people respond, is very clear and they know where

their answer falls. Which gets us into one of the things about people behaving badly

and that is beware the push poll. Now, especially during election time; like we are in right

now, a push poll is something that sounds like a survey, but really what it is is a

very biased attempt to get data, just fodder for social media campaigns or I am going to

make a chart that says that 98% of people agree with me.

A push poll is one that is

so biased, there is really only one way to answer the questions. This is considered

extremely irresponsible and unethical from a research point of view. Just hang up on

them. Now, aside from that egregious violation of research ethics, you do need to do other

things like watch out for bias in the question wording, in the response options, and also

in the sample selection because any one of those can push your responses off one way

or another without you really being aware that it is happening.

So, in sum, let's say

this about surveys. You can get lots of data quickly, on the other hand, it requires familiarity

with the possible answers in your audience. So, you know, sort of, what to expect. And

no matter what you do, you need to watch for bias to make sure that your answers are going

to be representative of the group that you are really concerned about understanding.

An interesting topic in Data Sourcing when you are making data is Card Sorting. Now,

this isn't something that comes up very often in academic research, but in web research,

this can be a really important method. Think of it as what you are trying to do is like

building a model of a molecule here, you are trying to build a mental model of people's

mental structures.

Put more specifically, how do people organize information intuitively?

And also, how does that relate to the things that you are doing online? Now, the basic

procedure goes like this: you take a bunch of little topics and you write each one on

a separate card. And you can do this physically, with like three by five cards, or there are

a lot of programs that allow you to do a digital version of it. Then what you do is you give

this information to a group of respondents and the people sort those cards. So, they

put similar topics with each other, different topics over here and so on. And then you take

that information and from that you are able to calculate what is called, dissimilarity

data. Think of it as like the distance or the difference between various topics. And

that gives you the raw data to analyze how things are structured. Now, there are two

very general kinds of card sorting tasks. There are generative and there's evaluative.

A generative card sorting task is one in which respondents create their own sets, their own

piles of cards using any number of groupings they like.

And this might be used, for instance,

to design a website. If people are going to be looking for one kind of information next

to another one, then you are going to want to put that together on the website, so they

know where to expect it. On the other hand, if you've already created a website, then

you can do an evaluative card sorting. This is where you have a fixed number or fixed

names of categories. Like for instance, the way you have set up your menus already. And

then what you do is you see if people actually put the cards into these various categories

that you have created.

That's a way of verifying that your hierarchical structure makes sense

to people. Now, whichever method you do, generative or evaluative, what you end up with when you

do a card sort is an interesting kind of visualization called a dendrogram. That

actually means branches. And what we have here is actually a hundred and fifty data

points; if you are familiar with the Fisher's Iris data, that's what's going on here. And

it groups it from one giant group on the left and then splits it in pieces and pieces and

pieces until you end up with lots of different observations, well actually, individual-level

observations at the end. But you can cut things off into two or three groups or whatever is

most useful for you here, as a way of visualizing the entire collection of similarity or dissimilarity

between the individual pieces of information that you had people sort. Now, I will just

mention very quickly if you want to do a digital card sorting, which makes your life infinitely

easier because keeping track of physical cards is really hard. You can use something like

Optimal Workshop, or UserZoom or UX Suite.

These are some of the most common choices.
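
To make the dissimilarity data a little more concrete, here is a minimal sketch of one way it could be computed from a set of card sorts. I am assuming a simple measure: the proportion of respondents who did not put two cards in the same pile. The card names and the four respondents are invented, and real card sorting software may well use a different measure.

```python
from itertools import combinations

# Hypothetical sorts: one entry per respondent, each a list of piles of cards.
sorts = [
    [{"Pricing", "Plans"}, {"Contact", "Support"}],
    [{"Pricing", "Plans", "Support"}, {"Contact"}],
    [{"Pricing", "Plans"}, {"Contact", "Support"}],
    [{"Pricing"}, {"Plans"}, {"Contact", "Support"}],
]


def dissimilarity(sorts):
    """Return {(card_a, card_b): share of respondents who separated them}."""
    cards = sorted(set().union(*(pile for piles in sorts for pile in piles)))
    together = {pair: 0 for pair in combinations(cards, 2)}
    for piles in sorts:
        for pile in piles:
            # Count every pair of cards that shares a pile for this respondent.
            for pair in combinations(sorted(pile), 2):
                together[pair] += 1
    n = len(sorts)
    return {pair: 1 - count / n for pair, count in together.items()}


distances = dissimilarity(sorts)
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, d)
```

A table of pairwise distances like this is the usual input to hierarchical clustering, which is the kind of analysis that produces a dendrogram.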

Now, let's just sum up what we've learned about card sorting in this extremely brief

overview. Number one, card sorting allows you to see intuitive organization of information

in a hierarchical format. You can do it with physical cards or you can also have digital

choices for doing the same thing. And when you are done, you actually get this hierarchical

or branched visualization of how the information is structured and related to each other. When

you are doing your Data Sourcing and you are making data, sometimes you can't get what

you want through the easy ways, and you've got to take the hard way.

And you can do what

I am calling laboratory experiments. Now of course, when I mention laboratory experiments

people start to think of stuff like, you know, doctor Frankenstein in his lab, but lab experiments

are less like this and in fact they are a little more like this. Nearly every experiment

I have done in my career has been a paper and pencil one with people in a well-lighted

room and it's not been the threatening kind.

Now, the reason you do a lab experiment is

because you want to determine cause and effect. And this is the single most theoretically

viable way of getting that information. Now, what makes an experiment an experiment is

the fact that researchers play active roles in experiments with manipulations. Now, people

get a little freaked out when they hear manipulations, think that you are coercing people and messing

with their mind. All that means is that you are manipulating the situation; you are causing

something to be different for one group of people or for one situation than another.

It's a benign thing, but it allows you to see how people react to those different variations.

Now, if you are going to do an experiment, you are going to want to have focused research;

it is usually done to test one thing or one variation at a time. And it is usually hypothesis-driven;

usually you don't do an experiment until you have done enough background research to say,

"I expect people to react this way to this situation and this way to the other." A key

component to all of this is that experiments almost always have random assignment regardless

of how you got your sample, when they are in your study, you randomly assign them to

one condition or another.

And what that does is it balances out the pre-existing differences

between groups and that's a great way of taking care of confounds and artifacts. The things

that are unintentionally associated with differences between groups that provide alternate explanations

for your data. If you have done good random assignment and you have a large enough group

of people, then those confounds and artifacts are basically minimized.
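
Random assignment itself is simple to do in code. Here is a minimal sketch, assuming you already have a list of participant IDs and two conditions; the IDs, the condition names, and the function name assign are all placeholders of mine. It shuffles the participants and then deals them out so the groups end up the same size, or within one of each other.

```python
import random


def assign(participants, conditions, seed=None):
    """Randomly assign participants to conditions with near-equal group sizes."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal the shuffled participants out like cards, one condition at a time.
    return {
        condition: shuffled[i::len(conditions)]
        for i, condition in enumerate(conditions)
    }


# Hypothetical participant IDs P01 through P20.
participants = [f"P{n:02d}" for n in range(1, 21)]
groups = assign(participants, ["control", "treatment"], seed=42)

for condition, members in groups.items():
    print(condition, len(members), members)
```

Because the shuffle ignores everything about the participants, any pre-existing differences tend to be spread evenly across the groups, and more so as the groups get larger.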

Now, some places

where you are likely to see laboratory experiments in this version are, for instance, eye tracking

and web design. This is where you have to bring people in front of a computer and you

stick a thing there that sees where they are looking. That's how we know for instance that

people don't really look at ads on the side of web pages. Another very common place is

research in medicine and education and in my field, psychology. And in all of these,

what you find is that experimental research is considered the gold standard for reliable

valid information about cause and effect. On the other hand, while it is a wonderful

thing to have, it does come at a cost. Here's how that works. Number 1, experimentation

requires extensive, specialized training. It is not a simple thing to pick up. Two,

experiments are often very time consuming and labor intensive. I have known some that

take hours per person. And number three, experiments can be very expensive. So, what that all means

is that you want to make sure that you have done enough background research and you need

to have a situation where it is sufficiently important to get really reliable cause and

effect information to justify these costs for experimentation.

In sum, laboratory experimentation

is generally considered the best method for causality or assessing causality. That's because

it allows you to control for confounds through randomization. On the other hand, it can be

difficult to do. So, be careful and thoughtful when considering whether you need to do an

experiment and how to actually go about doing it. There's one final procedure I want to

talk about in terms of Data Sourcing and Making New Data. It's a form of experimentation and

it is simply called A/B testing and it's extremely common in the web world.

So, for instance,

I just barely grabbed a screenshot of Amazon.com's homepage and you've got these various elements

on the homepage and I just noticed, by the way, when I did this that this woman is actually

an animated gif, so she moves around. That was kind of weird; I have never seen that

before. But the thing about this, is this entire layout, how things are organized and

how they are on there, will have been determined by variations on A/B testing by Amazon. Here's

how it works. For your webpage, you pick one element like what's the headline or what are

the colors or what's the organization or how do you word something and you create multiple

versions, maybe just two, version A and version B, which is why it's called A/B testing.

Then when

people visit your webpage you randomly assign these visitors to one version or another,

you have software that does that for you automatically. And then you compare the response rates on

some response. I will show you those in a second. And then, once you have enough data,

you implement the best version, you sort of set that one solid and then you go on to something

else. Now, in terms of response rates, there are a lot of different outcomes you can look

at. You can look at how long a person is on a page, you can actually do mouse tracking

if you want to. You can look at click-throughs, you can also look at shopping cart value or

abandonment. A lot of possible outcomes. All of these contribute through A/B testing to

the general concept of website optimization; to make your website as effective as it can

possibly be.

Now, the idea also is that this is something that you are going to do a lot.

You can perform A/B tests continually. In fact, I have seen one person say that what

A/B testing really stands for is always be testing. Kind of cute, but it does give you

the idea that improvement is a constant process. Now, if you want some software to do A/B testing,

two of the most common choices are Optimizely and VWO, which stands for Visual Website Optimizer.

Now, many others are available, but these are especially common and when you get the

data you are going to use statistical hypothesis testing to compare the differences or really

the software does it for you automatically. But you may want to adjust the parameters

because most software packages cut off testing a little too soon and the information is not

quite as reliable as it should be.
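
If you are curious what that hypothesis test might look like, here is a minimal sketch of one common choice, a two-proportion z-test on conversion rates, written with only the Python standard library. The visitor and conversion counts are invented, and I am not claiming this is what Optimizely or VWO actually run; it is just one standard way to compare two rates.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return (z statistic, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical results: 5,000 visitors per version.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

One way to guard against stopping too soon is to decide on the sample size before the test starts and only run a comparison like this once you have reached it.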

But, in sum, here is what we can say about A/B testing.

It is a version of website experimentation; it is done online, which makes it really easy

to get a lot of data very quickly. It allows you to optimize the design of your website

for whatever outcome is important to you. And it can be done as a series of continual

assessments, testing, and development to make sure that you're accomplishing what you want

to as effectively as possible for as many people as possible.

The very last thing I

want to talk about in terms of data sourcing is to talk about the next steps. And probably

the most important thing is, you know, don't just sit there. I want you to go and see what

you already have. Try to explore some open data sources. And if it helps, check with

a few data vendors. And if those don't give you what you need to do your project, then

consider making new data. Again, the idea here is get what you need and get going. Thanks

for joining me and good luck on your own projects. Welcome to "Coding in Data Science". I'm Bart

Poulson and what we are going to do in this series of videos is we're going to take a

little look at the tools of Data Science. So, I am inviting you to know your tools,

but probably even more important than that is to know their proper place. Now, I mention

that because a lot of the times when people talk about data tools, they talk about it

as though that were the same thing as data science, as though they were the same set.

But, I think if you look at it for just a second that is not really the case.

Data tools

are simply one element of data science because data science is made up of a lot more than

the tools that you use. It includes things like, business knowledge, it includes the

meaning making and interpretation, it includes social factors and so there's much more than

just the tools involved. That being said, you will need at least a few tools and so

we're going to talk about some of the things that you can use in data science if it works

well for you. In terms of getting started, the basic things. #1 is spreadsheets, it is

the universal data tool and I'll talk about how they play an important role in data science.

#2 is a visualization program called Tableau, there is Tableau public, which is free, and

there's Tableau desktop and there is also something called Tableau server.

Tableau is

a fabulous program for data visualization and I'm convinced for most people provides

the great majority of what they need. And though while it is not a tool, I do need to

talk about the formats used in web data because, you have to be able to navigate that when

doing a lot of data science work. Then we can talk about some of the essential tools

for data science. Those include the programming language R, which is specifically for data,

there's the general purpose programming language Python, which has been well adapted to data.

And there's the database language SQL, often pronounced "sequel," for Structured Query Language. Then if

you want to go beyond that, there are some other things that you can do. There are the

general purpose programming languages C, C++, and Java, which are very frequently used to

form the foundation of data science and sort of high level production code is going to

rely on those as well.

There's the command line interface language Bash, which is very

common, a very quick tool for manipulating data. And then there's the, sort of wild card

supercharged regular expressions or Regex. We'll talk about all of these in separate

courses. But, as you consider all the tools that you can use, don't forget the 80/20 rule.

Also known as the Pareto Principle. And the idea here is that you are going to get a lot

of bang for your buck out of a small number of things. And I'm going to show you a little

sample graph here. Imagine that you have ten different tools and we'll call them A through

J. A does a lot for you, B does a little bit less and it kind of tapers down to, you have

got a bunch of tools that do just a little of the stuff that you need.

Now, instead of looking

at the individual effectiveness, look at the cumulative effectiveness. How much are you

able to accomplish with a combination of tools? Well, the first one's right here at 60%, where

the tools started and then you add on the 20% from B and it goes up and then you add

on C and D and you add on smaller and smaller pieces and by the time you get to

the end, you have got 100% of effectiveness from your ten tools combined.
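
Here is that cumulative calculation as a short Python sketch. The 60% for tool A and the 20% for tool B come from the made-up example; the shares for the other eight tools are numbers I picked just so everything adds up to 100%.

```python
from itertools import accumulate

# Ten tools, labeled A through J here.
tools = "ABCDEFGHIJ"
# Individual effectiveness in percent: A and B as in the example, the rest invented.
effectiveness = [60, 20, 6, 4, 3, 2, 2, 1, 1, 1]

# Running total: how much you can accomplish with the first k tools combined.
cumulative = list(accumulate(effectiveness))

for tool, share, total in zip(tools, effectiveness, cumulative):
    print(f"{tool}: {share:>3}%  cumulative {total:>3}%")
```

The running total is what the cumulative chart is plotting, one step per tool.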

The important

thing about this is, you only have to go to the 2nd tool, that is two out of ten, that's

B, that's 20% of your tools and in this made up example, you have got 80% of your output.

So, 80% of the output from 20% of the tools, that's a fictional example of the Pareto Principle,

but I find in real life it tends to work something approximately like that. And so, you don't

necessarily have to learn everything and you don't have to learn how to do everything in

everything. Instead you want to focus on the tools that will be most productive and specifically

most productive for you. So, in sum, let's say these three things. Number 1, coding is

simply the ability to manipulate data with programs and computers. Coding is important,

but data science is much greater than the collection of tools that's used in it.

And

then finally, as you're trying to decide what tools to use and what you need to learn and

how to work, remember the 80/20 rule: you are going to get a lot of bang from a small set of tools.

So, focus on the things that are going to be most useful for you in conducting your

own data science projects. As we begin our discussion of Coding and Data Science, I actually

want to begin with something that's not coding. I want to talk about applications or programs

that are already created that allow you to manipulate data. And we are going to begin

with the most basic of these, spreadsheets. We're going to do the rows and columns and

cells of Excel.

And the reason for this is you need spreadsheets. Now, you may be saying

to yourself, "no no no not me, because you know what I'm fancy, I'm working in my big

set of servers, I've got fancy things going on." But, you know what, you too fancy people,

you need spreadsheets as well. There's a few reasons for this. Most importantly, spreadsheets

can be the right tool for data science in a lot of circumstances; there are a few reasons

for that. Number one, spreadsheets, they're everywhere, they're ubiquitous, they're installed

on a billion machines around the world and everybody uses them. They probably have more

data sets in spreadsheets than anything else, and so it's a very common format. Importantly,

it's probably your client's format; a lot of your clients are going to be using spreadsheets

for their own data. I've worked with billion dollar companies that keep all of their data

in spreadsheets. So, when you're working with them, you need to know how to manipulate that

and how to work with it.

Also, regardless of what you're doing, spreadsheets, and specifically

CSV (comma-separated values) files, are sort of the lingua franca or the universal interchange

format for data transfer, to allow you to take it from one program to another. And then,

truthfully, in a lot of situations they're really easy to use. And if you want a second

opinion on this, let's take a look at this ranking. There's a survey of data mining experts,

it's the KDnuggets data mining poll, and these are the tools they most use in their own work.

And look at this: lowly Excel is fifth on the list, and in fact, what's interesting

about it is it's above Hadoop and Spark, two of the major big data fancy tools. And so,

Excel really does have pride of place in a toolkit for a data analyst. Now, since we're

going to sort of the low tech end of things, let's talk about some of the things you can

do with a spreadsheet.

Number one, they are really good for data browsing. You really

get to see all of the data in front of you, which isn't true if you are doing something

like R or Python. They're really good for sorting data, sort by this column then this

column then this column. They're really good for rearranging columns and cells and moving

things around. They're good for finding and replacing and seeing what happens so you know

that it worked right.

Some more uses they're really good for formatting, especially conditional

formatting. They're good for transposing data, switching the rows and the columns, they make

that really easy. They're good for tracking changes. Now it's true if you're a big fancy

data scientist you're probably using GitHub, but for everybody else in the world spreadsheets

and the tracking changes is a wonderful way to do it. You can make pivot tables, that

allows you to explore the data in a very hands-on way, in a very intuitive way. And they're

also really good for arranging the output for consumption. Now, when you're working

with spreadsheets, however, there's one thing you need to be aware of: they are really flexible,

but that flexibility can be a problem in that when you are working in data science, you

specifically want to be concerned about something called Tidy Data.

That's a term I borrowed

from Hadley Wickham, a very well-known developer in the R world. Tidy Data is for transferring

data and making it work well. There's a few rules here that undo some of the flexibility

inherent in spreadsheets. Number one, what you want to do is have a column be equivalent

to the same thing as a variable; columns, variables, they are the same thing. And then,

rows are equal – exactly the same thing as cases. That you have one sheet per file, and

that you have one level of measurement, say, individual, then organization, then state

per file.

Again, this is undoing some of the flexibility that's inherent in spreadsheets,

but it makes it really easy to move the data from one program to another. Let me show you

how all this works. You can try this in Excel. If you have downloaded the files for this

course, we simply want to open up this spreadsheet. Let me go to Excel and show you how it works.

So, when you open up this spreadsheet, what you get is totally fictional data here that

I made up, but it is showing sales over time of several products at two locations, like

if you're selling stuff at a baseball field. And this is the way spreadsheets often appear;

we've got blank rows and columns, we've got stuff arranged in a way that makes it easy

for the person to process it.

And we have got totals here, with formulas putting them

all together. And that's fine, that works well for the person who made it. And then,

that's for one month and then we have another month right here and then we have another

month right here and then we combine them all for first quarter of 2014. We have got

some headers here, we've got some conditional formatting and changes and if we come to the

bottom, we have got a very busy line graphic that eventually loads; it's not a good graphic,

by the way. But, similar to what you will often find. So, this is the stuff that, while

it may be useful for the client's own personal use, you can't feed this into R or Python,

it will just choke and it won't know what to do with it. And so, you need to go through

a process of tidying up the data. And what this involves is undoing some of the stuff.

So, for instance, here's data that is almost tidy. Here we have a single column for date,

a single column for the day, a column for the site, so we have two locations A and B,

and then we have six columns for the six different things that are sold and how many were sold

on each day.

Now, in certain situations, you would want the data laid out exactly like

this if you are doing, for instance, a time series, you will do something vaguely similar

to this. But, for true tidy stuff, we are going to collapse it even further. Let me

come here to the tidy data. And now what I have done is, I have created a new column

that says what is the item being sold. And so, by the way, what this means is that we

have got a really long data set now, it has got over a thousand rows. Come back up to

the top here. But, what that shows you is that now it's in a format that's really easy

to import from one program to another, that makes it tidy and you can re-manipulate it

however you want once you get to each of those.
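
If you would rather do that last step in code than by hand, here is a minimal sketch of reshaping the "almost tidy" layout into the fully tidy one using Python's csv module. The column names and the two sample rows are invented for illustration, with two product columns instead of six to keep it short; they are not the values in the course file.

```python
import csv
import io

# Hypothetical "almost tidy" data: one column per product sold.
WIDE_CSV = """date,day,site,hotdogs,soda
2014-01-01,Wed,A,12,30
2014-01-01,Wed,B,8,22
"""

# Columns that identify a row; every other column is treated as a product.
ID_COLUMNS = ["date", "day", "site"]


def wide_to_long(wide_text, id_columns):
    """Turn one column per item into one row per date, site, and item."""
    reader = csv.DictReader(io.StringIO(wide_text))
    long_rows = []
    for row in reader:
        for column, value in row.items():
            if column not in id_columns:
                record = {key: row[key] for key in id_columns}
                record["item"] = column
                record["sold"] = int(value)
                long_rows.append(record)
    return long_rows


long_rows = wide_to_long(WIDE_CSV, ID_COLUMNS)
for record in long_rows:
    print(record)
```

Each original row turns into one row per product, which is why the tidy version of the data gets so much longer.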

So, let's sum up our little presentation here,

in a few lines. Number one, no matter who you are, no matter what you are doing in data

science you need spreadsheets. And the reason for that is that spreadsheets are often the

right tool for data science. Keep one thing in mind though, that is as you are moving

back and forth from one language to another, tidy data or well-formatted data is going

to be important for exporting data into your analytical program or language of choice.

As we move through "Coding and Data Science," and specifically the applications that can

be used, there's one that stands out for me more than almost anything else, and that's

Tableau and Tableau Public.

Now, if you are not familiar with these, these are visualization

programs. The idea here is that when you have data, the most important thing you can do

is to first look and see what you have and work with it from there. And in fact, I'm

convinced that for many organizations Tableau might be all that they really need. It will

give them the level of insight that they need to work constructively with data. So, let's

take a quick look by going to tableau.com. Now, there are a few different versions of

Tableau. Right here we have Tableau Desktop and Tableau Server, and these are the paid

versions of Tableau. They actually cost a lot of money, unless you work for a nonprofit

organization, in which case you can get them for free. Which is a beautiful thing. What

we're usually looking for, however, is not the paid version, but we are looking for something

called Tableau Public.

And if you come in here and go to products, we have got these

three paid ones, and over here is Tableau Public. We click on that, and it brings us to this page.

It is public.tableau.com. And this is the one that has what we want, it's the free version

of Tableau with one major caveat: you don't save files locally to your computer, which

is why I didn't give you a file to open. Instead, it saves them to the web in a public form.

So, if you are willing to trade privacy, you can get an immensely powerful application

for data visualization. That's a catch for a lot of people, which is why people are willing

to pay a lot of money for the desktop version.

And again, if you work for a nonprofit you

can get the desktop version for free. But, I am going to show you how things work in

Tableau Public. So, that's something that you can work with personally. The first thing

you want to do is, you want to download it. And so, you put in your email address, you

download; it is going to detect which platform you are on. It is a pretty big download. And once

it is downloaded, you can install and open up the application. And here I am in Tableau

Public, right here, this is the blank version. By the way, you also need to create an account

with Tableau in order to save your stuff online to see it. I will show you what that looks

like. But, you are presented with a blank thing right here and the first thing you need

to do is, you need to bring in some data. I'm going to bring in an Excel file.

Now,

if you downloaded the files for the course, you will see that there is this one right

here, DS03_2_2_TableauPublic.excel.xlsx. In fact, it is the one that I used in talking

about spreadsheets in the first video in this course. I'm going to select that one and I'm

going to open it. And a lot of programs don't like bringing in Excel because it's got all

the worksheets and all the weirdness in it. This one works better with it, but what I'm

going to do is, I am going to take the tidy data. By the way, you see that it put them

in alphabetical order here. I'm going to take tidy data and I'm going to drag it over to

let it know that it's the one that I want.

And now what it does is it shows me a version

of the data set along with things that you can do here. You can rename it, I like that

you can create bin groups, there's a lot of things that you can do here. I'm going to

do something very, very quick with this particular one. Now, I've got the data set right here,

what I'm going to do now is I'm going to go to a worksheet. That's where you actually

create stuff. Cancel that and go to worksheet one. Okay. This is a drag and drop interface.

And so what we are going to do is, we are going to pull the bits and pieces of information

we want to make graphics. There's immense flexibility here. I'm going to show you two

very basic ones.

I'm going to look at the sales of my fictional ballpark items. So,

I'm going to grab sales right here and I'm going to put that as the field that we are

going to measure. Okay. And you see, put it down right here and this is our total sales.

We're going to break it down by item and by time. So, let me take item right here, and

you can drag it over here, or I can put it right up here into rows. Those will be my

rows and that will be how many we have sold total of each of the items. Fine, that's really

easy. And then, let's take date and we will put that here in columns to spread it across.

Now, by default it is doing it by year, I don't want to do that, I want to have three

months of data. So, what I can do is, I can click right here and I can choose a different

time frame. I can go to quarter, but that's not going to help because I only have one

quarter's worth of data, that's three months.

I'm going to come down to week. Actually,

let me go to day. If I do day, you see it gets enormously complicated, so that's no

good. So, I'm going to back up to week. And I've got a lot of numbers there, but what

I want is a graph. And so, to get that, I'm going to come over here and click on this

and tell it that I want a graph. And so, we're seeing the information, except it lost items.

So, I'm going to bring item and put it back up into this graph to say this is a row for

the data. And now I've got rows for sales by week for each of my items. That's great.

I want to break it down one more by putting in the site, the place that it sold. So, I'm

going to grab that and I'm going to put it right over here. And now you see I've got

it broken down by the item that is sold and the different sites.
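
The same breakdown, sales by week for each item and site, can also be sketched in code. This is only a rough Python equivalent of what the drag and drop steps produce, not how Tableau works internally, and the rows below are hypothetical rather than the course data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical tidy rows: (date, site, item, sales).
rows = [
    (date(2016, 4, 4), "A", "hotdogs", 12),
    (date(2016, 4, 5), "A", "hotdogs", 9),
    (date(2016, 4, 5), "B", "hotdogs", 7),
    (date(2016, 4, 12), "A", "sodas", 30),
]

# Add up sales for every (item, site, week) combination.
totals = defaultdict(int)
for day, site, item, sales in rows:
    week = day.isocalendar()[1]  # ISO week number of the year
    totals[(item, site, week)] += sales

for key in sorted(totals):
    print(key, totals[key])
```

Grouping by week instead of by day is what keeps the output readable, which is the same reason for backing up from day to week in the worksheet.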

I'm going to color the

sites, and all I've got to do to do that is, I'm going to grab site and drag it onto color.

Now, I've got two different colors for my sites. And this makes it a lot easier to tell

what is going on. And in fact, there is some other cool stuff you can do. One of the things

I'm going to do is come over here to analytics and I can tell it to put an average line through

everything, so I'll just drag this over here. Now we have the average for each line. That's

good. And I can even do forecasting. Let me get a little bit of a forecast right here.

I will drag this on and if you can go over here.

I will get this out of the way for a

second. Now, I have a forecast for the next few weeks, and that's a really convenient,

quick, and easy thing. And again, for some organizations that might be all that they

really need. And so, what I'm showing you here is the absolute basic operation of Tableau,

which allows you to do an incredible range of visualizations and manipulate the data

and create interactive dashboards. There's so much to it and we'll show that in another

course, but for right now I want to show you one last thing about Tableau Public, and that

is saving the files.

So now, when I come here and save it, it's going to ask me to sign

into Tableau Public. Now, I sign in and it asks me how I want to save this, same name

as the video. There we go, and I'm going to hit save. And then that opens up a web browser,

and since I'm already logged into my account, see here's my account and my profile. Here's

the page that I created. And it's got everything that I need there; I'm going to edit just

a few details. I'm going to say, for instance, I'm going to leave its name just like that.

I can put more of a description in there if I wanted. I can allow people to download the

workbook and its data; I'm going to leave that there so you can download it if you need

to. If I had more than one tab, I would do this thing that says show the different sheets

as tabs.

Hit save. And there's my data set and also it's published online and people

can now find it. And so what you have here is an incredible tool for creating interactive

visualizations; you can create them with drop-down menus, and you can rearrange things, and you

can make an entire dashboard. It's a fabulous way of presenting information, and as I said

before, I think that for some organizations this may be as much as they need to get really

good, useful information out of their data. And so I strongly recommend that you take

some time to explore with Tableau, either the paid desktop version or the public version

and see what you can do to get some really compelling and insightful visualizations out

of your work in data science.

For many people, their first experience of "Coding and Data

Science" is with the application SPSS. Now, I think of SPSS and the first thing that comes

to my mind is sort of life in the Ivory tower, though this looks more like Harry Potter.

But, if you think about it the package name SPSS comes from Statistical Package for the

Social Sciences. Although, if you ask IBM about it now, they act like it doesn't stand

for anything. But, it has its background in social science research which is generally

academic. And truthfully, I'm a social psychologist and that's where I first learned how to use

SPSS. But, let's take a quick look at their webpage ibm.com/spss. If you type that in,

that will just be an alias that will take you to IBM's main webpage. Now, IBM didn't

create SPSS, but they bought it around version 16, and it was very briefly known as PASW,

Predictive Analytics SoftWare. That only lasted briefly and now it's back to SPSS, which is

where it's been for a long time. SPSS is a desktop program; it's pretty big, it does

a lot of things, it's very powerful, and is used in a lot of academic research.

It's also

used in a lot of business consulting, management, even some medical research. And the thing

about SPSS, is it looks like a spreadsheet but has drop-down menus to make your life

a little bit easier compared to some of the programming languages that you can use. Now,

you can get a free temporary version, if you're a student you can get a cheap version, otherwise

SPSS costs a lot of money. But, if you have it one way or another, when you open it up

this is what it is going to look like. I'm showing SPSS version 22, now it's currently

on 24. And the thing about SPSS versioning is, in anything other than software packaging,

these would be point updates, so I sort of feel like we should be on 17.3, as opposed

to 23 or 24. Because the variations are so small that anything you learn from the early

ones, is going to work on the later ones and there is a lot of backwards and forwards compatibility,

so I'd almost say that this one, the version I have practically doesn't matter. You get

this little welcome splash screen, and if you don't want to see it anymore you can get

rid of it.

I'm just going to hit cancel here. And this is our main interface. It looks a

lot like a spreadsheet, the difference is, you have a separate pane for looking at variable

information and then you have separate windows for output and then an optional one for something

called Syntax. But, let me show you how this works by first opening up a data set. SPSS

has a lot of sample data sets in them, but they are not easy to get to and they are really

well hidden. On my Mac, for instance, let me go to where they are. In my mac I go to

the finder, I have to go to Mac, to applications, to the folder IBM, to SPSS, to statistics,

to 22 the version number, to samples, then I have to say I want the ones that are in

English, and then it brings them up.

The .sav files are the actual data files, there are

different kinds in here, so .sav is a different kind of file and then we have a different

one about planning analyses. So, there are versions of it. I'm going to open up a file

here called "market values .sav," a small data set in SPSS format. And if you don't

have that, you can open up something else; it really doesn't matter for now. By the way,

in case you haven't noticed, SPSS tends to be really really slow when it opens.

It also,

despite being version 24, it tends to be kind of buggy and crashes. So, when you work with

SPSS, you want to get in the habit of saving your work constantly. And also, being patient

when it is time to open the program. So, here is a data set that just shows addresses and

house values, and square feet for information. This, I don't even know if this is real information,

it looks artificial to me. But, SPSS lets you do point and click analyses, which is

unusual for a lot of things. So, I am going to come up here and I am going to say, for

instance, make a graph.

I'm going to use what is called a legacy

dialogue to get a histogram of house prices. So, I simply click values. Put that right

there and I will put a normal curve on top of it and click ok. This is going to open

up a new window, and it opened up a microscopic version of it, so I'm going to make that bigger.

This is the output window, this is a separate window and it has a navigation pane here on

the side. It tells me where the data came from, and it saves the command here, and then,

you know, there's my default histogram. So, we see most of the houses were right around

$125,000, and then they went up to at least $400,000. I have a mean of $256,000, a standard

deviation of about $80,000, and then there are 94 houses in the data set.
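
The three numbers reported next to the histogram (the count, the mean, and the standard deviation) are easy to reproduce outside SPSS. Here is a minimal sketch using Python's standard library; the house values are hypothetical and are not the SPSS sample data.

```python
import statistics

# Hypothetical house values in dollars; not the SPSS sample data.
values = [125_000, 180_000, 240_000, 310_000, 400_000]

n = len(values)
mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

print(f"N = {n}, mean = {mean:,.0f}, SD = {sd:,.0f}")
```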

Fine, that's

great. The other thing I can do is, if I want to do some analyses, let me go back to the

data just for a moment. For instance, I can come here to analyze and I can do descriptive

and I'm actually going to do one here called Explore. And I'll take the purchase price

and I'll put it right here and I'm going to get a whole bunch just by default. I'm going

to hit ok. And it goes back to the output window. Once again made it tiny. And so, now

you see beneath my chart I now have a table and I've got a bunch of information. A stem

and leaf plot, and a box plot too, a great way of checking for outliers. And so this

is a really convenient way to save things.
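
A box plot checks for outliers with a simple rule that can be sketched in a few lines. The version below assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; SPSS may compute its quartiles slightly differently, and the prices are hypothetical.

```python
import statistics

# Hypothetical purchase prices in dollars, with one unusually high value.
prices = [125_000, 150_000, 180_000, 210_000, 240_000, 260_000, 310_000, 900_000]

q1, median, q3 = statistics.quantiles(prices, n=4)  # the three quartile cut points
iqr = q3 - q1

# Common box plot convention: flag points beyond 1.5 * IQR from the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < low_fence or p > high_fence]

print("quartiles:", q1, median, q3)
print("outliers:", outliers)
```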

You can export this information as images,

you can export the entire file as an HTML, you can do it as a pdf or a PowerPoint. There's

a lot of options here and you can customize everything that's on here. Now, I just want

to show you one more thing that makes your life so much easier in SPSS. You see right

here that it's putting down these commands, it's actually saying graph, and then histogram,

and normal equals value. And then down here, we've got this little command right here.

Most people don't know how to save their work in SPSS, and that's something you kind of

just have to do it over again every time, but there's a very simple way to do this.

What I'm going to do is, I'm going to open up something called a Syntax file. I'm going

to go to new, Syntax. And this is just a blank window that's a programming window, it's for

saving code. And let me go back to my analysis I did a moment ago.

I'll go back to analyze

and I can still get at it right here. Descriptives and explore, my information is still there.

And what happens here is, even though I set it up with drop-down menus and point and click,

if I do this thing, paste, then what it does is, it takes the code that creates that command

and it saves it to this syntax window. And this is just a text file. It saves it as .sps,

but it is a text file that can be opened in anything. And what's beautiful about this

is, it is really easy to copy and paste, and you can even take this into Word and do a

find and replace on it, and it's really easy to replicate the analyses. And so for me,

SPSS is a good program. But, until you use Syntax you don't know the true power of it

and it makes your life so much easier as a way of operating it. Anyhow, this is my extremely

brief introduction to SPSS. All I want to say is that it is a very common program, kind

of looks like a spreadsheet, but it gives you a lot more power and options and you can

use both drop-down menus and text-based Syntax commands as well to automate your work and

make it easier to replicate it in the future.
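
Because a Syntax file is plain text, the find and replace idea can also be done with a few lines of code. The sketch below is in Python and makes several assumptions: the pasted command is written the way I believe SPSS pastes a legacy histogram with a normal curve, the variable name value comes from the narration, and the names sqft and price are hypothetical stand-ins for other variables in your own data.

```python
# A pasted command, roughly as described in the narration ("graph, histogram, normal equals value").
pasted_syntax = "GRAPH\n  /HISTOGRAM(NORMAL)=value."

# Hypothetical variable names; substitute the ones in your own data set.
other_variables = ["sqft", "price"]

# Repeat the same command for each variable by replacing only the variable name.
commands = [pasted_syntax]
for name in other_variables:
    commands.append(pasted_syntax.replace("=value.", f"={name}."))

syntax_file_text = "\n\n".join(commands) + "\n"
print(syntax_file_text)
```

The resulting text could be pasted back into a Syntax window and run, which is what makes the analysis easy to repeat.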

I want to take a look at one more application

for "Coding and Data Science", that's called JASP. This is a new application, not very

familiar to a lot of people and still in beta, but with an amazing promise. You can basically

think of it as a free version of SPSS and you know what, we love free. But, JASP is

not just free, it's also open source, and it's intuitive, and it makes analyses replicable,

and it even includes Bayesian approaches. So, take that all together, you know, we're

pretty happy and we're jumping for joy. So, before we move on, you just may be asking

yourself, JASP, what is that? Well, the creator has emphatically denied that it stands for

Just Another Statistics Program, but be that as it may, we will just go ahead and call

it JASP and use it very happily. You can get to it by going to jasp-stats.org. And let's

take a look at that right now.

JASP is a new program, they say a low fat alternative to

SPSS, but it is a really wonderful great way of doing statistics. You're going to want

to download it, by supplying your platform; it even comes in Linux format, which is beautiful.

And again, it's beta so stay posted, things are updating regularly. If you're on Mac,

you're going to need to use Xquartz, that's an easy thing to install and it makes a lot

of things work better. And it's the wonderful way to do analyses.

When you open up JASP,

it's going to look like this. It's a pretty blank interface, but it's really easy to get

going with it. So for instance, you can come over here to file and you can even choose

some example data sets. So for instance, here's one called Big 5 that's personality factors.

And you've got data here that's really easy to work with.

Let me scroll this over here

for a moment. So, there's our five variables and let's do some quick analyses with these.

Say for instance, we want to get descriptives; we can pick a few variables. Now, if you're

familiar with SPSS, the layout feels very much the same and the output looks a lot the

same. You know, all I have to do is select what I want and it immediately pops up over

here. Then I can choose additional statistics, I can get quartiles, I can get the median.

And you can choose plots; let's get some plots, all you have to do is click on it and they

show up. And that's a really beautiful thing and you can modify these things a little bit,

so for instance, I can take the plot points. Let's see if I can drag that down and if I

make it small enough I can see the five plots, I went a little too far on that one. Anyhow,

you can do a lot of things here. And I can hide this, I can collapse that and I can go

on and do other analyses.

Now, what's really neat though is when I navigate away, so I

just clicked in a blank area of the results page, we are back to the data here. But if

I click on one of these tables, like this one right here, it immediately brings up the

commands that produced it and I can just modify it some more if I want. Say I want skewness

and kurtosis, boom they are in there. It is an amazing thing and then I can come back

out here, I can click away from that and I can come down to the plots expand those and

if I click on that it brings up the commands that made them.

It's an amazingly easy and

intuitive way to do things. Now, there's another really nice thing about JASP and that is that

you can share the information online really well through a service called OSF. That

stands for the Open Science Framework, and its web address is osf.io. So, let's take a quick

look at what that's like. Here's the open science framework website and it's a wonderful

service, it's free and it's designed to support open, transparent, accessible, accountable,

collaborative research and I really can't say enough nice things about it. What's neat

about this is once you sign up for OSF you can create your own area and I've got one

of my own, I will go to that now. So, for instance, here's the datalab page in open

science framework. And what I've done is I created a version of this JASP analysis and

I've saved it here, in fact, let's open up my JASP analysis in JASP and I'll show you

what it looks like in osf.

So, let's first go back to JASP. When we're here we can come

over to file and click computer and I just saved this file to the desktop. Click on desktop,

and you should have been able to download this with all the other files, DS03_2_4_JASP,

double click on that to open it and now it's going to open up a new window and you see

I was working with the same data set, but I did a lot more analyses. I've got these

graphs; I have correlations and scatter plots. Come down here, I did a linear regression.

And we just click on that and you can see the commands that produce it as well as the

options. I didn't do anything special for that, but I did do some confidence intervals

and specified that and it's really a great way to work with all this. I'll click back

in an empty area and you see the commands go away and so I've got my output here in

JASP, but when I saved it though, I had the option of saving it to OSF, in fact if you

go to this webpage osf.io/3t2jg you'll actually be able to go to the page where you can see

and download the analyses that I conducted, let's take a look.

This is that page, there's

the address I just barely gave you and what you see here is the same analysis that I conducted,

it's all right here, so if you're collaborating with people or if you want to show things

to people, this is a wonderful way to do it. Everything is right there, this is a static

image, but up at the top people have the option of downloading the original file and working

with it on their own. In case you can't tell, I'm really enthusiastic about JASP and about

its potential, still in beta, still growing rapidly. I see it really as an open source

free and collaborative replacement to SPSS and I think it is going to make data science

work so much easier for so many people. I strongly recommend you give JASP a close look.

Let's finish up our discussion of "Coding and Data Science" the applications part of

it by just briefly looking at some other software choices.

And I'll have to admit it gets kind

of overwhelming because there are just so many choices. Now, in addition to the spreadsheets,

and Tableau, and SPSS, and JASP, that we have already talked about, there's so much more

than that. I'm going to give you a range of things that I'm aware of and I'm sure I've

left out some important ones or things that other people like really well, but these are

some common choices and some less common, but interesting ones. Number one, in terms

of ones that I haven't mentioned is SAS. SAS is an extremely common analytical program,

very powerful, used for a lot of things. It's actually the first program that I learned

and on the other hand it can be kind of hard to use and it can be expensive, but there's

a couple of interesting alternatives. SAS also has something called the SAS University

Edition, if you're a student this is free and it's slightly reduced in what it does,

but the important fact is that it's free.

And also it runs in a virtual machine which makes it an enormous

download, but it's a good way to learn SAS if it's something that you want to do. SAS

also makes a program that I would really love were it not so extraordinarily expensive, and that

is called JMP, and it's visualization software. Think a little bit of Tableau, how we saw

it, you work with it visually and this one you can drag things around, it's really wonderful

program.

I personally find it prohibitively expensive. Another very common choice among

working analysts is Stata and some people use Minitab. Now, for mathematical people,

there's MATLAB and then of course there's Mathematica itself, but it is really more

of a language than a program. On the other hand, Wolfram, who makes Mathematica, is also

the company that gives us Wolfram Alpha. Most people don't think of this as a stats application

because you can run it on your iPhone. But, Wolfram Alpha is incredibly capable, and

especially if you pay for the pro account, you can do amazing things in this, including

analyses, regression models, visualizations and so it's worth taking a little closer look

at that. Also, because it provides a lot of the data that you need so Wolfram Alpha is

an interesting one. Now, several applications that are more specifically geared towards

data mining, so you don't want to do your regular, you know, little t tests and stuff

on these.

But, there's RapidMiner and there's KNIME and Orange and those are all really

nice to use because they are visual languages where you drag nodes onto a screen and you

connect them with lines and you can see how things run through. All three of them are

free or have free versions and all three of them work in pretty similar manners. There's

also BigML, which is for machine learning and this is unusual because it's browser based,

it runs on their servers. There's a free version, though you can't download a whole lot, it

doesn't cost a lot to use BigML and it's a very friendly, very accessible program. Then

in terms of programs you can actually install for free on your own computer, there's one

called SOFA Statistics, which stands for Statistics Open For All; it's kind of a cheesy title,

but it's a good program.

And then one with a web page straight out of 1990 is Past 3,

this is paleontological software, but on the other hand it does do very general stuff, it runs on

many platforms and it's a really powerful thing and it's free, but it is relatively

unknown. And then speaking of relatively unknown, one that's near and dear to my heart is a

web application called Statcrunch, it costs, but it costs like $6 or $12 a year, it's really

cheap and it's very good, especially for basic statistics and for learning; I used it

in some of the classes that I was teaching. And then if you're deeply wedded to Excel

and you can't stand to leave that environment, you can purchase add-ons like XLSTAT, which

give you a lot of statistical functions within the Excel environment itself.

That's a lot

of choices and the most important thing here is don't get overwhelmed. There's a lot of

choices, but you don't even have to try all of them. Really the important question is

what works best for you and the project that you're working on? Here's a few things you

want to consider in that regard. First off is functionality, does it actually do what

you want or does it even run on your machine? You don't need everything that a program can

do.

When you think about the stuff Excel can do, people probably use five percent of what's

available. Second is ease of use. Some of these programs are a lot easier to use than

the others and I personally find that the ones that are easier to use, I like them,

so you might say, "No, I need to program because I need custom stuff". But I'm willing to bet

that 95% of what people do does not require anything custom.

Also, the existence of a

community. Constantly when you're working you come across problems and don't know how

to solve them, and being able to get online and do a search for an answer, and have enough

of a community that there are people there who have put answers up and discussed these

things, is wonderful. Some of these programs have very substantial communities,

and for some of them it is practically nonexistent, and it is up to you to decide how important it

is to you. And then finally of course there is the issue of cost. Many of these programs

I mentioned are free, some of them are very cheap, some of them run some sort of freemium

model and some of them are extremely expensive. So, you don't buy them unless somebody else

is paying for it. So, these are some of the things that you want to keep in mind when

you're trying to look at various programs. Also, let's mention this; don't forget the

80/20 rule. You're going to be able to do most of the stuff that you need to do with

only a small number of tools, one or two, maybe three, will probably be all that you

ever need.

So, you don't need to explore the range of every possible tool. Find something

that you need, find something you're comfortable with and really try to extract as much value

as you can out of that. So, in sum, in our discussion of available applications for coding

and data science. First remember applications are tools, they don't drive you, you use them.

And that your goals are what drive the choice of your applications and the way that you

do it. And the single most important thing is to remember, what works for you, may work

well for somebody else, if you're not comfortable with it, if it's not the questions you address,

then it's more important to think about what works for you and the projects that you're

working on as you make your own choices for tools, for working in data science.

When you're

"Coding in Data Science," one of the most important things you can do is be able to

work with web data. And if you work with web data you're going to be working with HTML.

And in case you're not familiar with it, HTML is what makes the World Wide Web go ‘round.

What it stands for is HyperText Markup Language – and if you've never dealt with web pages

before, here's a little secret: web pages are just text. It is just a text document,

but it uses tags to define the structure of the document and a web browser knows what

those tags are and it displays them the right way. So, for instance, some of the tags, they

look like this. They are in angle brackets, and you have an angle bracket and then the

beginning tag, so body, and then you have the body, the main part of your text, and

then you have, in angle brackets, slash body to let the computer know that you are

done with that part.

You also have p and slash p for paragraphs. H1 is for header one, and

you put the text in between those tags. TD is for table data, or the cell in a table, and you

mark it off that way. If you want to see what it looks like just go to this document: DS03_3_1_HTML.txt.

I'm going to go to that one right now. Now, depending on what text editor you open this

up, it may actually give you the web preview. I've opened it up in TextMate and so it actually

is showing the text the way I typed it. I typed this manually; I just typed it all in

there. And I have the html tag to say what the document is, I have an empty head, but that sort

of needs to be there.

Then I say what the body is, and then I have some text. li is

for list items, I have headers, this is for a link to a webpage, then I have a small table.

And if you want to see what this looks like when displayed as a web page, just go up here

to window and show web preview. This is the same document, but now it is in a browser

and that's how you make a web page. Now, I know this is very fundamental stuff, but the

reason this is important is because if you're going to be extracting data from the web,

you have to understand how that information is encoded in the web, and it is going to

be in HTML most of the time for a regular web page. Now, I will mention something that,

there's another thing called CSS.

Web pages use CSS to define the appearance of a document.

HTML is theoretically there to give the content and CSS gives the appearance. CSS stands

for Cascading Style Sheets. I'm not going to worry about that right now because we're

really interested in the content. And now you have the key to being able to read web

pages and pull data from web pages for your data science project. So, in sum; first, the

web runs on HTML and that's what makes the web pages that are there. HTML defines the

page structure and the content that is on the page. And you need to learn how to navigate

the tags and the structure in order to get data from the web pages for your data science

projects.
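
Navigating the tags to get data out of a page can be sketched with Python's built-in HTML parser. The page below is a hypothetical stand-in, similar in spirit to the hand-typed example file but not a copy of it, and the class name CellCollector is made up for this illustration. It assumes every td cell sits inside a tr row.

```python
from html.parser import HTMLParser

# A hypothetical page, not the actual course file.
PAGE = """
<html>
<head></head>
<body>
<h1>Sales</h1>
<p>See <a href="https://example.com">this link</a>.</p>
<table>
<tr><td>hotdogs</td><td>12</td></tr>
<tr><td>sodas</td><td>30</td></tr>
</table>
</body>
</html>
"""


class CellCollector(HTMLParser):
    """Collect the text inside every <td> tag, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new table row begins
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")  # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data  # text between <td> and </td>


parser = CellCollector()
parser.feed(PAGE)
print(parser.rows)
```

The parser only reacts to the opening tag, the closing tag, and the text in between, which is all the structure HTML gives you to work with.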

The next step in "Coding and Data Science" when you're working with web data

is to understand a little bit about XML. I like to think of this as the part of web data

that follows the imperative, "Data, define thyself". XML stands for eXtensible Markup

Language, and what it is, is semi-structured data. What that means is that tags define

data so a computer knows what a particular piece of information is. But, unlike HTML,

the tags are free to be defined any way you want. And so you have this enormous flexibility

in there, but you're still able to specify it so the computer can read it. Now, there's

a couple of places where you're going to see XML files. Number one is in web data. HTML

defines the structure of a web page, but if they're feeding data into it, then that will

often come in the form of an XML file.

Interestingly, Microsoft Office files, if you have .docx

or .xlsx, the X-part at the end stands for a version of XML that's used to create these

documents. If you use iTunes, the library information that has all of your artists,

and your genres, and your ratings and stuff, that's all stored in an XML file. And then

finally, data files that often go with particular programs can be saved as XML as a way of representing

the structure of the data to the program.

And for XML, tags use opening and closing

angle brackets just like HTML did. Again, the major difference is that you're free to

define the tags however you want. So for instance, thinking about iTunes, you can define a tag

that's genre, and you have the angle brackets around genre to begin that information, and then

you have the angle brackets with the forward slash to let it know you're done with that piece

of information. Or, you can do it for composer, or you can do it for rating, or you can do

it for comments, and you can create any tags you want and you put the information in between

those two things. Now, let's take an example of how this works. I'm going to show you a

quick dataset that comes from the web. It's the API at ergast.com, and this is a website

that stores information about automobile Formula One racing. Let's go to this webpage and take

a quick look at what it's like.

So, here we are at Ergast.com, and it's the API for Formula

One. And what I'm bringing up is the results of the 1957 season in Formula One racing.

And here you can see who the competitors were in each race, and how they finished and so

on. So, this is a dataset that is being displayed in a web page. If you want to see what it

looks like in XML, all you have to do is type XML onto the end of this: .XML. I've done

that already, so I'm just going to go to that one. And as you see, it's only this bit that

I've added: .XML.

Now, it looks exactly the same because the web page is structuring XML

data by default, but if you want to see what it looks like in its raw format, just

option-click on the web page and go to view page source. At least that's how it works

in Chrome, and this is the structured XML page. And you can see we have tags here. It

says Race Name, Circuit Name, Location, and obviously, these are not standard HTML tags.

They are defined for the purposes of this particular dataset. But we begin with one:

we have Circuit Name right there, and then we close it using the forward slash right there.

And so this is structured data; the computer knows how to read it, which is exactly why

it displays it this way by default.
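
Here is a small sketch, in Python, of what reading that kind of tagged data can look like. The XML below is a made-up fragment: the tag names echo the ones just mentioned (Race Name, Circuit Name, Location), but the values are placeholders and the layout is an assumption, not the actual response from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment with placeholder values.
# The real file from the website is structured differently in its details.
SAMPLE_XML = """
<RaceTable season="1957">
  <Race round="1">
    <RaceName>Example Grand Prix A</RaceName>
    <CircuitName>Example Circuit A</CircuitName>
    <Location>Example City A</Location>
  </Race>
  <Race round="2">
    <RaceName>Example Grand Prix B</RaceName>
    <CircuitName>Example Circuit B</CircuitName>
    <Location>Example City B</Location>
  </Race>
</RaceTable>
"""

root = ET.fromstring(SAMPLE_XML.strip())

races = []
for race in root.findall("Race"):
    races.append({
        "round": race.get("round"),              # metadata stored inside the tag
        "race": race.findtext("RaceName"),       # text between opening and closing tags
        "circuit": race.findtext("CircuitName"),
        "location": race.findtext("Location"),
    })

for row in races:
    print(row)
```

Because every piece of information is labeled by its tag, the program never has to guess which value is the circuit and which is the location.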

So, it's a really good way of displaying data and it's

a good way to know how to pull data from the web. You can actually use what is called an

API, an Application Programming Interface, to access this XML data, and it pulls it in

along with its structure which makes working with it really easy. What's even more interesting

is how easy it is to take XML data and convert it between different formats, because it's

structured and the computer knows what you're dealing with. So for example, one: it's really

easy to convert XML to CSV or comma separated value files (that's the spreadsheet format)

because it knows exactly what the headings are; what piece of information goes in each

column.

Example two: it's really easy to convert HTML documents to XML because you can think

of HTML with its restricted set of tags as sort of a subset of the much freer XML. And

three, you can convert CSV, or your spreadsheet comma separated value, to XML and vice versa.

You can bounce them all back and forth because the structure is made clear to the programs

you're working with. So in sum, here's what we can say. Number one, XML is semi-structured

data. What that means is that it has tags to tell the computer what the piece of information

is, but you can make the tags whatever you want them to be. And, XML is very common for

web data and it's really easy to translate between formats: XML, HTML, CSV, and so forth.

It's really easy to translate them back and forth, which gives you a lot of flexibility

in manipulating data so you can get it into the format you need for your own analysis.
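
As a rough illustration of the XML to CSV conversion mentioned above, here is a sketch in Python. The function name xml_to_csv, the Library and Track tags, and the values are all hypothetical; the assumption is simply that every record has the same child tags, which become the column headings.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records using the kinds of tags mentioned earlier
# (genre, composer, rating). The values are placeholders.
SAMPLE_XML = """<Library>
  <Track><Genre>Jazz</Genre><Composer>Composer A</Composer><Rating>5</Rating></Track>
  <Track><Genre>Folk</Genre><Composer>Composer B</Composer><Rating>4</Rating></Track>
</Library>"""


def xml_to_csv(xml_text):
    """Turn each child of the root into one CSV row; tag names become headings."""
    root = ET.fromstring(xml_text)
    records = [{field.tag: field.text for field in record} for record in root]
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = xml_to_csv(SAMPLE_XML)
print(csv_text)
```

The headings come straight from the tags, which is why the conversion is easy: the structure was already spelled out in the data.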

The last thing

I want to mention about "Coding and Data Science" and web data is something called JSON. And

I like to think of it as a version of smaller is better. Now, what JSON stands for is JavaScript

Object Notation, although JavaScript is supposed to be one word. And what it is, is that like

XML, JSON is semi-structured data. That is, you have tags that define the data, so the

computer knows what each piece of information is, but like XML the tags can vary freely.

And so there's a lot in common between XML and JSON. So XML is a Markup Language (that's

what the ML stands for), and that gives meaning to the text; it lets the computer know what

each piece of information is. Also, XML allows you to make comments in the document, and

it allows you to put metadata in the tags so you can actually put information there

in the angle brackets to provide additional context.

JSON, on the other hand, is specifically

designed for data interchange and so it's got that special focus. And the structure;

JSON corresponds with data structures, you know it directly represents objects and arrays

and numbers and strings and booleans, and that works really well with the programs that

are used to analyze data. Also, JSON is typically shorter than XML because it does not require

the closing tags. Now, there are ways to do that with XML, but that's not typically how

it's done. As a result of these differences, JSON is basically taking XML's place in web

data. XML still exists, it's still used for a lot of things, but JSON is slowly replacing

it. And we'll take a look at the comparison between the three by going back to the example

we used in XML.

This is data about Formula One car races in 1957 from ergast.com. You

can just go to the first web page here, then we will navigate to the others from that.

So this is the general page. This is if you just type in without the .XML or .JSON or

anything. So it's a table of information about races in 1957. And we saw earlier that if

you just add .XML to the end of this, it looks exactly the same.

That's because

this browser is displaying XML properly by default. But, if you were to right click on

it, and go to view page source, you would get this instead, and you can see the structure.

This is still XML, and so everything has an opening tag and a closing tag and some extra

information in there. But, if you type in .JSON what you really get is this jumbled

mess. Now that's unfortunate because there is a lot of structure to this. So, what I

am going to do is, I am actually going to copy all of this data, then I'm going to go

to a little web page; there's a lot of things you can do here, and it's a cute phrase.

It's

called JSON Pretty Print. And that is, make it look structured so it's easier to read.

I just paste that in there and hit Pretty Print JSON, and now you can see hierarchical

structure of the data. The interesting thing is that JSON only has tags at the

beginning. It says series in quotes, then a colon, then it gives the piece of information

in quotes, and a comma and it moves on to the next one. And this is a lot more similar

to the way data would be represented in something like R or Python. It is also more compact.

Again, there are things you can do with XML but this is one of the reasons that JSON is

becoming preferred as a data carrier for websites. And as you may have guessed, it's really easy

to convert between the formats.
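
Here is a short sketch of the same idea in Python: take the jumbled one-line JSON, load it, and pretty print it. The JSON string is made up for illustration; the key names and values are assumptions and not the actual output from the site.

```python
import json

# Hypothetical compact JSON, all on one line, the way a server might send it.
raw = (
    '{"series":"f1","season":"1957","races":['
    '{"round":"1","raceName":"Example Grand Prix A"},'
    '{"round":"2","raceName":"Example Grand Prix B"}]}'
)

# JSON objects become Python dicts and JSON arrays become Python lists.
data = json.loads(raw)

# Pretty printing only changes the layout, not the content.
pretty = json.dumps(data, indent=2)
print(pretty)

# Because the structure maps directly onto dicts and lists,
# pulling out one field per race is a one-liner.
race_names = [race["raceName"] for race in data["races"]]
print(race_names)
```

Notice that each label appears once, in quotes before the colon, with no closing tag; that is where the compactness comes from.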

It's easy to convert between XML, JSON, CSV, etc. You

can get a web page where you can paste a version in and you get the other version out. There

are some differences, but for the vast majority of situations, they are just interchangeable.

In Sum: what did we get from this? Like XML, JSON is semi-structured data, where there

are tags that say what the information is, but you define the tags however you want.

JSON is specifically designed for data interchange and because it reflects the structure of the

data in the programs, that makes it really easy.

Also, because it's relatively compact,

JSON is gradually replacing XML on the web as the container for data on web pages. If

we are going to talk about "Coding and Data Science" and the languages that are used,

then first and foremost is R. The reason for that is, according to many standards, R is

the language of data and data science. For example, take a look at this chart. This is

a ranking based on a survey of data mining experts of the software they use in doing

their work, and R is right there at the top. R is first, and in fact that's important because

there's Python which is usually taken hand in hand with R for Data Science. But R sees

50% more use than Python does, at least in this particular list. Now there's a few reasons

for that popularity. Number one, R is free and it's open source, both of which make things

very easy.

Second, R is specially developed for vector operations. That means it's able

to go through an entire list of data without having to write 'for' loops to go through.

If you've ever had to write 'for' loops, you know that would be kind of disastrous

having to do that with data analysis. Next, R has a fabulous community behind it. It's

very easy to get help on things with R, you Google it, you're going to end up in a place

where you're going to be able to find good examples of what you need. And probably most

importantly, R is very capable. R has 7,000 packages that add capabilities to R. Essentially,

it can do anything. Now, when you are working with R, you actually have a choice of interfaces.

That is, how you actually do the coding and how you get your results. R comes with its

own IDE, or Integrated Development Environment. You can do that, or if you are on a Mac or

Linux you can actually do R through the Terminal, through the command line. If you've

installed R, you just type R and it starts up.

There is also a very popular development

environment called RStudio, at rstudio.com, and that's actually the one I use and the one I will

be using for all my examples. But another new competitor is Jupyter, which is very commonly

used for Python; that's what I use for examples there. It works in a browser window, even

though it's locally installed. And with RStudio and Jupyter, there are pluses and minuses to each

one of them and I'll mention them as we get to each one of them. But no matter which interface

you use, R is command line: you're typing lines of code in order to run the commands.

Some

people get really scared about that but really there are some advantages to that in terms

of the replicability and really the accessibility, the transparency of your commands. So for

instance, here's a short example of some of the commands in R. You can enter them into

what is called a console, and that's just one line at a time and that's called an interactive

way. Or you can save scripts and run bits and pieces selectively and that makes your

life a lot easier. No matter how you do it, if you are familiar with programming other

languages then you're going to find that R's a little weird.

It has an idiosyncratic model.

It makes sense once you get used to it, but it is a different approach, and so it takes

some adaptation if you are accustomed to programming in different languages. Now, once you do your

programming to get your output, what you're going to get is graphs in a separate window.

You're going to get text and numbers, numerical output in the console, and no matter what

you get, you can save the output to files. So that makes it portable, you can do it in

other environments. But most importantly, I like to think of this: here's our box of

chocolates where you never know what you're going to get.

The beauty of R is in the packages

that are available to expand its capabilities. Now there are two sources of packages for

R. One goes by the name of CRAN, and that stands for the Comprehensive R Archive Network,

and that's at cran.rstudio.com. And what that does is takes the 7,000 different packages

that are available and organizes them into topics that they call task views. And for

each one if they have done their homework, they have datasets that come along with the

package. You have a manual in .pdf format, and you can even have vignettes where they

run through examples of how to do it. Another interface is called Crantastic! And the exclamation

point is part of the title. And that is at crantastic.org. And what this is, is an alternative

interface that links to CRAN. So if you find something you like in Crantastic! and you

click on the link, it's going to open in CRAN. But the nice thing about Crantastic! is it

shows the popularity of packages, and it also shows how recently they were updated, and

that can be a nice way of knowing you're getting sort of the latest and greatest.

Now from

this very abstract presentation, we can say a few things about R: Number one, according

to many, R is the language of data science and it's a command line interface. You're

typing lines of code, so that gives it both a strength and a challenge for some people.

But the beautiful thing is the thousands and thousands of packages of additional code

and capability that are available for R, that make it possible to do nearly anything in

this statistical programming language. When talking about "Coding and Data Science" and

the languages, along with R, we need to talk about Python.

Now, Python, like the snake, is a

general-purpose programming language that can do it all, and that's its beauty. If we go back to the

survey of the software used by data mining experts, you see that Python's there and it's

number three on the list. What's significant about that, is that on this list, Python is

the only general purpose programming language. It's the only one that can be theoretically

used to develop any kind of application that you want. That gives it some special powers

compared to all the others, most of which are very specific to data science work. The

nice things about Python are: number one, it's general purpose.

It's also really easy

to use, and if you have a Macintosh or Linux computer, Python is built into it. Also, Python

has a fabulous community around it with hundreds of thousands of people involved, and also

Python has thousands of packages. Now, it actually has 70 or 80,000 packages, but in

terms of ones that are for data, there are still thousands available that give it some

incredible capabilities.

A couple of things to know about Python. First, is about versions.

There are two versions of Python that are in wide circulation: there's 2.x; so that

means like 2.5 or 2.6, and 3.x; so 3.1, 3.2. Version 2 and version 3 are similar, but they

are not identical. In fact, the problem is this: there are some compatibility issues

where code that runs in one does not run in the other. And consequently, most people have

to choose between one and the other. And what this leads to is that many people still use

2.x. I have to admit, in the examples that I use, I'm using 2.x because so many of the

data science packages are developed with that in mind. Now let me say a few things

about the interfaces for Python.

First, Python does come with its own Integrated Development

and Learning Environment, and they call it IDLE. You can also run it from the Terminal, or

command line interface, or any IDE that you have. A very common and a very good choice

is Jupyter. Jupyter is a browser-based framework for programming and it was originally called

IPython. That was its initial name, so a lot of the time when people are talking about

IPython, what they are really talking about is this Python in Jupyter and the two are

sometimes used interchangeably. Another neat thing: there are two companies,

Continuum and Enthought, both of which have made special distributions of Python with

hundreds and hundreds of packages preconfigured to make it very easy to work with data. I

personally prefer Continuum Anaconda, it's the one that I use, a lot of other people

use it, but either one is going to work and it's going to get you up and running.

And

like I said with R, no matter what interface you use, all of them are command line. You're

typing lines of code. Again, there is tremendous strength to that but, it can be intimidating

to some people at first. In terms of the actual commands of Python, we have some examples

here on the side, and the important thing to remember is that it's a text interface.

On the other hand, Python is familiar to millions of people because it is very often a first

programming language people learn to do general purpose programming. And there are a lot of

very simple adaptations for data that make it very powerful for data science work. So,

let me say something else again: data science loves Jupyter, and Jupyter is the browser-based

framework.

It's a local installation, but you access it through a web browser that makes

it possible to really do some excellent work in data science. There's a few reasons for

this. When you're working in Jupyter you get text output and you can use what's called

Markdown as a way of formatting documents. You can get inline graphics, so the graphics

show up directly beneath the code that produced them. It's also really easy to organize,

present, and to share analyses that are done in Jupyter. Which makes it a strong contender

for your choices in how you do data science programming. Another one of the beautiful

things about Python, like R, is there are thousands of packages available. In Python,

there is one main repository; it goes by the name PyPI. Which is for the Python Package

Index.

Right here it says there are over 80,000 packages and 7 or 8,000 of those are for data-specific

purposes. Some of the packages that you will get to be very familiar with are NumPy and

SciPy, which are for scientific computing in general; Matplotlib and a development of

it called Seaborn are for data visualization and graphics. Pandas is the main package for

doing statistical analysis. And for machine learning, almost nothing beats scikit-learn.

And when I go through hands-on examples in Python, I will be using all of these as a

way of demonstrating the power of the program for working with data. In sum we can say a

few things: Python is a very popular language, very familiar to millions of people, and that

makes it a good choice.

Second, of all the languages we use for data science on a frequent

basis, this is the only one that's general purpose. Which means it can be used for a

lot of things other than processing data. And it gets its power, like R does, from having

thousands of contributed packages which greatly expand its capabilities especially in terms

of doing data science work. A choice for "Coding in Data Science," one of the languages that

may not come immediately to mind when people think of data science, is Sequel, or SQL. SQL

is the language of databases and we think, "why do we want to work in SQL?" Well, to

paraphrase the famous bank robber Willie Sutton, who apparently explained why he robbed banks

and said: "Because that's where the money is." The reason we would work with SQL in data

science is because that's where the data is. Let's take another look at our ranking of

software among data mining professionals, and there's SQL. It's third on the list, and

it's also the first database tool on the list.

Other tools, for instance, get much

fancier, and much newer and shinier, but SQL has been around for a while and is very, very capable.

There's a few things to know about SQL. You will notice that I am saying Sequel even though

it stands for Structured Query Language. SQL is a language, not an application. There's

not a program SQL, it's a language that can be used in different applications. Primarily,

SQL is designed for what are called relational databases. And those are special ways of storing

structured data that you can pull in.

You can put things together, you can join them

in special ways, you can get summary statistics, and then what you usually do is then export

that data into your analytical application of choice. The big word here is RDBMS – Relational

Database Management System; that is where you will usually see SQL as a query language

being used. In terms of Relational Database Management System, there are a few very common

choices. In the industrial world where people have some money to spend, Oracle Database

is a very common one, as is Microsoft SQL Server. In the open source world, two very common

choices are MySQL, where even though we generally say Sequel, you generally say

"My S-Q-L". Another one is PostgreSQL. These are both open source, free versions of the language;

sort of dialects of each, that make it possible for you to work with your databases and

for you to get your information out. The neat thing about them, no matter what you do, databases

minimize data redundancy by using connected tables. Each table has rows and columns and

they store different levels of abstraction or measurement, which means you

only have to put the information one place and then it can refer to lots of other tables.

Makes it very easy to keep things organized and up to date.
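
Here is a small sketch of what connected tables look like in practice, using Python's built-in sqlite3 module with a throwaway in-memory database. The table names, column names, and values are all made up for illustration. Each customer's name and city are stored once, and the orders table only refers back to them by id.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: customer details live in one table only,
# and each order points at its customer through customer_id.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount      REAL
);
INSERT INTO customers VALUES (1, 'Customer A', 'City X'), (2, 'Customer B', 'City Y');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0);
""")

# Join the two tables and get summary statistics per customer.
rows = conn.execute("""
SELECT c.name, c.city, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id
ORDER BY c.name
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

If a customer moves to a different city, you change one row in one table, and every order for that customer is automatically up to date.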

When you are looking into

a way of working with a Relational Database Management System, you get to choose in part

between using a graphical user interface or GUI. Some of those include SQL Developer and

SQL Server Management Studio, two very common choices. And there are a lot of other choices

such as Toad and some other choices that are graphical interfaces for working with these

databases. There are also text-based interfaces. So really, any command line interface, and

any interactive development environment or programming tool is going to be able to do

that. Now, you can think of yourself on the command deck of your ship and think of a few

basic commands that are very important for working with SQL. There are just a handful

of commands that can get you where you need to go. There is the Select command, where

you're choosing the fields that you want to include. From: says what tables you are going

to be extracting them from. Where: is a way of specifying conditions, and then Order By:

is just a way of sorting the results.
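
Those four commands can be sketched in a single query. This example again uses Python's built-in sqlite3 module and an in-memory database; the results table, its columns, and its rows are hypothetical placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table of race results with placeholder values.
conn.execute("CREATE TABLE results (driver TEXT, race TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("Driver A", "Race 1", 2),
        ("Driver B", "Race 1", 1),
        ("Driver C", "Race 2", 1),
        ("Driver A", "Race 2", 3),
    ],
)

query = """
SELECT driver, position      -- which columns to return
FROM results                 -- which table to read from
WHERE race = ?               -- which rows to keep
ORDER BY position            -- how to sort what comes back
"""

rows = conn.execute(query, ("Race 1",)).fetchall()
print(rows)

conn.close()
```

From here, the rows would typically be handed to your analysis program of choice.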

This works because usually when

you are in a SQL database you're just pulling out the information. You want to select it,

you want to organize it, and then what you are going to do is you are going to send the

data to your program of choice for further analysis, like R or Python or whatever. In

sum here's what we can say about SQL: Number one, as a language it's generally associated

with relational databases, which are very efficient and well-structured ways of storing

data. Just a handful of basic commands can be very useful when working with databases.

You don't have to be a super ninja expert; really, a handful. Five, 10 commands will probably

get you everything you need out of a SQL database. Then once the data is organized, the data

is typically exported to some other program for analysis. When you talk about coding in

any field, one of the languages or one of the groups of languages that come up most

often are C, C++, and Java.

These are extremely powerful languages and very frequently

used for professional, production level coding. In data science, the place where you will

see these languages most often is in the bedrock. The absolute fundamental layer that makes

the rest of data science possible. For instance, C and C++. C is from the '70s, C++ is from

the '80s, and they have extraordinarily wide usage, and their major advantage is that they're

really, really fast. In fact, C is usually used as the benchmark for how fast a language is.

They are also very, very stable, which makes them really well suited to production-level

code and, for instance, server use.

What's really neat is that in certain situations,

if time is really important, if speed's important, then you can actually use C code in R or other

statistical languages. Next is Java. Java is based on C++, and its major contribution was

the WORA or the Write Once Run Anywhere. The idea that you were going to be able to develop

code that is portable to different machines and different environments. Because of that,

Java is the most popular computer programming language overall across all tech situations.

The place you would use these in data science, like I said, when time is of the essence,

when something has to be fast, it has to get the job accomplished quickly, and it has to

not break.

Then these are the ones you're probably going to use. The people who are

going to use it are primarily going to be engineers. The engineers and the software

developers who deal with the inner workings of the algorithms in data science or the back

end of data science. The servers and the mainframes and the entire structure that makes analysis

possible. Analysts, the people who are actually analyzing the data, typically

don't do hands-on work with the foundational elements. They don't usually touch C or C++,

more of the work is on the front end or closer to the high-level languages like R or Python.

In sum: C, C++ and Java form a foundational bedrock in the back end of data and data science.

They do this because they are very fast and they are very reliable. On the other hand,

given their nature that work is typically reserved for the engineers who are working

with the equipment that runs in the back that makes the rest of the analysis possible. I

want to finish our extremely brief discussion of "Coding in Data Science" and the languages

that can be used, by mentioning one other that's called Bash.

Bash really is a great

example of old tools that have survived and are still being used actively and productively

with new data. You can think of it this way, it's almost like typing on your typewriter.

You're working at the command line, you're typing out code through a command line interface

or a CLI. This method of interacting with computers practically goes back to the typewriter

phase, because it predates monitors. So, before you even had a monitor, you would type out

the code and it would print it out on a piece of paper.

The important thing to know about

the command line is it's simply a method of interacting. It's not a language, because

lots of languages can run at the command line. For instance, it is important to talk about

the concept of a shell. In computer science, a shell is a language or program that wraps

around the computer. It's a shell around the system, the interaction level for

the user to get things done at the lower level that aren't really human-friendly.

On Mac

computers and Linux, the most common is Bash, which is short for Bourne Again Shell. On

Windows computers, the most common is PowerShell. But whatever you do there actually are a lot

of choices, there's the Bourne Shell, the C shell; which is why I have a seashell right

here, the Z shell, there's fish for Friendly Interactive Shell, and a whole bunch of other

choices. Bash is the most common on Mac and Linux and PowerShell is the most common on

Windows as a method of interacting with the computer at the command line level.

There's

a few things you need to know about this. You have a prompt of some kind, in Bash, it's

a dollar sign, and that just means type your command here. Then, the other thing is you

type one line at a time. It's actually amazing how much you can get done with a one-liner

program, by sort of piping things together, so one feeds into the other. You can run more

complex commands if you use a script. So, you call a text document that has a bunch

of things in it and you can get much more elaborate analyses done. Now, we have our

tools here. In Bash we talk about utilities, and these are specific programs

that accomplish specific tasks. Bash really thrives on "Do one thing, and do it very well."

There are two general categories of utilities for Bash. Number one, is the Built-ins. These

are the ones that come installed with it, and so you're able to use them anytime by simply

calling their name. Some of the more common ones are: cat, which is for catenate; that's to

put information together.

There's awk, which is its own interpreted language, but it's

often used for text processing from the command line. By the way, the name 'Awk' comes from

the initials of the people who created it. Then there's grep, which is for Global search

with a Regular Expression and Print. It's a way of searching for information. And then

there's sed, which stands for Stream Editor and its main use is to transform text. You

can do an enormous amount with just these 4 utilities. A few more are head and tail, which display

the first or last 10 lines of a document; sort and uniq, which sort lines and count the number

of unique values in a document; wc, which is for word count; and printf, which formats

the output that you get in your console. And while you can get a huge amount of work done

with just this small number of built-in utilities, there is also a wide range of installable,

or other, command line utilities that you can add to Bash, or whatever programming language

you're using.
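
To show what a few of those utilities are doing, here is a rough Python equivalent of a typical one-liner that pipes sort into uniq with counts and then keeps the top results, plus the line and word counts that wc would report. The input lines are a made-up stand-in for a text file, and the Python is only an approximation of the shell behavior, not a replacement for it.

```python
from collections import Counter

# Hypothetical input standing in for the lines of a text file.
lines = ["red", "blue", "red", "green", "blue", "red"]

# Roughly what sorting, counting unique lines, and keeping the
# two most frequent would give you at the command line.
counts = Counter(lines)
top_two = counts.most_common(2)
print(top_two)

# Roughly what wc reports: number of lines and number of words.
n_lines = len(lines)
n_words = sum(len(line.split()) for line in lines)
print(n_lines, n_words)

# Roughly what head and tail do, here with 2 lines instead of 10.
first_two = lines[:2]
last_two = lines[-2:]
print(first_two, last_two)
```

At the command line each of those steps is its own small utility, and the pipe is what feeds one into the next.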

So, some really good ones that have been recently developed are jq,

which is for pulling in JSON, or JavaScript Object Notation, data from the web. And then

there's json2csv, which is a way of converting JSON to csv format, which is what a lot of

statistical programs are going to be happy with. There's Rio which allows you to run

a wide range of commands from the statistical programming language R in the command line

as part of Bash. And then there's BigMLer. This is a command line tool that allows you

to access BigML's machine learning servers through the command line. Normally, you do

it through a web browser and it accesses their servers remotely. It's an amazingly useful program

but to be able to just pull it up when you're in the command line is an enormous benefit.

What's interesting is that even though you already have all these opportunities, all these different

utilities that can do amazing things, there's still active development of utilities

for the command line. So, in sum: despite being in one sense as old as the dinosaurs,

the command line survives because it is extremely well evolved and well suited to its purpose

of working with data.

The utilities, both the built-in and the installable, are fast

and they are easy. In general, they do one thing and they do it very, very well. And

then surprisingly, there is an enormous amount of very active development of command line

utilities for these purposes, especially with data science. One critical task when you are

Coding in Data Science is to be able to find the things that you are looking for, and Regex

(which is short for Regular Expressions) is a wonderful way to do that. You can think

of it as the supercharged method for finding needles in haystacks. Now, Regex tends to

look a little cryptic so, for instance, here's an example: something that's designed to

determine if something is a valid email address, and it specifies what can go in the beginning,

you have the at sign in the middle, then you've got a certain number of letters and numbers,

then you have to have a dot something at the end.
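The pattern shown in the video is not in the transcript, so here is a deliberately simplified stand-in that follows the same outline: something at the beginning, an at sign, letters and numbers, then a dot something at the end. The pattern and the function name are assumptions for illustration, and this is not a complete email validator.

```python
import re

# Simplified, illustrative pattern (an assumption, not the one from the video):
#   ^[A-Za-z0-9._%+-]+   what can go in the beginning
#   @                    the at sign in the middle
#   [A-Za-z0-9.-]+       letters, numbers, dots and hyphens
#   \.[A-Za-z]{2,}$      a literal dot followed by "something" at the end
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def looks_like_email(text):
    """Return True if text has the rough shape of an email address."""
    return EMAIL_PATTERN.match(text) is not None


print(looks_like_email("someone@example.com"))
print(looks_like_email("someone.example.com"))
print(looks_like_email("someone@example"))
```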

And so, this is a special kind of code

for indicating what can go where. Now regular expressions, or regex, are really a form of

pattern matching in text. And it's a way of specifying what needs to be where, what can

vary, and how much it can vary. And you can write both specific patterns, say I only want

a one-letter variation here, or very general ones, like the email validator that I showed you.

And the idea here is that you can write this search pattern, your little wild card thing,

you can find the data and then once you identify those cases, then you export them into another

program for analysis. So here's a short example of how it can work. What I've done is taken

some text documents, they're actually the texts of Emma and Pygmalion, two books

I got off of Project Gutenberg, and this is the command: grep ^l.ve *.txt. So what I'm

looking for in either of these books are lines that start with 'l', then have

one character that can be whatever, then that's followed by 've', and then the *.txt means

search for all the text files in the particular folder.
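A rough Python equivalent of that grep search can be sketched as follows. To keep it self-contained it runs on a handful of made-up lines rather than the Project Gutenberg files, so the sample text is an assumption and not a quotation from either book.

```python
import re

# Made-up sample lines standing in for the contents of the .txt files.
lines = [
    "love is not the word",
    "lived in the village",
    "lovely weather today",
    "I love this book",
    "leave it alone",
    "Love at first sight",
]

# ^ anchors to the start of the line, 'l' is a literal,
# '.' is any single character, 've' is two more literals.
pattern = re.compile(r"^l.ve")

matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

Only lines that begin with a lowercase 'l', one arbitrary character, and then 've' are kept. Like grep by default, the search is case-sensitive.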

And what it found were lines that

began with love, and lived, and lovely, and so on. Now in terms of the actual nuts and

bolts of regular expressions, there are certain elements. There are literals, and

those are things that are exactly what they mean. You type the letter 'l', you're looking

for the letter 'l'. There are also metacharacters, which specify, for instance, that something needs to

go here; they're characters, but they are really code that stands for something else. Now, there

are also escape sequences, which say: normally this character is used as a placeholder, but

here I really want to look for a literal period. Then you have the entire

search expression that you create and you have the target string, the thing that it

is searching through.
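The difference between a literal, a metacharacter, and an escape sequence is easiest to see side by side. This is a small sketch on a made-up target string; the string and variable names are assumptions for illustration.

```python
import re

# The target string: the thing being searched through (made up for this sketch).
target = "Version 3.5 and version 315"

# Literal: every character means exactly itself (and the search is case-sensitive).
literal_hits = re.findall(r"version", target)

# Metacharacter: the dot is a placeholder for any one character.
placeholder_hits = re.findall(r"3.5", target)

# Escape sequence: the backslash makes the dot mean a real period.
escaped_hits = re.findall(r"3\.5", target)

print(literal_hits)
print(placeholder_hits)
print(escaped_hits)
```

The unescaped dot picks up both "3.5" and "315", while the escaped version finds only the one with an actual period.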

So let me give you a few very short examples. ^ this is the caret.

It is sometimes called a hat or, in French, a circonflexe. What that means is you're looking

for something at the beginning of the string you are searching. For example, you can have

^ and capital M, that means you need something that begins with capital M. For instance the

word "Mac," true, it will find that. But if you have iMac, it's a capital M, but it's

not the first letter and so that would be false, it won't find that. The $ means you

are looking for something at the end of the string.

So for example: ing$ that will find

the word ‘fling' because it ends in ‘ing', but it won't find the word ‘flings' because

it actually ends with an 's'. And then the dot, the period, simply means that we are

looking for one character and it can be anything. So, for example, you can write 'at.'. And

that will find ‘data' because it has an ‘a', a ‘t', and then one letter after

it.

But it won't find ‘flat', because ‘flat' doesn't have anything after the ‘at'. And

so these are extremely simple examples of how it can work. Obviously, it gets more complicated

and the real power comes when you start combining these bits and elements. Now, one interesting

thing about this is you can actually treat this as a game. I love this website, it's

called Regex Golf and it's at regex.alf.nu. What it does is bring up lists of words in

two columns, and your job is to write a regular expression at the top that matches all the

words in the left column and none of the words on the right,

using the fewest characters

possible, and you get a score! And it's a great way of learning how to do regular expressions

and learning how to search in a way that is going to get you the data you need for your

projects. So, in sum: Regex, or regular expressions, help you find the right data for your project,

they're very powerful and they're very flexible. Now, on the other hand, they are cryptic,

at least when you first look at them but at the same time, it's like a puzzle and it can

be a lot of fun if you practice it and you see how you can find what you need.
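The caret, dollar sign, and dot examples from this section can be checked directly, and the same helper can be reused for a home-made round of regex golf. This is only a sketch: the function names and the two word lists are made up, and the real site scores answers in its own way.

```python
import re


def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# The examples from the talk.
print(found(r"^M", "Mac"), found(r"^M", "iMac"))          # starts with capital M?
print(found(r"ing$", "fling"), found(r"ing$", "flings"))  # ends in 'ing'?
print(found(r"at.", "data"), found(r"at.", "flat"))       # 'at' plus one more character?


def solves_golf(pattern, should_match, should_not_match):
    """True if the pattern hits every word on the left and none on the right."""
    return (all(found(pattern, word) for word in should_match)
            and not any(found(pattern, word) for word in should_not_match))


# Hypothetical word lists, in the spirit of the game.
left = ["fling", "sing", "king"]
right = ["flings", "single", "kin"]

for candidate in [r"ing", r"ing$"]:
    print(candidate, len(candidate), solves_golf(candidate, left, right))
```

The shorter pattern fails because it also matches words in the right-hand list, which is exactly the trade-off the game is built around.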

I want

to thank you for joining me in "Coding in Data Science" and we'll wrap up this course

by talking about some of the specific next steps you can take for working in data science.

The idea here, is that you want to get some tools and you want to start working with those

tools. Now, please keep in mind something that I've said at another time.

Data tools

and data science are related, they're important, but don't make the mistake of thinking that

knowing the tools is the same thing as actually conducting data science.

That's not true; people sometimes get a little enthusiastic and they get a little carried

away. What you need to remember is the relationship really is this: Data Tools are an important

part of data science, but data science itself is much bigger than just the tools. Now, speaking

of tools, remember there are a few kinds that you can use, and you might want to get

some experience with each of them. #1, in terms of apps, that is, purpose-built applications, Excel

and Tableau are really fundamental, both for getting data from clients and for doing some

basic data browsing, and Tableau is really wonderful for interactive data visualization.

I strongly recommend you get very comfortable with both of those.

In terms of code, it's

a good idea to learn either R or Python, or ideally to learn both, because

you can use them hand in hand. In terms of utilities, it's a great idea to work with

Bash, the command line shell, and to use regular expressions, or regex. You can actually

use regular expressions in lots and lots of programs, so they have a very wide

application. And then finally, data science requires some sort of domain expertise. You're

going to need some sort of field experience or intimate understanding of a particular

domain and the challenges that come up and what constitutes workable answers and the

kind of data that's available. Now, as you go through all of this, you don't need to

build this monstrous list of things. Remember, you don't need everything. You don't need

every tool, you don't need every function, you don't need every approach. Instead remember,

get what's best for your needs, and for your style. But no matter what you do, remember

that tools are tools, they are a means to an end.

Instead, you want to focus on the

goal of your data science project whatever it is. And I can tell you really, the goal

is in the meaning, extracting meaning out of your data to make informed choices. In

fact, I'll say a little more. The goal is always meaning. And so with that, I strongly

encourage you to get some tools, get started in data science and start finding meaning

in the data that's around you.

Welcome to "Mathematics in Data Science". I'm Barton

Poulson and we're going to talk about how Mathematics matters for data science. Now,

you may be saying to yourself, "Why math?", and "Computers can do it, I don't need to

do it". And really fundamentally, "I don't need math, I am just here to do my work". Well,

I am here to tell you, No. You need math. That is if you want to be a data scientist,

and I assume that you do.

So we are going to talk about some of the basic elements of

Mathematics, really at a conceptual level, and how they apply to data science. There

are a few ways that math really matters to data science. #1, it allows you to know which procedures

to use and why. So you can answer your questions in a way that is the most informative and

the most useful. #2, if you have a good understanding of math, then you know what to do when things

don't work right, when you get impossible values or things won't compute, and that makes

a huge difference. And then #3, an interesting thing is that some mathematical procedures

are easier and quicker to do by hand than by actually firing up the computer. And so

for all 3 of these reasons, it's really helpful to have at least a grounding in Mathematics

if you're going to do work in data science. Now probably the most important thing to start

with is Algebra.

And there are 3 kinds of algebra I want to mention. The first is elementary

algebra, that's the regular x+y. Then there is Linear or matrix algebra, which looks more

complex but is conceptually simpler, and it is what computers use to actually do the calculations.

And then finally I am going to mention Systems of Linear Equations where you have multiple

equations simultaneously that you're trying to solve. Now there's more math than just

algebra. A few other things I'm going to cover in this course: Calculus, a little bit of

Big O, or order, which has to do with the speed and complexity of operations, a little bit

of probability theory, and a little bit of Bayes, or Bayes' theorem, which is used for getting

posterior probabilities and changes the way you interpret the results of an analysis.

And for the purposes of this course, I'm going to demonstrate the procedures by hand, of

course you would use software to do this in the real world, but we are dealing with simple

problems at conceptual levels.

And really, the most important thing to remember is that

even though a lot of people get put off by math, really, you can do it! And so, in sum:

let's say these three things about math. First off, you do need some math to do good data

science. It helps you diagnose problems, it helps you choose the right procedures, and

interestingly, you can do a lot of it by hand, or you can use software to do the

calculations as well.

As we begin our discussion of the role of "Mathematics and Data Science",

we'll of course begin with the foundational elements.

And in data science nothing is more

foundational than Elementary Algebra. Now, I'd like to begin this with really just a

bit of history. In case you're not aware, the first book on algebra was written in 820

by Muhammad ibn Musa al-Khwarizmi. And it was called "The Compendious Book on Calculation

by Completion and Balancing". Actually, that is the translation of its Arabic title, and if you transliterate

the original, look at one word in it: al-jabr. That's the source of the word algebra, and it means

Restoration. In any case, that's where it comes from and for our concerns, there are

several kinds of algebra that we're going to talk about. There's Elementary Algebra,

there's Linear Algebra and there are systems of linear equations. We'll talk about each

of those in different videos. But to put it into context, let's take an example here of

salaries.

Now, this is based on real data from a survey of the salaries of people employed

in data science. To give a simple version of it, the salary was equal to a constant,

that's sort of an average value that everybody started with, and to that you added years,

then some measure of bargaining skills and how many hours they worked per week.

And that

gave you your prediction, but that wasn't exact; there's also some error to throw into

it to get to the precise value that each person has. Now, if you want to abbreviate this,

you can write it kind of like this: S = C + Y + B + H + E, although it's more common

to write it symbolically like this, and let's go through this equation very quickly. The

first thing we have is the outcome; we call that the variable y for person i, where "i" stands

for each case in our observations. So, here's outcome y for person i. This letter here

is a Greek beta, and it represents the intercept or the average. That's why it has a zero,

because we don't multiply it times anything. But right next to it we have a coefficient

for variable 1. So beta, which means a coefficient, sub 1 for the first variable, and then we have

x 1, which means variable 1, and the i means it's the score on that variable for

person i, whoever we are talking about.

Then we do the same thing for variables 2 and 3,

and at the end, we have a little epsilon here with an i for the error term for person i,

which says how far off from the prediction was their actual score. Now, I'm going to

run through some of these procedures and we'll see how they can be applied to data science.
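Since the slide itself is not part of the transcript, here is the equation described above written out in LaTeX. The mapping of the three variables to years, bargaining skill, and hours per week follows the salary example; the exact typography of the original slide is an assumption.

```latex
% Outcome for person i = intercept + three weighted predictors + error
% x_{1i}: years, x_{2i}: bargaining skill, x_{3i}: hours per week (per the salary example)
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i
```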

But for right now let's just say this in sum.

First off, Algebra is vital to data science.

It allows you to combine multiple scores, get a single outcome, and do a lot of other manipulations.

And really, the calculations are easy for one case at a time, especially when you're

doing it by hand.

The next step for "Mathematics for Data Science" foundations is to look at

Linear algebra, an extension of elementary algebra. And depending on your background,

you may know this by another name, and I like to think: welcome to the Matrix. Because it's

also known as matrix algebra, because we are dealing with matrices. Now, let's go back

to an example I gave in the last video about salary. Where salary is equal to a constant

plus years, plus bargaining, plus hours plus error, okay that's a way to write it out in

words and if you want to put it in symbolic form, it's going to look like this. Now before

we get started with matrix algebra, we need to talk about a few new words, maybe you're

familiar with them already. The first is Scalar, and this means a single number. And then a

vector is a single row or a single column of numbers that can be treated as a collection.

That usually means a variable.

And then finally, a matrix consists of many rows and columns.

Sort of a big rectangle of numbers, the plural of that by the way is matrices and the thing

to remember is that Machines love Matrices. Now let's take a look at a very simple example

of this. Here is a very basic representation of matrix algebra or Linear Algebra. Where

we are showing data on two people, on four variables. So over here on the left, we have

the outcomes for cases 1 and 2, our people 1 and 2. And we put it into the square brackets

to indicate that it's a vector or a matrix.

Here on the far left, it's a vector because

it's a single column of values. Next to that is a matrix, that has here on the top, the

scores for case 1, which I've written as x's. X1 is for variable 1, X2 is for variable 2,

and the second subscript indicates that it's for person 1. Below that are the scores

for case 2, the second person. And then over here, in another vertical column are the regression

coefficients, that's a beta there that we are using. And then finally, we've got a tiny

little vector here which contains the error terms for cases 1 and 2. Now, even though

you would not do this by hand, it's helpful to run through the procedure, so I'm going

to show it to you by hand.
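The layout just described, for two people, can be written out in LaTeX roughly like this. The first subscript is the variable and the second is the person, as in the description above. The leading column of ones, which carries the intercept, is an assumption based on the later remark that the intercept is multiplied by 1.

```latex
% Two cases (people), three predictors plus an intercept column of ones
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{21} & x_{31} \\
1 & x_{12} & x_{22} & x_{32}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}
```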

And we are going to take two fictional people. This will be

fictional person #1, we'll call her Sophie. We'll say that she's 28 years old and we'll

say that she has good bargaining skills, a 4 on a scale of 5, and that she works 50

hours a week and that her salary is $118,000.00. Our second fictional person, we'll call him

Lars and we'll say that he's 34 years old and he has moderate bargaining skills 3 out

of 5, works 35 hours per week and has a salary of $84,000.00. And so if we are trying to

look at salaries, we can look at our matrix representation that we had here, with our

variables indicated with their Latin and sometimes Greek symbols. And we will replace those variables

with actual numbers. We have the salary for Sophie, our first person. So why don't we

plug in the numbers here and let's start with the result here. Sophie's salary is $118,000.00

and here's how all these numbers all add up to get that. The first thing here is the intercept.

And we just multiply that times 1, so that's sort of the starting point, and then we get

this number 10, which actually has to do with years over 18.

She's 28 so that's 10 years

over 18, and we multiply each year by $1,395.00. Next is bargaining skills. She's got a 4 out of

5 and for each step up you get $5,900.00. By the way, these are real coefficients from

a survey study of data scientists' salaries. And then finally hours per week. For each

hour, you get $382.00. Now you can add these up and get a predicted value for her, but

it's a little low. It's $30,000.00 low. You may be saying that's pretty messed up;

well, that's because there are something like 40 variables in the equation, including whether she might be the

owner, and if she's the owner then yes, she's going to make a lot more. And then we do a

similar thing for the second case. But what's neat about matrix algebra or Linear Algebra

is that this means the same stuff, and what we have here are these bolded variables that stand

in for entire vectors or matrices. So for instance, this Y, a bold Y, stands for the

vector of outcome scores. This bolded X is the entire matrix of values that each person

has on each variable. This bolded beta is all of the regression coefficients and then

this bolded epsilon is the entire vector of error terms.

And so it's a really super compact

way of representing the entire collection of data and coefficients that you use in predicting

values. So in sum, let's say this. First off, computers use matrices. They like to do linear

algebra to solve problems, and it is conceptually simpler because you can put it all in there

in this tight formation.

In fact, it's a very compact notation and it allows you to manipulate

entire collections of numbers pretty easily. And that's the major benefit of learning

a little bit about linear or matrix algebra.

Our next step in "Mathematics for Data Science

Foundations" is systems of linear equations. And maybe you are familiar with this, but

maybe you're not. And the idea here is that there are times, when you actually have many

unknowns and you're trying to solve for them all simultaneously. And what makes this really

tricky is that a lot of these are interlocked. Specifically that means X depends on Y, but

at the same time Y depends on X.

What's funny about this, is it's actually pretty easy to

solve these by hand and you can also use linear or matrix algebra to do it. So let's take a little

example here of Sales. Let's imagine that you have a company and that you've sold 1,000

iPhone cases, so that they are not running around naked like they are in this picture

here. Some of them sold for $20 and others sold for $5. You made a total of $5,900.00

and so the question is "How many were sold at each price?" Now, you would know this if you were keeping

records, but you can also calculate it from this little bit of information. And to show

you I'm going to do it by hand.

Now, we're going to start with this. We know that for sales,

the quantities at the two price points, x + y, add up to 1,000 total cases sold. And for revenue, we know

that if you multiply a certain number times $20 and another number times $5, that it all

adds up to $5,900.00. Between the two of those we can figure out the rest. Let's start with

sales. Now, what I'm going to do is try to isolate the values. I am going to do that

by putting in this minus y on both sides and then I can take that and I can subtract it,

so I'm left with x is equal to 1,000 - y. Normally I would solve for y, but here I solved for x;

you'll see why in just a second.

Then we go to revenue. We know from earlier that our

sales at these two price points add up to $5,900.00 total. Now what we are going to

do is take the x that's right here and we are going to replace it with the equation

we just got, which is 1,000 - y. Then we multiply that through and we get $20,000.00 minus $20y

plus $5y equals $5,900.00. Well, we can combine these two because they are the same kind of term.

So, minus $20y plus $5y gives us minus $15y, and then we subtract $20,000.00 from both sides. So there it is,

right there on the left, and that disappears, then I get it over on the right side. And

then I do the math there, and I get minus $14,100.00. Well, then I divide both sides

by negative $15.00 and when we do that we get y equals 940.
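The substitution steps just narrated can be written out line by line in LaTeX, with x as the number of cases sold at $20 and y as the number sold at $5:

```latex
\begin{aligned}
x + y &= 1000 \quad\Rightarrow\quad x = 1000 - y \\
20x + 5y &= 5900 \\
20(1000 - y) + 5y &= 5900 \\
20000 - 20y + 5y &= 5900 \\
20000 - 15y &= 5900 \\
-15y &= -14100 \\
y &= 940
\end{aligned}
```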

Okay, so that's one of our

values for sales. Let's go back to sales. We have x plus y equals 1,000. We take the

value we just got, 940, we stick that into the equation, then we can solve for x. Just

subtract 940 from each side, there we go. We get x is equal to 60. So, let's put it

all together, just to recap what happened. What this tells us is that 60 cases were sold

at $20.00 each. And that 940 cases were sold at $5 each. Now, what's interesting about

this is you can also do this graphically. We're going to draw it. So, I'm going to graph

the two equations. Here are the original ones we had. This one predicts sales, this one

gives price. The problem is, these aren't in the canonical form for creating graphs.

That needs to be y equals something else, so we're going to solve both of these for

y.

We subtract x from both sides, there it is on the left, we subtract that. Then we

have y is equal to minus x plus 1,000. That's something we can graph. Then we do the same

thing for price. Let's divide by 5 all the way through, that gets rid of that and then

we've got this 4x, then let's subtract 4x from each side.

And what we are left with

is minus 4x plus 1,180, which is also something we can graph. So this first line, this indicates

cases sold. It originally said x plus y equals 1000, but we rearranged it to y is equal to

minus x plus 1000. And so that's the line we have here. And then we have another line,

which indicates earnings. And this one was originally written as $20.00 times x plus

$5.00 times y equals $5,900.00 total. We rearranged that to y equals minus 4x plus 1,180. That's

the equation for the line and then the solution is right here at the intersection. There's

our intersection and it's at 60 on the number of cases sold at $20.00 and 940 as the number

of cases sold at $5.00 and that also represents the solution of the joint equations. It's

a graphical way of solving a system of linear equations.

So in sum, systems of linear equations

allow us to balance several unknowns and find unique solutions. And in many cases, it's

easy to solve by hand, and it's really easy with linear algebra when you use software

to do it at the same time.

As we continue our discussion of "Mathematics for Data Science"

and the foundational principles, the next thing we want to talk about is Calculus. And I'm

going to give a little more history right here. The reason I'm showing you pictures

of stones is because the word Calculus is Latin for stone, as in a stone used for tallying,

where people would actually have a bag of stones and they would use it to count sheep

or whatever.

And the system of Calculus was formalized in the 1600s simultaneously and independently

by Isaac Newton and Gottfried Wilhelm Leibniz. And there are 3 reasons why Calculus is important

for data science. #1, it's the basis for most of the procedures we do. Things like least

squares regression and probability distributions, they use Calculus in getting those answers.

#2, it matters if you are studying anything that changes over time. If you are measuring

quantities or rates that change over time, then you have to use Calculus. #3, Calculus is

used in finding the maxima and minima of functions, especially when you're optimizing, which is

something I'm going to show you separately. Also, it is important to keep in mind, there

are two kinds of Calculus. The first is differential Calculus, which talks about rates of change

at a specific time. It's also known as the Calculus of change. The second kind of Calculus

is Integral Calculus and this is where you are trying to calculate the quantity of something

at a specific time, given the rate of change.

It's also known as the Calculus of Accumulation.

So, let's take a look at how this works and we're going to focus on differential Calculus.

So I'm going to graph an equation here, I'm going to do y equals x squared, a very simple one,

but it's a curve, which makes it harder to calculate things like the slope. Let's take

a point here that's at minus 2, that's the middle of the red dot. X is equal to minus

2.

And because y is equal to x squared, if we want to get the y value, all we've got to do is take

that negative 2 and square it and that gives us 4. So that's pretty easy. So the coordinates

for that red point are minus 2 on x, and plus 4 on the y. Here's a harder question. "What

is the slope of the curve at that exact point?" Well, it's actually a little tricky because

the curve is always curving there's no flat part on it. But we can get the answer by getting

the derivative of the function. Now, there are several different ways of writing this,

I am using the one that's easiest to type. And let's start with this: the rule is that the derivative of x to the n

is n times x to the n minus 1. The n here is the squared part, so for x squared, n is 2. And you see that

same n comes down in front and also goes into the exponent, so we come over here and we put that same value

2 in right there, and we put the two in right here.
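The rule being applied here is the power rule. Written in LaTeX, first in general and then with n equal to 2:

```latex
\frac{d}{dx}\,x^{n} = n\,x^{\,n-1}
\qquad\Longrightarrow\qquad
\frac{d}{dx}\,x^{2} = 2\,x^{\,2-1} = 2x^{1} = 2x
```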

And then we can do a little bit of subtraction.

2 minus 1 is 1, and truthfully you can just ignore that exponent, and then you get 2x. That is

the derivative, so what we have here is that the derivative of x squared is 2x. That means the slope

at any given point in the curve is 2x. So, let's go back to the curve we had a moment

ago. Here's our curve, here's our point at x equals minus 2, and so the slope is equal to 2x,

well we put in the minus 2, and we multiply it and we get minus 4. So that is the slope

at this exact point in the curve. Okay, what if we choose a different point? Let's say

we came over here to x is equal to 3? Well, the slope is equal to 2x so that's 2 times

3, is equal to 6. Great! And on the other hand, you might be saying to yourself "And

why do I care about this?" There's a reason that this is important and what it is, is

that you can use these procedures to optimize the decisions.
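The two slopes worked out above can be double-checked numerically. This sketch compares the derivative 2x with a slope estimated from two points very close together on the curve; the step size and the function names are choices made for illustration.

```python
def f(x):
    """The curve from the example: y = x squared."""
    return x ** 2


def derivative(x):
    """The derivative found with the power rule: 2x."""
    return 2 * x


def numeric_slope(func, x, h=1e-6):
    """Estimate the slope at x from two nearby points (central difference)."""
    return (func(x + h) - func(x - h)) / (2 * h)


for point in [-2, 3]:
    print(point, derivative(point), round(numeric_slope(f, point), 6))
```

The estimated slopes agree with the power-rule answers, minus 4 at x equals minus 2 and 6 at x equals 3, up to rounding.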

And if that seems a little

too abstract to you, that means you can use them to make more money. And I'm going to

demonstrate that in the next video. But for right now in sum, let's say this. Calculus

is vital to practical data science, it's the foundation of statistics and it forms the

core that's needed for doing optimization.

In our discussion about Mathematics and data

science foundations, the last thing I want to talk about right here is calculus and how

it relates to optimization.

I like to think of this, in other words, as the place where

math meets reality, or it meets Manhattan or something. Now if you remember this graph

I made in the last video, y is equal to x squared, that shows this curve here and we have the

derivative that the slope can be given by 2x. And so when x is equal to 3, the slope

is equal to 6, fine. And this is where this comes into play. Calculus makes it possible

to find values that maximize or minimize outcomes. And if you want to think of something a little

more concrete here, let's think of an example, by the way that's Cupid and Psyche.

Let's

talk about pricing for online dating. Let's assume you've created a dating service and

you want to figure out how much can you charge for it that will maximize your revenue. So,

let's get a few hypothetical parameters involved. First off, let's say that subscriptions, annual

subscriptions cost $500.00 each year and you can charge that for a dating service. And

let's say you sell 180 new subscriptions every week. On the other hand, based on your previous

experience manipulating prices around, you have some data that suggests that for each

$5 you discount from the price of $500.00 you will get 3 more sales.

Also, because it's

an online service, let's make our life a little easier right now and assume there is

no increase in overhead. It's not really how it works, but we'll do it for now. And I'm

actually going to show you how to do all this by hand. Now, let's go back to price first.

We have this. $500.00 is the current annual subscription price and you're going to subtract

$5.00 for each unit of discount, that's what I'm calling d. So, one discount is $5.00, two

discounts is $10.00 and so on.

And then we have a little bit of data about sales, that

you're currently selling 180 subscriptions per week and that you will add 3 more for

each unit of discount that you give. So, what we're going to do here is we are going to

find sales as a function of price. Now, to do that the first thing we have to do is get

the y intercept. So we have price here: $500.00, the current annual subscription

price, minus $5 times d. And what we are going to do is get the y intercept

by solving for when this equals zero. Okay, well, we take the $500 and subtract that from

both sides and then we end up with minus $5d is equal to minus $500.00. Divide both sides

by minus $5 and we are left with d is equal to 100. That is, when d is equal to 100, the price, our x,

is 0. And that tells us how we can get the y intercept, but to get that we have to substitute

this value into sales.

So we take d is equal to 100, and the intercept is equal to 180

plus 3 times d; 180 is the number of new subscriptions per week and then we take the three and we

multiply that times our 100. So, 3 times 100 is equal to 300; add that to the 180

and you get 480. And that is the y intercept in our equation, so when we've

discounted the price all the way to zero, the expected sales figure is 480. Of course that's not

going to happen in reality, but it's necessary for finding the slope of the line. So now

let's get the slope. The slope is equal to the change in y on the y axis divided by the

change in x.

One way we can get this is by looking at sales; we get our 180 new subscriptions

per week plus 3 for each unit of discount and we take our information on price. $500.00

a year minus $5.00 for each unit of discount and then we take the 3d and the $5d and those

will give us the slope. So it's plus 3 divided by minus 5, and that's just minus 0.6. So

that is the slope of the line. Slope is equal to minus 0.6. And so what we have from this

is sales as a function of price, where sales is equal to 480, because that is the y intercept

when price is equal to zero, minus 0.6 times price. So, this isn't the final thing. Now

what we have to do, we turn this into revenue, there's another stage to this.
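The line for sales as a function of price can be rebuilt from the two rules given earlier, price equals 500 minus 5d and sales equals 180 plus 3d. This is a sketch with made-up function names; it takes two discount levels, converts each to a (price, sales) point, and reads the slope and intercept off those two points.

```python
def price_at(d):
    """Price after d units of discount (each unit is $5 off $500)."""
    return 500 - 5 * d


def sales_at(d):
    """Weekly sales after d units of discount (each unit adds 3 sales to 180)."""
    return 180 + 3 * d


# Two points on the line: no discount, and the discount that takes price to zero.
p0, s0 = price_at(0), sales_at(0)
p1, s1 = price_at(100), sales_at(100)

slope = (s1 - s0) / (p1 - p0)  # change in sales divided by change in price
intercept = s1                 # sales when the price is zero


def sales_for_price(price):
    """Sales as a function of price: intercept plus slope times price."""
    return intercept + slope * price


print(slope, intercept)
print(sales_for_price(500))
```

Plugging the current price of $500 back in returns the current 180 sales per week, which is a quick sanity check on the line.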

Revenue is

equal to sales times price, how many things did you sell and how much did it cost. Well,

we can substitute some information in here. If we take sales and we put it in as a function

of price, because we just calculated that a moment ago, then we do a little bit of multiplication

and then we get that revenue is equal to 480 times the price minus 0.6 times the price.

Okay, that's a lot of stuff going on there. What we're going to do now is we're going

to get the derivative, that's the calculus that we talked about. Well, the derivative

of 480 times the price, where price is sort of the x, is simply 480. And

the minus 0.6 times price squared? Well, that's similar to what we did with the curve. And what we

end up with is 0.6 times 2 is equal to 1.2 times the price.
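Putting the two pieces together, the revenue function and its derivative can be written in LaTeX as follows, with p standing for price and R for revenue (the symbol names are a choice made here, not taken from the slides):

```latex
\begin{aligned}
R(p) &= p\,(480 - 0.6p) = 480p - 0.6p^{2} \\
R'(p) &= 480 - (0.6)(2)\,p = 480 - 1.2p
\end{aligned}
```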

This is the derivative of

the original equation. We can solve that for zero now, and just in case you are wondering.

Why do we solve it for zero? Because that is going to give us the place where y is at

a maximum. Now, we had a minus on the squared term, so the shape of the curve is inverted. We are trying

to find this value right here, where it's at the very tippy top of the curve, because

that will indicate maximum revenue. Okay, so what we're going to do is solve for zero.

Let's go back to our equation here.

We want to find out when is that equal to zero? Well,

we subtract 480 from each side, there we go and we divide by minus 1.2 on each side. And

this is our price for maximum revenue. So we've been charging $500.00 a week, but this

says we'll have more total income if we charge $400.00 instead. And if you want to find out

how many sales we can get, currently we have 180, and if you want to know what the sales

volume is going to be at the new price. Well, you take the 480 which is the hypothetical y intercept

when the price is zero, but then we put in our actual price of $400.00, multiply that by 0.6,

we get 240, do the subtraction and we get 240 total. So, that would be 240 new subscriptions

per week. So let's compare this. Current revenue is 180 new subscriptions per week at $500.00

per year. And that means our current revenue is $90,000.00 per year, I know it sounds really

good, but we can do better than that.
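As a rough check of the arithmetic in this pricing example, the calculation can be sketched in Python. The function names (`sales`, `revenue`, `best_price`) are made up for illustration; the only inputs taken from the walkthrough are the intercept of 480, the slope of minus 0.6, and the current price of $500.

```python
def sales(price, intercept=480.0, slope=-0.6):
    """New subscriptions per week predicted by the fitted line."""
    return intercept + slope * price


def revenue(price):
    """Revenue is sales times price: 480*p - 0.6*p**2."""
    return sales(price) * price


def best_price(intercept=480.0, slope=-0.6):
    """Set the derivative (intercept + 2*slope*p) to zero and solve for p."""
    return -intercept / (2 * slope)


p_star = best_price()
current = revenue(500)
optimal = revenue(p_star)

print("price for maximum revenue:", round(p_star, 2))
print("sales at that price:", round(sales(p_star), 2))
print("current vs. optimal revenue:", round(current, 2), round(optimal, 2))
print("ratio:", round(optimal / current, 2))
```

The ratio of the two revenue figures is where the roughly 7% improvement comes from.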

Because the formula for maximum value is 240 times

$400.00, when you multiply those you get $96,000.00. And so the improvement is just a ratio of

those two. $96,000.00 divided by $90,000.00 is equal to 1.07. And what that means is a

7% increase and anybody would be thrilled to get a 7% increase in their business simply

by changing the price and increasing the overall revenue. So, let's summarize what we found

here. If you lower the price by 20%, going from $500.00 per year to $400.00 per year, assuming

all of our other information is correct, then you can increase sales by 33%; that's more

than the 20 that you had and that increases total revenue by 7%. And so we can optimize

the price to get the maximum total revenue and it has to do with that little bit of calculus

and the derivative of the function. So in sum, calculus can be used to find the minima

and maxima of functions including prices. It allows for optimization and that in turn

allows you to make better business decisions. Our next topic in "Mathematics and Data Principles",

is something called Big O. And if you are wondering what Big O is all about, it is about

time.

Or, you can think of it as how long does it take to do a particular operation.

It's the speed of the operation. If you want to be really precise, the growth rate of a

function; how much more it requires as you add elements is called its Order. That's why

it's called Big O, that's for Order. And Big O gives the rate of how things grow as the

number of elements grows, and what's funny is there can be really surprising differences.

Let me show you how it works with a few different kinds of growth rates or Big O.

First off,

there's the ones that I say are sort of on the spot, you can get stuff done right away.

The simplest one is O(1), and that is a constant order. That's something that takes the same

amount of time, no matter what. You can send an email out to 10,000 people just hit one

button; it's done. The number of elements, the number of people, the number of operations,

it just takes the same amount of time. Up from that is Logarithmic, where you take the

number of operations, you get the logarithm of that and you can see it's increased, but

really it's only a small increase, it tapers off really quickly. So an example is finding

an item in a sorted array. Not a big deal. Next, one up from that, now this looks like

a big change, but in the grand scheme, it's not a big change. This is a linear function,

where each operation takes the same unit of time. So if you have 50 operations, you have

50 units of time. If you're storing 50 objects it takes 50 units of space.

So, finding an item

in an unsorted list is usually going to be linear time. Then we have the functions

where I say you know, you'd better just pack a lunch because it's going to take a while.

The best example of this is called Log Linear. You take the number of items and you multiply

that number times the log of the number of items. An example of this is the fast Fourier transform,

which is used for dealing, for instance, with sound or anything that sort of happens over time.

You can see it takes a lot longer; if you have 30 elements you're way up there at the

top of this particular chart at 100 units of time, or 100 units of space or whatever

you want to put it.

And it looks like a lot. But really, that's nothing compared to the

next set where I say, you know you're just going to be camping out you may as well go

home. That includes something like the Quadratic. You square the number of elements, you see

how that kind of just shoots straight up. That's Quadratic growth. And so multiplying

two n-digit numbers, if you're multiplying two numbers that each have 10 digits, it's

going to take you that long, it's going to take a long time.

Even more extreme is this

one, this is the exponential, two raised to the power of the number of items you have.

You'll see, by the way, the red line does not even go all the way to the top. That's

because the graphing software that I'm using, doesn't draw it when it goes above my upper

limit there, so it kind of cuts it off. But this is a really demanding kind of thing,

it's for instance finding an exact solution for what's called the Travelling Salesman

Problem, using dynamic programming. That's an example of exponential rate of growth.

And then one more I want to mention which is sort of catastrophic is Factorial. You

take the number of elements and you raise that to the exclamation point Factorial, and

you see that one cuts off very soon because it basically goes straight up. You have any

number of elements of any size, it's going to be hugely demanding. And for instance if

you're familiar with the Travelling Salesman Problem, that's trying to find the solution

through the brute force search, it takes a huge amount of time.

And you know before something

like that is done, you're probably going to turn to stone and wish you'd never even started.
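To put rough numbers on these growth rates, here is a small Python sketch that evaluates each order at a few input sizes. The names `GROWTH` and `growth_table` are invented for this example, and the natural logarithm is an assumption: the base only changes a constant factor, and the natural log happens to land near the "100 units at 30 elements" mentioned for the log-linear curve.

```python
import math

# One representative function per order discussed above.
GROWTH = {
    "O(1) constant": lambda n: 1,
    "O(log n) logarithmic": lambda n: math.log(n),
    "O(n) linear": lambda n: n,
    "O(n log n) log-linear": lambda n: n * math.log(n),
    "O(n^2) quadratic": lambda n: n ** 2,
    "O(2^n) exponential": lambda n: 2 ** n,
    "O(n!) factorial": lambda n: math.factorial(n),
}


def growth_table(sizes=(5, 10, 20, 30)):
    """Return {order name: [units of work at each input size]}."""
    return {name: [f(n) for n in sizes] for name, f in GROWTH.items()}


for name, values in growth_table().items():
    print(f"{name:<24}{[round(v) for v in values]}")
```

Reading down the last column shows why the chart cuts off the exponential and factorial curves: by 30 elements they are many orders of magnitude beyond the others.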

The other thing to know about this, is that not only do some things take longer than others,

some of these methods and some functions are more variable than others. So for instance,

if you're working with data that you want to sort, there are different kinds of sort

or sorting methods.
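To make that variability concrete, here is a sketch in Python that counts comparisons for two simple sorting methods, insertion sort and selection sort, on input that is already sorted versus input that is reversed. These are textbook versions written for illustration, not any particular library's implementation, and the function names are made up.

```python
def insertion_sort_comparisons(items):
    """Sort a copy of items with insertion sort; return (sorted list, comparisons)."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= key:      # already in place: stop early
                break
            a[j + 1] = a[j]      # shift the larger element right
            j -= 1
        a[j + 1] = key
    return a, comparisons


def selection_sort_comparisons(items):
    """Sort a copy of items with selection sort; return (sorted list, comparisons)."""
    a = list(items)
    comparisons = 0
    for i in range(len(a) - 1):
        smallest = i
        for j in range(i + 1, len(a)):   # always scans the whole remainder
            comparisons += 1
            if a[j] < a[smallest]:
                smallest = j
        a[i], a[smallest] = a[smallest], a[i]
    return a, comparisons


already_sorted = list(range(100))
reversed_input = already_sorted[::-1]

for label, data in [("sorted", already_sorted), ("reversed", reversed_input)]:
    _, ins = insertion_sort_comparisons(data)
    _, sel = selection_sort_comparisons(data)
    print(f"{label:>8}: insertion={ins}, selection={sel}")
```

Insertion sort's count swings between roughly n and roughly n squared over two depending on the input, while selection sort does the same amount of work either way.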

So for instance, there is something called an insertion sort. And

when you find this on its best day, it's linear. It's O of n, that's not bad. On the other

hand the average is Quadratic and that's a huge difference between the two. Selection

sorts on the other hand, the best is quadratic and the average is quadratic. It's always

consistent, so it's kind of funny, it takes a long time, but at least you know how long

it's going to take versus the variability of something like an insertion sort. So in

sum, let me say a few things about Big O.

#1, You need to know that certain functions

or procedures vary in speed, and the same thing applies to making demands on a computer's

memory or storage space or whatever. They vary in their demands. Also, some are inconsistent.

Some are really efficient sometimes and really slow or difficult at others. Probably the

most important thing here is to be aware of the demands of what you are doing. That you

can't, for instance, run through every single possible solution or you know, your company

will be dead before you get an answer. So be mindful of that so you can use your time

well and get the insight you need, in the time that you need it. A really important

element of the "Mathematics and Data Science" and one of its foundational principles is

Probability.

Now, one of the places that Probability comes in intuitively for a lot of people is

something like rolling dice or looking at sports outcomes. And really the fundamental

question of what are the odds of something. That gets at the heart of Probability. Now

let's take a look at some of the basic principles. We've got our friend, Albert Einstein here

to explain things. The Principles of Probability work this way. Probabilities range from zero

to 1, that's like zero percent to one hundred percent chance. When you put P, then in parentheses

here A, that means the Probability of whatever is in parentheses.

So P(A), means the Probability

of A, and then P(B) is the Probability of B. When you take all of the probabilities

together, you get what is called the probability Space. And that's why we have S and that all

adds up to 1, because you've now covered 100 % of the possibilities. Also you can talk

about the complement. The tilde here is used to say the probability of not A is equal to

1 minus the probability of A, because those have to add up to 1. So, let's also take a look at

conditional probabilities, which are really important in statistics. A conditional

probability is the probability of something if something else is true. You write it this

way: the probability of, and that vertical line is called a Pipe and it's read as assuming

that or given that. So you can read this as the probability of A given B, is the probability

of A occurring if B is true. So you can ask, for instance: if something's

orange, what's the probability that it's a carrot, given this picture?

Now, the place that

this comes in really important for a lot of people is the probability of type one and

type two errors in hypothesis testing, which we'll mention at some other point. But I do

want to say something about arithmetic with probabilities because it does not always work

out the way people think it will. Let's start by talking about adding probabilities. Let's

say you have two events A and B, and let's say you want to find the probabilities of

either one of those events. So that's like adding the probabilities of the two events.

Well, it's kind of easy. You take the probability of event A and you add the probability of

event B, however you may have to subtract something, you may have to subtract this little

piece because maybe there is some overlap between the two of them. On the other hand

if A and B are disjoint, meaning they never occur together, then that's equal to zero.

And then you subtract zero, which just gets you back to the sum of the original probabilities.
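In symbols, the addition rule just described can be written as follows. The union and intersection notation is standard, though the spoken walkthrough only refers to the formula on screen.

```latex
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\
P(A \cup B) &= P(A) + P(B) \qquad \text{when } A \text{ and } B \text{ are disjoint, so } P(A \cap B) = 0
\end{align*}
```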

Let's take a really easy example of this.

I've created my super simple sample space.

I have 10 shapes. I have 5 squares on top, 5 circles on the bottom and I've got a couple

of red shapes on the right side. Let's say we want to find the probability of a square

or a red shape. So we are adding the probabilities but we have to adjust for the overlap between

the two. Well here's our squares on top. 5 out of the 10 are squares and over here on

the right we have two red shapes, two out of 10. Let's go back to our formula here and

let's change a little bit. Change the A and the B to S and R for square and red. Now we

can start this way, let's get the probability that something is a square. Well, we go back

to our probability space and you see we have 5 squares out of 10 shapes total. So we do

5 over 10, that reduces to .5. Okay, next up the probability of something red in our

sample space. Well, we have 10 shapes total, two of them on the far right are red. That's

two over 10, and you do the division and get .2.

Now, the trick is the overlap between these

two categories, do we have anything that is both square and red, because we don't want

to count that twice we have to subtract it. Let's go back to our sample space and we are

looking for something that is square, there's the squares on top and there's the things

that are red on the side. And you see they overlap and this is our little overlapping

square. So there's one shape that meets both of those, one out of 10. So we come back here,

one out of 10, that reduces to .1 and then we just do the addition and subtraction here.
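The addition for this sample space can be sketched in Python by listing the ten shapes and counting. The layout follows the description (5 squares, 5 circles, the right-most shape in each row red); the colour of the other shapes is not stated, so the label "other" is a placeholder, and the helper name `prob` is made up for this example.

```python
from fractions import Fraction

# 5 squares on top, 5 circles on the bottom; the right-most shape in each row is red.
# "other" stands in for whatever colour the remaining shapes are (not stated).
SHAPES = (
    [("square", "other")] * 4 + [("square", "red")]
    + [("circle", "other")] * 4 + [("circle", "red")]
)


def prob(event, space=SHAPES):
    """Probability of an event as (matching shapes) / (all shapes)."""
    return Fraction(sum(1 for shape in space if event(shape)), len(space))


def is_square(shape):
    return shape[0] == "square"


def is_red(shape):
    return shape[1] == "red"


p_square = prob(is_square)                             # 5 out of 10
p_red = prob(is_red)                                   # 2 out of 10
p_both = prob(lambda s: is_square(s) and is_red(s))    # the overlap, 1 out of 10

# Addition rule: add the two probabilities, then subtract the overlap once.
p_either = p_square + p_red - p_both

# Cross-check by counting shapes that are square or red directly.
direct = prob(lambda s: is_square(s) or is_red(s))

print(p_either, direct)
```

Both routes give the same fraction, which is the point of subtracting the overlap: the red square is counted once, not twice.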

.5 plus .2 minus .1, gets us .6. And so what that means is, there is a 60% chance of an

object being square or red. And you can look at it right here. We have 6 shapes outlined

now and so that's the visual interpretation that lines up with the mathematical one we

just did. Now let's talk about multiplication for Probabilities. Now the idea here is you

want to get joint probabilities, so the probability of two things occurring together, simultaneously.

And what you need to do here, is you need to multiply the probabilities.

And we can

say the probability of A and B, because we are asking about A and B occurring together,

a joint occurrence. And that's equal to the probability of A times the probability of

B, that's easy. But you do have to expand it just a little bit because you can have

the problem of things overlapping a little bit, and so you actually need to expand it

to a conditional probability, the probability of B given A. Again, that's that vertical

pipe there. On the other hand, if A and B are independent, meaning

B is no more likely to occur if A happens, then that just reduces to the probability of

B, and you get your slightly simpler equation. But let's go and take a look at our sample

space here. So we've got our 10 shapes, 5 of each kind, and then two that are red.

Originally we looked at the probability of something being square or red; now we are

going to look at the probability of it being square and red. Now, I know we can eyeball

this one real easy, but let's run through the math.

The first thing we need to do, is

get the ones that are square. There's those 5 on the top and the ones that are red, and

there's those two on the right. In terms of the ones that are both square and red, yes

obviously there's just this one red square at the top right. But let's do the numbers

here.

We change our formula to be S and R for square and red, we get the probability

of square. Again that's those 5 out of 10, so we do 5/10, reduce this to .5. And then

we need the probability of red given that it's a square. So, we only need to look at

the squares here. There's the squares, 5 of them, and one of them is red. So that's 1

over 5. That reduces to .2. You multiply those two numbers; .5 times .2, and what you

get is .10, a 10% chance; that is, 10 percent of our total sample space is red squares. And

you come back and you look at it and you say yeah there's one out of 10. So, that just

confirms what we are able to do intuitively.
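The multiplication just worked through can be sketched in Python as well. The sample space is rebuilt here so the block stands on its own; as before, "other" is a placeholder for the unstated colour of the non-red shapes, and the variable names are illustrative.

```python
from fractions import Fraction

# Same 10 shapes: 5 squares, 5 circles, the right-most shape in each row is red.
SHAPES = (
    [("square", "other")] * 4 + [("square", "red")]
    + [("circle", "other")] * 4 + [("circle", "red")]
)

squares = [s for s in SHAPES if s[0] == "square"]
reds = [s for s in SHAPES if s[1] == "red"]

p_square = Fraction(len(squares), len(SHAPES))                       # 5 out of 10
p_red = Fraction(len(reds), len(SHAPES))                             # 2 out of 10

# Conditional probability: look only at the squares and count the red ones.
p_red_given_square = Fraction(
    sum(1 for s in squares if s[1] == "red"), len(squares)
)                                                                    # 1 out of 5

# Multiplication rule: P(S and R) = P(S) * P(R | S).
p_square_and_red = p_square * p_red_given_square

# Direct count of red squares, as a cross-check.
direct = Fraction(
    sum(1 for s in SHAPES if s == ("square", "red")), len(SHAPES)
)

print(p_square_and_red, direct, p_red_given_square == p_red)
```

In this particular sample space the probability of red given square equals the overall probability of red, so shape and colour happen to be independent here and the simpler product of the two plain probabilities gives the same answer. That will not hold in general.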

So, that's our short presentation on probabilities

and in sum what did we get out of that? #1, Probability is not always intuitive. And also

the idea that conditional probabilities can help in a lot of situations, but they may not work

the way you expect them to. And really the arithmetic of Probability can surprise people

so pay attention when you are working with it so you can get a more accurate conclusion

in your own calculations. Let's finish our discussion of "Mathematics and Data Science"

and the basic principles by looking at something called Bayes' theorem. And if you're familiar

with regular probability and inferential testing, you can think of Bayes' theorem as the flip

side of the coin.

You can also think of it in terms of intersections. So for instance,

standard inferential tests and calculations give you the probability of the data; that's

our d, given the hypothesis. So, if you assume a null hypothesis is true, this will give

you the probability of the data arising by chance. The trick is, most people actually

want the opposite of that. They want the probability of the hypothesis given the data. And unfortunately,

those two things can be very different in many circumstances. On the other hand, there's

a way of dealing with it, Bayes does it and this is our guy right here. Reverend Thomas

Bayes, 18th Century English minister and statistician. He developed a method for getting what he

called posterior probabilities that uses prior probabilities and test information,

or something like base rates, how common something is overall, to get the posterior or after the

fact Probability. Here's the general recipe to how this works: You start with the probability

of the data given the hypothesis which is what you get from the likelihood of the data.

You also get that from a standard inferential test.

To that, you need to add the probability

of the hypothesis or the cause being true. That's called the prior or the prior probability.

To that you add P(D), the probability of the data; that's called the marginal probability.

And then you combine those in a special way to get the probability of the hypothesis

given the data or the posterior probability. Now, if you want to write it as an equation,

you can write it in words like this; posterior is equal to likelihood times prior divided

by marginal. You can also write it in symbols like this; the probability of H given D, the

probability of the hypothesis given the data, that's the posterior probability, is equal

to the probability of the data given the hypothesis, that's the likelihood, multiplied by the probability

of the hypothesis and divided by the probability of the data overall. But this is a lot easier

if we look at a visual version of it. So, let's go to this example here. Let's say we have

a square here that represents 100% of all people and we are looking at a medical condition.

And what we are going to say here is that we got this group up here that represents

people who have a disease, so that's a portion of all people.

And then what we say is, we

have a test, and of the people with the disease, 90% will test positive, so they're marked

in red. Now that does mean that over here on the far left are people with the disease who test

negative; that's 10%. Those are our false negatives. And so if the test catches 90% of the people

who have the disease, that's good, right? Well, let's look at it this way. Let me ask you

a basic question. "If a person tests positive for a disease, then what is the probability

they really have the disease?" And if you want a hint, I'm going to give you one.

It's

not 90%. Here's how it goes. So this is the information I gave you before and we've got

90% of the people who have the disease; that's a conditional probability, they test positive.

But what about the other people, the people in the big white area below, 'of all people'?

We need to look at them and ask if any of them ever test positive. Do we ever get false positives?

Well, with any test you are going to get false positives. And so let's say our people without

the disease, 90% of them test negative, the way they should. But of the people who don't

have the disease, 10% of them test positive, those are false positives. And so if you really

want to answer the question, "If you test positive do you have the disease?", here's

what you need. What you need is the number of people with the disease who test positive

divided by all people who test positive. Let's look at it this way. So here's our information.

We've got 29.7% of all people are in this darker red box, those are the people who have

the disease and test positive, alright that's good.

Then we have 6.7% of the entire group,

that's the people without the disease who test positive. So what we want is the

probability of the disease given a positive test: take the percentage who have the disease and test positive, and then

divide that by all the people that test positive. And that bottom part is made up of two things.

That's made up of the people who have the disease and test positive, and the people

who don't have the disease and test positive. Now we can take our numbers and start plugging

them in. Those who have the disease and test positive that's 29.7% of the total population

of everybody. We can also put that number right here. That's fine, but we also need

to look at the percentage that do not have the disease and test positive; of the total

population, that's 6.7%. So, we just need to rearrange, we add those two numbers on

the bottom, we get 36.4% and we do a little bit of division. And the number we get is

81.6%, here's what that means. A positive test result still only means a probability

of 81.6% of having the disease.
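The division in this example can be sketched in a couple of lines of Python, using only the two percentages given: 29.7% of everyone has the disease and tests positive, and 6.7% of everyone does not have it and tests positive. The function name `posterior_from_joint` is made up for illustration.

```python
def posterior_from_joint(true_positive_share, false_positive_share):
    """P(disease | positive test) from the two shares of the whole population
    that test positive: those with the disease and those without it."""
    all_positive = true_positive_share + false_positive_share
    return true_positive_share / all_positive


p = posterior_from_joint(0.297, 0.067)
print(f"{p:.1%}")
```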

So, the test is advertised as having 90% accuracy, but

if you test positive there's really only an 82% chance you have the disease. Now that's

not really a big difference. But consider this: what if the numbers change? For instance,

what if the probability of the disease changes? Here's what we originally had. Let's move

it around a little bit. Let's make the disease much less common. And so now what we

are going to have is that 4.5% of all people are people who have the disease and test positive. And

then because there is a larger number of people who don't have the disease, we are going to

have a relatively larger proportion of false positives. Again, compared to the entire population

it's going to be 9.5% of everybody. So we are going to go back to our formula here in

words and start plugging in the numbers. We get 4.5% right there, and right there. And

then we add in our other number, the false positives that's 9.5%. Well, we rearrange

and we start adding things up, that's 14% and when we divide that, we get 32.1%. Here's

what that number means.

That means a positive test result; you get a positive test result,

now means you only have a probability of 32.1% of having the disease. That's roughly two-thirds less than

the accuracy of 90%, and in case you can't tell, that's a really big difference. And

that's why Bayes' theorem matters, because it answers the questions that people want answered,

and the answer can be dramatically different depending on the base rate of the thing you

are talking about. And so in sum, we can say this. Bayes' theorem allows you to answer the

right question. People really want to know: what's the probability that I have the disease? Not:

what's the probability of getting a positive if I have the disease? They want to know whether

they have the disease. And to do this, you need to have prior probabilities, you need

to know how common the disease is, you need to know how many people get positive test

results overall.

But if you can get that information and run it through, it can change

your answers and really the emotional significance of what you're dealing with dramatically.
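The general recipe (posterior equals likelihood times prior, divided by marginal) can be sketched as a small Python function. The name `bayes_posterior` and its parameters are invented for this example. The base rates of 33% and 5% are not stated outright in the walkthrough; they are inferred from the figures given (29.7% and 4.5% of everyone being true positives at a 90% detection rate), and the 1% case is an extra hypothetical added to show the trend.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | D) = P(D | H) * P(H) / P(D).

    prior               -- P(H): how common the disease is (the base rate)
    likelihood          -- P(D | H): chance of a positive test given the disease
    false_positive_rate -- chance of a positive test without the disease
    """
    # Marginal P(D): everyone who tests positive, with or without the disease.
    marginal = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / marginal


# Same test (90% detection, 10% false positives) at different base rates.
for base_rate in (0.33, 0.05, 0.01):
    p = bayes_posterior(prior=base_rate, likelihood=0.9, false_positive_rate=0.1)
    print(f"base rate {base_rate:.0%}: P(disease | positive) = {p:.1%}")
```

Holding the test fixed and changing only the base rate is what moves the answer from about 82% to about 32%, and lower still for rarer conditions.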

Let's wrap up some of our discussion of "Mathematics and Data Science" and the data principles

and talk about some of the next steps. Things you can do afterwards. Probably the most important

thing is, you may have learned about math a long time ago but now it's a good time to

dig out some of those books and go over some of the principles you've used before. The

idea here is that a little math can go a long way in data science. So, things like Algebra

and things like Calculus and things like Big O and Probability. All of those are important

in data science and it's helpful to have at least a working understanding of each. You

don't have to know everything, but you do need to understand the principles of the

procedures that you select when you do your projects.

There are two reasons for that very

generally speaking. First, you need to know if a procedure will actually answer your question.

Does it give you the outcome that you need? Will it give you the insight that you need?

Second; really critical, you need to know what to do when things go wrong. Things don't

always work out, numbers don't always add up, you get impossible results or things just

aren't responding. You need to know enough about the procedure and enough about the mathematics

behind it, so you can diagnose the problem, and respond appropriately. And to repeat myself

once again, no matter what you're working on in data science, no matter what tool you're

using, what procedure you're doing, focus on your goal. And in case you can't remember

that, your goal is meaning. Your goal is always meaning. Welcome to "Statistics in Data Science".

I'm Barton Poulson and what we are going to be doing in this course is talking about some

of the ways you can use statistics to see the unseen. To infer what's there, even when

most of it's hidden.

Now this shouldn't be surprising. If you remember the data science

Venn Diagram we talked about a while ago, we have math up here at the top right corner,

but if you were to go to the original description of this Venn Diagram, its full name was math

and stats. And let me just mention something in case it's not completely obvious about

why statistics matters to data science. And the idea is this; counting is easy. It's easy

to say how many times a word appears in a document, it's easy to say how many people

voted for a particular candidate in one part of the country.

Counting is easy, but summarizing

and generalizing those things is hard. And part of the problem is there's no such thing as

a definitive analysis. All analyses, really, depend on the purposes that you're dealing

with. So as an example, let me give you a couple of pairs of words and try to summarize

the difference between them in just two or three words. In a word or two, how is a souffle

different from a quiche, or how is an Aspen different from a Pine tree? Or how is Baseball

different from Cricket? And how are musicals different from opera? It really depends on

who you are talking to, it depends on your goals and it depends on the shared knowledge.

And so, there's not a single definitive answer, and then there's the matter of generalization.

Think about it again, take music. Listen to three concerti by Antonio Vivaldi, and do

you think you can safely and accurately describe all of his music? Now, I actually chose Vivaldi

on purpose because even Igor Stravinsky said you could; he said Vivaldi didn't write 500 concertos,

he wrote the same concerto 500 times.

But, take something more real world like politics.

If you talk to 400 registered voters in the US, can you then accurately predict the behavior

of all of the voters? There's about 100 million voters in the US, and that's a matter of generalization.

That's the sort of thing we try to take care of with inferential statistics. Now there

are different methods that you can use in statistics and all of them are designed to

give you a map; a description of the data you're working on. There are descriptive statistics,

there are inferential statistics, there's the inferential procedure Hypothesis testing

and there's also estimation and I'll talk about each of those in more depth.

There are

a lot of choices that have to be made and some of the things I'm going to discuss in

detail are for instance the choice of Estimators, that's different from estimation. Different

measures of fit. Feature selection, for knowing which variables are the most important in

predicting your outcome. Also common problems that arise when trying to model data and the

principles of model validation. But through this all, the most important thing to remember

is that analysis is functional. It's designed to serve a particular purpose. And there's

a very wonderful quote within the statistics world that says all models are wrong. All

statistical descriptions of reality are wrong, because they are not exact depictions, they

are summaries but some are useful and that's from George Box. And so the question is, you're

not trying to be totally, completely accurate, because in that case you just wouldn't do

an analysis.

The real question is, are you better off doing your analysis than not

doing it? And truthfully, I bet you are. So in sum, we can say three things: #1, you want

to use statistics to both summarize your data and to generalize from one group to another

if you can. On the other hand, there is no "one true answer" with data, you've got to be

flexible in terms of what your goals are and the shared knowledge. And no matter what you're

doing, the utility of your analysis should guide you in your decisions.

The first thing

we want to cover in "Statistics in Data Science" is the principles of exploring data and this

video is just designed to give an exploration overview. So we like to think of it like this,

the intrepid explorers, they're out there exploring and seeing what's in the world.

You can see what's in your data, more specifically you want to see what your dataset is like.

You want to see if your assumptions are right so you can do a valid analysis with your procedure.

Something that may sound very weird, but you want to listen to your data. If something's not

working out, if it's not going the way you want, then you're going to have to pay attention

and exploratory data analysis is going to help you do that. Now, there are two general

approaches to this. First off, there's a graphical exploration, so you use graphs and pictures

and visualizations to explore your data.

The reason you want to do this is that graphics

are very dense in information. They're also really good, in fact the best way to get the overall

impression of your data. Second to that, there is numerical exploration. Let me make it very clear,

this is the second step. Do the visualization first, then do the numerical part. Now you

want to do this, because this can give greater precision, this is also an opportunity to

try variations on the data. You can actually do some transformations, move things around

a little bit and try different methods and see how that affects the results, see how

it looks. So, let's go first to the graphical part. There are very quick and simple plots

that you can do.

Those include things like bar charts, histograms and scatterplots, very

easy to make and a very quick way of getting to understand the variables in your dataset.

In terms of numerical analysis; again after the graphical method, you can do things like

transform the data, that is take like the logarithm of your numbers. You can do Empirical

estimates of population numbers, and you can use robust methods. And I'll talk about all

of those at length in later videos. But for right now, I can sum it up this way.

The purpose

of exploration is to help you get to know your data. And also you want to explore your

data thoroughly before you start modelling, before you build statistical models. And all

the way through you want to make sure you listen carefully so that you can find hidden

or unassumed details and leads in your data. As we move in our discussion of "Statistics

and Exploring Data", the single most important thing we can do is Exploratory Graphics. In

the words of the late great Yankees catcher Yogi Berra, "You can see a lot by just looking".

And that applies to data as much as it applies to baseball.

Now, there's a few reasons you

want to start with graphics. #1, is to actually get a feel for the data. I mean, what's it

distributed like, what's the shape, are there strange things going on. Also it allows you

to check the assumptions and see how well your data match the requirements of the analytical

procedures you hope to use. You can check for anomalies like outliers and unusual distributions

and errors and also you can get suggestions. If something unusual is happening in the data,

that might be a clue that you need to pursue a different angle or do a deeper analysis.

Now we want to do graphics first for a couple of reasons.

#1, is they are very information

dense, and fundamentally humans are visual. It's our single, highest bandwidth way of

getting information. It's also the best way to check for shape and gaps and outliers.

There's a few ways that you can do this if you want to and the first is with programs

that rely on code. So you can use the statistical programming language R, the general purpose

language Python. You can actually do a huge amount in JavaScript, especially D3.js. Or

you can use apps that are specifically designed for exploratory analysis; that includes Tableau,

both the desktop and public versions, and Qlik, and even Excel is a good way to do this. And

finally you can do this by hand. John Tukey who's the father of Exploratory Data Analysis,

wrote his seminal book, a wonderful book where it's all hand graphics and actually it's a

wonderful way to do it.

But let's start the process for doing these graphics. We start

with one variable. That is univariate distributions. And so you'll get something like this, the

fundamental chart is the bar chart. This is when you are dealing with categories and you

are simply counting however many cases there are in each category. The nice thing about

bar charts is they are really easy to read. Put them in descending order and maybe have

them vertical, maybe have them horizontal. Horizontal could be nice to make the labels

a little easier to read. This is about psychological profiles of the United States, this is real

data.
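
Here's a minimal sketch of that counting step in Python. The category labels below are made up for illustration (they are not the data from the chart), and the "bars" are just text so nothing beyond the standard library is needed.

```python
from collections import Counter

# Hypothetical category labels, one per case (not the data shown in the video).
regions = [
    "friendly and conventional", "relaxed and creative",
    "friendly and conventional", "temperamental and uninhibited",
    "friendly and conventional", "temperamental and uninhibited",
]

# Count the cases in each category; most_common() returns them in descending order.
counts = Counter(regions).most_common()

# Print a horizontal text bar chart so the labels stay easy to read.
for label, n in counts:
    print(f"{label:<32}{'#' * n} {n}")
```

Sorting by count before drawing is the "descending order" advice in code form; a plotting library would take the same counts as input.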

We have most states in the friendly and conventional, a smaller amount in the

temperamental and uninhibited and the least common of the United States is relaxed and

creative. Next you can do a Box plot, or sometimes called a box and whiskers plot. This is when

you have a quantitative variable, something that's measured and you can say how far apart

scores are. A box plot shows quartile values, it also shows outliers. So for instance this

is Google searches for modern dance. That's Utah at 5 standard deviations above the national

average. That's where I'm from and I'm glad to see that there.

Also, it's a nice way to

show many variables side by side, if they are on approximately similar scales. Next, if

you have quantitative variables, you are going to want to do a histogram. Again, quantitative

so interval or ratio level, or measured variables. And these let you see the shape of a distribution

and potentially compare many. So, here are three histograms of Google searches on Data

Science, and Entrepreneur and Modern Dance.

And you can see, for the most part normally

distributed with a couple of outliers. Once you've done one variable, or the univariate

analyses, you're going to want to do two variables at a time. That is bivariate distributions

or joint distributions. Now, one easy way to do this is with grouped plots. You can

do grouped bar charts and box plots. What I have here is grouped box plots. I have my

three regions, Psychological Regions of the United States and I'm showing how they rank

on openness that's a psychological characteristic. As you can see, the relaxed and creative are

high and the friendly conventional tend to go to the lowest and that's kind of how that

works. It's also a good way of seeing the association between a categorical variable

like region of the United States psychologically, and a quantitative outcome, which is what

we have here with openness. Next, you can also do a Scatterplot. That's where you have

quantitative variables and what you're looking for here is, is it a straight line? Is it

linear? Do we have outliers? And also the strength of association.

How closely do the

dots all come to the regression line that we have here in the middle. And this is an

interesting one for me because we have openness across the bottom, so more open as you go

to the right and agreeableness. And what you can see is there is a strong downhill association.

The states that are the most open are also the least agreeable, so we're

going to have to do something about that. And then finally, you're going to want to

go to many variables, that is multivariate distributions.

Now, one big question here

is 3D or not 3D? Let me make an argument for not 3D. So, what I have here is a 3D Scatterplot

about 3 variables from Google searches. Up the left, I have FIFA which is for professional

soccer. Down there on the bottom left, I have searches for the NFL and on the right I have

searches for NBA. Now, I did this in R and what's neat about this is you can click and

drag and move it around. And you know that's kind of fun, you kind of spin around and it

gets kind of nauseating as you look at it. And this particular version, I'm using plotly

in R, allows you to actually click on a point and see, let me see if I can get the floor

in the right place. You can click on a point and see where it ranks on each of these characteristics.

You can see however, this thing is hard to control and once it stops moving, it's not

much fun and truthfully most 3D plots I've worked with are just kind of nightmares.

They

seem like they're a good idea, but not really. So, here's the deal. 3D graphics, like the

one I just showed you, because they are actually being shown in 2D, they have to be in motion

for you to tell what is going on at all. And fundamentally they are hard to read and confusing.

Now it's true, they might be useful for finding clusters in 3 dimensions, we didn't see that

in the data we had, but generally I just avoid them like the plague. What you do want to

do however, is see the connection between the variables, you might want to use a matrix

of plots. This is where you have for instance many quantitative variables, you can use markers

for group membership if you want, and I find it to be much clearer than 3D. So here, I

have the relationship between 4 search terms: NBA, NFL, MLB for Major League Baseball and

FIFA. You can see the individual distributions, you can see the scatterplots, you can get

the correlation.

Truthfully for me this is a much easier chart to read and you can get

the richness that we need, from a multidimensional display. So the questions you're trying to

answer overall are: Number 1, Do you have what you need? Do you have the variables that

you need, do you have the variability that you need? Are there clumps or gaps in the distributions?

Are there exceptional cases/anomalies that are really far out from everybody else, spikes

in the scores? And of course are there errors in the data? Are there mistakes in coding,

did people forget to answer questions? Are there impossible combinations? And these kinds

of things are easiest to see with a visualization that really kind of puts it there in front

of you.

And so in sum, I can say this about graphical exploration of data. It's a critical

first step, it's basically where you always want to start. And you want to use the quick

and easy methods, again. Bar charts, scatter plots are really easy to make and they're

very easy to understand. And once you're done with the graphical exploration, then you can

go to the second step, which is exploring the data through numbers. The next step in

"Statistics and Exploring Data" is exploratory statistics or numerical exploration of data.

I like to think of this, as go in order.

First, you do visualization, then you do the numerical

part. And a couple of things to remember here. #1, you are still exploring the data. You're

not modeling yet, but you are doing a quantitative exploration. This might be an opportunity

to get empirical estimates, that is of population parameters as opposed to theoretically based

ones. It's a good time to manipulate the data and explore the effect of manipulating the

data, looking at subgroups, looking at transforming variables.

Also, it's an opportunity to check

the sensitivity of your results. Do you get the same general results if you test under

different circumstances. So we are going to talk about things like Robust Statistics,

resampling data and transforming data. So, we'll start with Robust Statistics. This by

the way is Hercules, a Robust mythical character. And the idea with robust statistics is that

they are stable, is that even when the data varies in unpredictable ways you still get

the same general impression. This is a class of statistics, it's an entire category, that's

less affected by outliers, and skewness, kurtosis and other abnormalities in the data.

So let's

take a quick look. This is a very skewed distribution that I created. The median, which is the dark

line in the box, is right around one. And I am going to look at two different kinds

of robust statistics, The Trimmed Mean and the Winsorized Mean. With the Trimmed mean,

you take a certain percentage of data from the top and the bottom and you just throw

it away and compute for the rest. With the Winsorized, you take those and you move those

scores into the highest non-outlier score. Now the 0% is exactly the same as the regular

mean and here it's 1.24, but as we trim off or move in 5%, the mean shifts a little bit.

Then at 10% it comes in a little bit more. At 25%, now we are throwing away 50% of our data.

25% on the top and 25% on the bottom.
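
The trimming and Winsorizing just described can be sketched in a few lines of Python. This is only an illustration: the skewed scores are hypothetical (not the distribution from the video), and the function names `trimmed_mean` and `winsorized_mean` are my own, not from any particular library.

```python
def trimmed_mean(scores, proportion):
    """Throw away `proportion` of the scores from each end, then average the rest."""
    data = sorted(scores)
    k = int(len(data) * proportion)  # number of scores cut from each tail
    if 2 * k >= len(data):
        raise ValueError("proportion must leave at least one score")
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


def winsorized_mean(scores, proportion):
    """Move the extreme scores in to the nearest kept score, then average everything."""
    data = sorted(scores)
    k = int(len(data) * proportion)
    if 2 * k >= len(data):
        raise ValueError("proportion must leave at least one score")
    if k > 0:
        data[:k] = [data[k]] * k        # pull the low tail up
        data[-k:] = [data[-k - 1]] * k  # pull the high tail down
    return sum(data) / len(data)


# Hypothetical positively skewed scores with one large outlier.
skewed = [0.2, 0.5, 0.8, 0.9, 1.0, 1.1, 1.3, 1.6, 2.4, 9.0]

for p in (0.0, 0.1, 0.25):
    print(p, round(trimmed_mean(skewed, p), 2), round(winsorized_mean(skewed, p), 2))
```

At 0% both functions give the ordinary mean, and as the proportion grows both drift in toward the middle of the data.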

And we get a trimmed mean of 1.03 and a winsorized

of 1.07. When we throw away 50% or we trim 50%, that actually means we are leaving just

the median, only the middle scores left. Then we get 1.01. What's interesting is how close

we get to that, even when we have 50% of the data left, and so that's an interesting example

of how you can use robust statistics to explore data, even when you have things like strong

skewness. Next is the principle of resampling. And that's like pulling marbles repeatedly

from the jar, counting the colors, putting them back in and trying again. That's an empirical

estimate of sampling variability. So, sometimes you get 20% red marbles, sometimes you get

30, sometimes you get 22 and so on.
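
If you want to see that marble idea in code, here's a rough sketch. The jar contents (25 red out of 100), the handful size of 20, and the fixed seed are all assumptions I'm making for the example.

```python
import random

random.seed(1)  # fixed seed, only so the sketch is repeatable

# Hypothetical jar: 25 red marbles out of 100.
jar = ["red"] * 25 + ["other"] * 75


def percent_red(handful):
    """Percentage of red marbles in one handful."""
    return 100 * handful.count("red") / len(handful)


# Pull 20 marbles, count the red ones, put them all back, and try again.
estimates = [percent_red(random.sample(jar, k=20)) for _ in range(1000)]

print("lowest estimate :", min(estimates))
print("highest estimate:", max(estimates))
print("average estimate:", sum(estimates) / len(estimates))
```

The spread between the lowest and highest estimates is the sampling variability; the average across many pulls settles near the true percentage in the jar.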

There are several versions for this, they go by

the names jackknife, bootstrap, and permutation. And the basic principle of resampling is also

key to the process of cross-validation, I'll have more to say about validation later. And

then finally there's transforming variables. Here's our caterpillars in the process of

transforming into butterflies. But the idea here, is that you take a difficult data set

and then you do what's called a smooth function.

There's no jumps in it, and something that

allows you to preserve the order and work on the full dataset. So you can fix skewed

data, and in a scatter plot you might have a curved line, you can fix that. And probably

the best way to look at this is with something called Tukey's ladder of powers.

I mentioned before John Tukey, the father of exploratory data analysis. He talked a

lot about data transformations. This is his ladder, starting at the bottom with -1

over x squared, up to the top with x cubed. Here's how it works, this distribution over here is a

symmetrical normally distributed variable, and as you start to move in one direction

and you apply the transformation, take the square root you see how it moves the distribution

over to one end. Then the logarithm, then you get to the end then you get to this minus

1 over the square of the score. And that pushes it way way, way over.

If you go the other

direction, for instance you square the score, it pushes it down in the one direction and

then you cube it and then you see how it can move it around in ways that allow you to,

you can actually undo the skewness to get back to a more centrally distributed distribution.
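
Here's a small sketch of the ladder in Python, using only the rungs named above. The scores are hypothetical and positively skewed, the `skewness` helper is a simple moment-based measure I'm assuming for illustration, and all of this only works for positive scores.

```python
import math


def skewness(values):
    """Moment-based skewness: positive means a long tail on the high end."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Rungs of the ladder from bottom to top (only the ones named in the talk).
LADDER = [
    ("-1/x^2", lambda x: -1 / x ** 2),
    ("log(x)", math.log),
    ("sqrt(x)", math.sqrt),
    ("x", lambda x: x),
    ("x^2", lambda x: x ** 2),
    ("x^3", lambda x: x ** 3),
]

# Hypothetical positively skewed scores (each one double the last).
scores = [1, 2, 4, 8, 16, 32, 64]

results = {}
for name, transform in LADDER:
    transformed = [transform(x) for x in scores]
    results[name] = skewness(transformed)
    print(f"{name:>7}: skewness = {results[name]:+.2f}")
```

Going down the ladder from x pulls the long upper tail in, going up stretches it out, and every rung keeps the scores in the same order.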

And so these are some of the approaches that you can use in the numerical exploration

of data. In sum, let's say this: statistical or numerical exploration allows you to get

multiple perspectives on your data. It also allows you to check the stability, see how

it works with outliers, and skewness and mixed distributions and so on.

And perhaps most

important it sets the stage for the statistical modelling of your data. As a final step of

"Statistics and Exploring Data", I'm going to talk about something that's not usually

exploring data but it is basic descriptive statistics. I like to think of it this way.

You've got some data, and you are trying to tell a story. More specifically, you're trying

to tell your data's story. And with descriptive statistics, you can think of it as trying

to use a little data to stand in for a lot of data. Using a few numbers to stand in for

a large collection of numbers. And this is consistent with the advice we get from good

ole Henry David Thoreau, who told us Simplify, Simplify.

If you can tell your story with

more carefully chosen and more informative data, go for it. So there's a few different

procedures for doing this. #1, you'll want to describe the center of your distribution

of data, that is if you're going to choose a single number, use that. # 2, if you can

give a second number give something about the spread or the dispersion of the variability.

And #3, give something about the shape of the distribution. Let me say more about each

of these in turn. First, let's talk about center. We have the center of our rings here.

Now there are a few very common measures of center or location or central tendency of

a distribution. There's the mode, the median and there's the mean. Now, there are many,

many others but those are the ones that are going to get you most of the way.

Let's talk

about the mode first. Now, I'm going to create a little dataset here on a scale from 1 to

11, and I'm going to put individual scores. There's a one, and another one, and another

one and another one. Then we have a two, two, then we have a score way over at 9 and another

score over at 11. So we have 8 scores, and this is the distribution. This is actually

a histogram of the dataset. The mode is the most commonly occurring score or the most

frequent score. Well, if you look at how tall each of these go, we have more ones than anything

else, and so one is the mode. Because it occurs 4 times and nothing else comes close to that.

The median is a little different. The median is looking for the score that is at the center

if you split it into two equal groups.

We have 8 scores, so we have to get one group

of 4, that's down here, and the other group of four, this really big one because it's

way out and the median is going to be the place on the number line that splits those

into two groups. That's going to be right here at one and a half. Now the mean is going

to be a little more complicated, even though people understand means in general. It's the

first one here that actually has a formula, where M for the mean is equal to the sum of

X (that's our scores on the variable), divided by N (the number of scores). You can also

write it out with Greek notation if you want, like this where that's sigma – a capital sigma

is the summation sign, sum of X divided by N.

And with our little dataset, that works

out to this: one plus one plus one plus one plus two plus two plus nine plus eleven. Add

those all up and divide by 8, because that's how many scores there are. Well that reduces

to 28 divided by 8, which is equal to 3.5. If you go back to our little chart here, 3.5

is right over here. You'll notice there aren't any scores really exactly right there. That's

because the mean tends to get very distorted by its outliers, it follows the extreme scores.
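
The same three measures of center can be checked in Python with the little dataset from above. This is just a sketch using the standard library's `statistics` module.

```python
import statistics

scores = [1, 1, 1, 1, 2, 2, 9, 11]

mode = statistics.mode(scores)      # the most frequent score
median = statistics.median(scores)  # halfway between the 4th and 5th sorted scores
mean = sum(scores) / len(scores)    # sum of X divided by N

print(mode, median, mean)  # 1 1.5 3.5
```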

But a really nice, I say it's more than just a visual analogy, is that if this number line were

a seesaw, then the mean is exactly where the balance point or the fulcrum would be

for these to be equal. People understand that. If somebody weighs more they've got to sit in

closer to balance someone who weighs less, who has to sit further out, and that's how the mean

works. Now, let me give a bit of the pros and cons of each of these. Mode is easy to

do, you just count how common it is.

On the other hand, it may not be close to what appears

to be the center of the data. The Median it splits the data into two same size groups,

the same number of scores in each and that's pretty easy to deal with but unfortunately,

it's pretty hard to use that information in any statistics after that. And finally the

mean, of these three it's the least intuitive, it's the most affected by outliers and skewness

and that really may strike against it, but it is the most useful statistically and so

it's the one that gets used most often.

Next, there's the issue of spread, spread your tail

feathers. And we have a few measures here that are pretty common also. There's the range,

there are percentiles and interquartile range and there's variance and standard deviation.

I'll talk about each of those. First the Range. The Range is simply the maximum score minus

the minimum score, and in our case that's 11 minus 1, which is equal to 10, so we have

a range of 10. I can show you that on our chart. It's just that line on the bottom from

the 11 down to the one. That's a range of 10. The interquartile range which is actually

usually referred to simply as the IQR is the distance between the Q3; which is the third

quartile score and Q1; which is the first quartile score. If you're not familiar with

quartiles, it's the same as the 75th percentile score and the 25th percentile score.

Really

what it is, is you're going to throw away some of the data. So let's go

to our distribution here. First thing we are going to do, we are going to throw away the

two highest scores, there they are, they're greyed out now, and then we are going to throw

away two of the lowest scores, they're out there. Then we are going to get the range

for the remaining ones.

Now, this is complicated by the fact that I have this big gap between

2 and 9, and different methods of calculating quartiles do something with that gap. So if

you use a spreadsheet it's actually going to do an interpolation process and it will

give you a value of 3.75, I believe. And then down to one for the first quartile, so not

so intuitive with this graph but that is how it usually works. If you want to write

it out, you can do it like this. The interquartile range is equal to Q3 minus Q1, and in our

particular case that's 3.75 minus 1. And that of course is equal to just 2.75 and there

you have it. Now our final measure of spread or variability or dispersion, is two related

measures, the variance and the standard deviation.

These are a little harder to explain and a little

harder to show. But the variance, which is at least the easiest formula, is this: the

variance is equal to that's the sum, the capital sigma that's the sum, X minus M; that's how

far each score is from the mean and then you take that deviation there and you square it,

you add up all the squared deviations, and then you divide by the number of scores.

So the variance is,

the average squared deviation from the mean. I'll try to show you that graphically. So

here's our dataset and there's our mean right there at 3 and a half. Let's go to one of

these twos. We have a deviation there of 1.5 and if we make a square, that's 1.5 points

on each side, well there it is. We can do a similar square for the other score too.

If we are going down to one, then it's going to be 2.5 squared and it's going to be that

much bigger, and we can draw one of these squares for each one of our 8 points.

The

squares for the scores at 9 and 11 are going to be huge and go off the page, so I'm not

going to show them. But once you have all those squares you add up the area, divide by

the number of scores, and you get the variance. So, this is the formula for the variance, but now let me show the

standard deviation which is also a very common measure. It's closely related to this, specifically

it's just the square root of the variance. Now, there's a catch here. The formulas for

the variance and the standard deviation are slightly different for populations and samples

in that they use different denominators. But they give similar answers, not identical but

similar if the sample is reasonably large, say over 30 or 50, then it's really going

to be just a negligible difference. So let's do a little pro and con of these three things.
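
Before the pros and cons, here's a sketch of all three measures of spread on the same little dataset, again with Python's `statistics` module. I'm assuming the "inclusive" quantile method because it interpolates the way the spreadsheet example does; other methods handle the gap between 2 and 9 differently.

```python
import statistics

scores = [1, 1, 1, 1, 2, 2, 9, 11]

# Range: maximum minus minimum.
score_range = max(scores) - min(scores)

# Quartiles with interpolation; method="inclusive" is an assumption chosen to
# mirror the spreadsheet behavior described above.
q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Population formulas divide by N; sample formulas divide by N - 1.
pop_variance = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print("range:", score_range)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("population variance and SD:", pop_variance, round(pop_sd, 2))
print("sample variance and SD:", round(sample_variance, 2), round(sample_sd, 2))
```

With only 8 scores the population and sample versions differ noticeably, which is the small-sample side of the catch about denominators.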

First, the Range. It's very easy to do, it only uses two numbers the high and the low,

but it's determined entirely by those two numbers.

And if they're outliers, then you've

got really a bad situation. The Interquartile Range the IQR, is really good for skewed data

and that's because it ignores extremes on either end, so that's nice. And the variance

and the standard deviation while they are the least intuitive and they are the most

affected by outliers, they are also generally the most useful because they feed into so

many other procedures that are used in data science. Finally, let's talk a little bit

about the shape of the distribution. You can have symmetrical or skewed distributions, unimodal,

uniform or u-shaped. You can have outliers, there's a lot of variations. Let me show you

a few of them. First off is a symmetrical distribution, pretty easy. They're the same

on the left and on the right. And this little pyramid shape is an example of a symmetrical

distribution. There are also skewed distributions, where most of the scores are on one end and

they taper off.

This here is a positively skewed distribution where most of the scores

are at the low end and the outliers are on the high end. This is unimodal, our same pyramid

shape. Unimodal means it has one mode, really kind of one hump in the data. That's contrasted

for instance to bimodal where you have two modes, and that usually happens when you have

two distributions that got mixed together.

There is also uniform distribution where every

response is equally common, there's u-shaped distributions where people tend to pile up

at one end or the other and a big dip in the middle. And so there's a lot of different

variations, and you want to get those, the shape of the distribution to help you understand

and put the numerical summaries like the mean and like the standard deviation and put those

into context. In sum, we can say this: when you use descriptive statistics, that allows

you to be concise with your data, tell the story and tell it succinctly. You want to

focus on things like the center of the data, the spread of the data, the shape of the data.

And above all, watch out for anomalies, because they can exercise really undue influence on

your interpretations but this will help you better understand your data and prepare you

for the steps to follow.

As we discuss "Statistics in Data Science", one of the really big topics

is going to be Inference. And I'll begin that with just a general discussion of inferential

statistics. But, I'd like to begin unusually with a joke, you may have seen this before

it says "There are two kinds of people in the world.

1) Those who can extrapolate from

incomplete data and, the end". Of course, because the other group is the people who

can't. But let's talk about extrapolating from incomplete data or inferring from incomplete

data. First thing you need to know is the difference between populations and samples.

A population represents all of the data, or every possible case in your group of interest.

It might be everybody who's a commercial pilot, it might be whatever. But it represents everybody

in that or every case in that group that you're interested in. And the thing with the population

is, it just is what it is. It has its values, it has its mean and standard deviation and

you are trying to figure out what those are, because you generally use those in doing your

analyses. On the other hand, samples instead of being all of the data are just some of

the data. And the trick is they are sampled with error. You sample one group and you calculate

the mean. It's not going to be the same if you do it the second time, and it's that variability

that's in sampling that makes Inference a little tricky.

Now, also in inference there

are two very general approaches. There's testing which is short for hypothesis testing and

maybe you've had some experience with this. This is where you assume a null hypothesis

of no effect is true. You get your data and you calculate the probability of getting the

sample data that you have if the null hypothesis is true. And if that value is small, usually

less than 5%, then you reject the null hypothesis, which says really nothing's happening, and you

infer that there is a difference in the population. The other most common version is Estimation.

Which for instance is characterizing confidence intervals. That's not the only version of

Estimation but it's the most common. And this is where you sample data to estimate a population

parameter value directly, so you use the sample mean to try to infer what the population mean

is.

You have to choose a confidence level, you have to calculate your values and you

get high and low bounds for your estimate that work with a certain level of confidence. Now,

what makes both of these tricky is the basic concept of sampling error. I have a colleague

who demonstrates this with colored M&M's, what percentage are red, and you get them

out of the bags and you count. Now, let's talk about this, a population of numbers.

I'm going to give you just a hypothetical population of the numbers 1 through 10. And

what I am going to do, is I am going to sample from those numbers randomly, with replacement.

That means I pull a number out, it might be a one and I put it back, I might get the one

again. So I'm going to sample with replacement, which actually may sound a little bit weird,

but it's really helpful for the mathematics behind inference. And here are the samples

that I got, I actually did this with software.
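
A draw like that might look something like this in Python. The seed and the sample size of five are my assumptions for the sketch, so the numbers it prints will not match the samples from the video.

```python
import random

random.seed(0)  # assumed seed, only so the sketch is repeatable

population = list(range(1, 11))  # the numbers 1 through 10

# random.choices draws with replacement, so a number can repeat within a sample.
samples = [random.choices(population, k=5) for _ in range(4)]

for sample in samples:
    print(sample, "mean =", sum(sample) / len(sample))
```

Every sample comes from the exact same population, yet the sample means move around from one draw to the next.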

I got a 3, 1, 5, and 7. Interestingly, that

is almost all odd numbers, almost. My second sample is 4, 4, 3, 6 and 10. So you can see

I got the 4 twice. And I didn't get the 1, the 2, the 5, 7, or 8 or 9. The third sample

I got three 1's! And a 10 and a 9, so we are way at the ends there. And then my fourth

sample, I got a 3, 9, 2, 6, 5. All of these were drawn at random from the exact same population,

but you see that the samples are very different. That's the sampling variability or the sampling

error. And that's what makes inference a little trickier. And let's just say again, why the

sampling variability, why it matters. It's because inferential methods like testing and

like estimation try to see past the random sampling variation to get a clear picture

on the underlying population. So in sum, let's say this about Inferential Statistics. You

sample your data from the larger populations, and as you try to interpret it, you have to

adjust for error and there's a few different ways of doing that.

And the most common approaches

are testing or hypothesis testing and estimation of parameter values. The next step in our

discussion of "Statistics and Inference" is Hypothesis Testing. A very common procedure

in some fields of research. I like to think of it as put your money where your mouth is

and test your theory. Here's the Wright brothers out testing their plane. Now the basic idea

behind hypothesis testing is this, and you start out with a question. You start out with

something like this: What is the probability of X occurring by chance, if randomness or

meaningless sampling variation is the only explanation? Well, the response is this, if

the probability of that data arising by chance when nothing's happening is low, then you

reject randomness as a likely explanation. Okay, there's a few things I can say about

this. #1, it's really common in scientific research, say for instance in the social sciences,

it's used all the time. #2, this kind of approach can be really helpful in medical diagnostics,

where you're trying to make a yes/no decision; does a person have a particular disease.

And

3, really anytime you're trying to make a go/no go decision, which might be made for

instance with a purchasing decision for a school district or implementing a particular

law, You base it on the data and you have to make a yes/no. Hypothesis testing might

be helpful in those situations. Now, you have to have hypotheses to do hypothesis testing.

You start with H0, which is shorthand for the null hypothesis. And what that is in larger,

what that is in lengthier terms is that there is no systematic effect between groups, there's

no effect between variables and random sampling error is the only explanation for any observed

differences you see. And then contrast that with HA, which is the alternative hypothesis.

And this really just says there is a systematic effect, that there is in fact a correlation

between variables, that there is in fact a difference between two groups, that this variable

does in fact predict the other one.

Let's take a look at the simplest version of this

statistically speaking. Now, what I have here is a null distribution. This is a bell curve,

it's actually the standard normal distribution. Which shows z-scores in relative frequency,

and what you do with this is you mark off regions of rejection. And so I've actually

shaded off the highest 2.5% of the distribution and the lowest 2.5%.

What's funny about this

is, is that even though I draw it +/- 3, it looks like 0. It's actually infinite and asymptotic.

But, that's the highest and lowest 2.5% collectively leaves 95% in the middle. Now, the idea is

then that you gather your data, you calculate a score for your data and you see where it

falls in this distribution. And I like to think of that as you have to go down one path

to the other, you have to make a decision. And you have to decide whether to retain

your null hypothesis; maybe it is random, or reject it and decide no I don't think it's

random.
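
That decision rule can be sketched like this in Python, assuming your test statistic is a z-score and you are using the two-tailed 5% regions of rejection shown on the curve. The function name `decide` is just something I made up for the example.

```python
from statistics import NormalDist

null = NormalDist(mu=0, sigma=1)  # the standard normal null distribution
alpha = 0.05                      # 2.5% in each tail

# z-score that cuts off the highest 2.5% (the lowest 2.5% is its negative).
critical = null.inv_cdf(1 - alpha / 2)


def decide(z):
    """Return the decision and the two-tailed p-value for an observed z-score."""
    p_value = 2 * (1 - null.cdf(abs(z)))
    decision = "reject" if abs(z) > critical else "retain"
    return decision, p_value


print(round(critical, 2))  # 1.96
print(decide(2.5))
print(decide(1.0))
```

A z-score of 2.5 lands in the shaded region, so the null is rejected; a z-score of 1.0 lands in the middle 95%, so the null is retained.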

The trick is, things can go wrong. You can get a false positive, and this is

when the sample shows some kind of statistical effect, but it's really randomness. And so

for instance, this scatterplot I have here, you can see a little downhill association

here but this is in fact drawn from data that has a true correlation of zero. And I just

kind of randomly sampled from it, it took about 20 rounds, but it looks negative but

really there's nothing happening. The trick about false positives is; that's conditional

on rejecting the null. The only way to get a false positive is if you actually conclude

that there's a positive result. It goes by the highly descriptive name of a Type I error,

but you get to pick a value for it, and .05 or a 5% risk if you reject the null hypothesis,

that's the most common value. Then there's a false negative. This is when the data looks

random, but in fact, it's systematic or there's a relationship. So for instance, this scatterplot

it looks like there's pretty much a zero relationship, but in fact this came from two variables that

were correlated at .25, that's a pretty strong association.

Again, I randomly sampled from

the data until I got a set that happened to look pretty flat. And a false negative is

conditional on not rejecting the null. You can only get a false negative if you get a

negative, you say there's nothing there. It's also called a Type II error and this is a

value that you have to calculate based on several elements of your testing framework,

so it's something to be thoughtful of. Now, I do have to mention one thing, big security

notice, but wait. The problem with Hypothesis Testing; there's a few. #1, it's really easy

to misinterpret it. A lot of people say, well if you get a statistically significant result,

it means that it's something big and meaningful. And that's not true because it's confounded

with sample size and a lot of other things that don't really matter. Also, a lot of other

people take exception with the assumption of a null effect or even a nil effect, that

there's zero difference at all. And that can be, in certain situations can be an absurd

claim, so you've got to watch out for that.

There's also bias from the use of a cutoff.

Anytime you have a cut off, you're going to have problems where you have cases that would

have been slightly higher, slightly lower. It would have switched on the dichotomous

outcome, so that is a problem. And then a lot of people say, it just answers the wrong

question, because "What it's telling you is what's the probability of getting this data

at random?" That's not what most people care about. They want it the other way, which is

why I mentioned previously Bayes theorem and I'll say more about that later. That being

said, Hypothesis Testing is still very deeply ingrained, very useful in a lot of questions

and has gotten us really far in a lot of domains. So in sum, let me say this. Hypothesis Testing

is very common for yes/no outcomes and is the default in many fields. And I argue it

is still useful and informative despite many of the well substantiated critiques. We'll

continue in "Statistics and Inference" by discussing Estimation. Now as opposed to Hypothesis

Testing, Estimation is designed to actually give you a number, give you a value.

Not just

a yes/no, go/no go, but give you an estimate for a parameter that you're trying to get.

I like to think of it sort of as a new angle, looking at something from a different way.

And the most common approach to this is Confidence Intervals. Now, the important thing to remember

is that this is still an Inferential procedure. You're still using sample data and trying

to make conclusions about a larger group or population. The difference here, is instead

of coming up with a yes/no, you'd instead focus on likely values for the population

value.

Most versions of Estimation are closely related to Hypothesis Testing, sometimes seen

as the flip side of the coin. And we'll see how that works in later videos. Now, I like

to think of this as an ability to estimate any sample statistic and there's a few different

versions. We have Parametric versions of Estimation and Bootstrap versions, that's why I got the

boots here. And that's where you just kind of randomly sample from the data, in an effort

to get an idea of the variability.
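A minimal sketch of the bootstrap idea, assuming a percentile interval for the mean. The sample values, the number of resamples, and the function name `bootstrap_interval` are all made up for illustration; other bootstrap variants exist.

```python
import random
from statistics import mean

def bootstrap_interval(data, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, resampling with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    low = estimates[int(tail * n_resamples)]
    high = estimates[int((1 - tail) * n_resamples) - 1]
    return low, high

sample = [52, 61, 48, 57, 55, 63, 50, 58, 54, 59]  # made-up values
low, high = bootstrap_interval(sample)
print(low, high)
```

The spread of the resampled means stands in for the variability you would otherwise get from a parametric formula.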

You can also have central versus noncentral Confidence

Intervals in the Estimation, but we are not going to deal with those. Now, there are three

general steps to this. First, you need to choose a confidence level. Anywhere from say,

well you can't have a zero, it has to be more than zero and it can't be 100%. Choose something

in between, 95% is the most common. And what it does, is it gives you a range a high and

a low. And the higher your level of confidence the more confident you want to be, the wider

the range is going to be between your high and your low estimates. Now, there's a fundamental

trade off in what's happening here, and the trade off is between accuracy, which means you're

on target or more specifically that your interval contains the true population value. And the

idea is that leads you to the correct Inference. There's a tradeoff between accuracy and what's

called Precision in this context.

And precision means a narrow interval, a small range

of likely values. And what's important to emphasize is this is independent of accuracy,

you can have one without the other! Or neither or both. In fact, let me show you how this

works. What I have here is a little hypothetical situation, I've got a variable that goes from

10 to 90, and I've drawn a thick black line at 50. If you think of this in terms of percentages

and political polls, it makes a very big difference if you're on the left or the right of 50%.

And then I've drawn a dotted vertical line at 55 to say that that's our theoretical true

population value.

And what I have here is a distribution that shows possible values

based on our sample data. And what you get here is it's not accurate, because it's centered

on the wrong thing. It's actually centered on 45 as opposed to 55. And it's not precise,

because it's spread way out from maybe 10 to almost 80. So, in this situation the data

is no help really at all. Now, here's another one. This is accurate because it's centered

on the true value. That's nice, but it's still really spread out and you see that about 40%

of the values are going to be on the other side of 50%; might lead you to reach the wrong

conclusion.

That's a problem! Now, here's the nightmare situation. This is when you

have a very very precise estimate, but it's not accurate; it's wrong. And this leads you

to a very false sense of security and understanding of what's going on and you're going to totally

blow it all the time. The ideal situation is this: you have an accurate estimate where

the distribution of sample values is really close to the true population value and it's

precise, it's really tightly knit and you can see that about 95% of it is on the correct

side of 50 and that's good.
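The four situations can be roughed out with normal curves. The centers of 45 and 55 come from the description above; the spreads of 20 and 3 are assumed values picked so the shares land near the 40% and 95% figures mentioned, not numbers read off the charts.

```python
from statistics import NormalDist

THRESHOLD = 50  # the thick line; the true value of 55 lies above it

# (center, spread) pairs; the spreads are assumptions for illustration.
scenarios = {
    "inaccurate, imprecise": NormalDist(45, 20),
    "accurate, imprecise": NormalDist(55, 20),
    "inaccurate, precise": NormalDist(45, 3),
    "accurate, precise": NormalDist(55, 3),
}

def share_on_correct_side(dist):
    """Fraction of the distribution above the threshold, where the true value is."""
    return 1 - dist.cdf(THRESHOLD)

for name, dist in scenarios.items():
    print(f"{name}: {share_on_correct_side(dist):.0%} above {THRESHOLD}")
```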

If you want to see all four of them here at once, we have

the precise two on the bottom, the imprecise ones on the top, the accurate ones on the

right, the inaccurate ones on the left. And so that's a way of comparing it. But, no matter

what you do, you have to interpret the confidence interval. Now, the statistically accurate

way that has very little interpretation is this: you would say the 95% confidence interval

for the mean is 5.8 to 7.2. Okay, so that's just kind of taking the output from your computer

and putting it into sentence form.

The Colloquial Interpretation of this goes like this: there

is a 95% chance that the population mean is between 5.8 and 7.2. Well, in most statistical

procedures, specifically frequentist as opposed to Bayesian, you can't do that. That implies

the population mean shifts, that's not usually how people see it. Instead, a better interpretation

is this; 95% of confidence intervals for randomly selected samples will contain the population

mean. Now, I can show you this really easily, with a little demonstration. This is where

I randomly generated data from a population with a mean of 55 and I got 20 different samples.

And I got the Confidence Interval from each sample and I charted the high and the low.

And the question is, did it include the true population value.
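A demonstration along these lines can be sketched as follows. The population mean of 55 and the 20 samples come from the description; the population standard deviation of 10, the sample size of 30, and the use of a z-based interval with a known standard deviation are assumptions made to keep the example short.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def coverage(n_samples=20, n=30, mu=55, sigma=10, level=0.95, seed=1):
    """Count how many z-based confidence intervals contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(n)  # sigma is treated as known here
    hits = 0
    for _ in range(n_samples):
        m = mean(rng.gauss(mu, sigma) for _ in range(n))
        hits += (m - half_width) <= mu <= (m + half_width)
    return hits

print(coverage())                            # one run of 20; varies with the seed
print(coverage(n_samples=10_000) / 10_000)   # long-run share, close to 0.95
```

Any single run of 20 can miss zero, one, or several times; the 95% figure describes the long-run share of intervals that capture the population mean.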

And you can see of these

20, 19 included it, some of them barely made it. If you look at sample #1 on the far left;

barely made it. Sample #8, it doesn't look like it made it, sample 20 on the far right,

barely made it on the other end. Only one missed it completely, that's sample #2, which

is shown in red on the left. Now, it's not always just one out of twenty, I actually

had to run this simulation about 8 times, because it gave me either zero or 3, or 1

or two, and I had to run it until I got exactly what I was looking for here. But this is

what you would expect on average. So, let's say a few things about this. There are some things

that affect the width of a Confidence Interval. The first is the confidence level, or CL.

Higher confidence levels create wider intervals. The more certain you have to be, you're going

to give a bigger range to cover your bases. Second, the Standard Deviation: larger standard

deviations create wider intervals. If the thing that you are studying is inherently

really variable, then of course your estimate of the range is going to be more variable

as well.

And then finally there is the n or the sample size. This one goes the other way.

Larger sample sizes create narrower intervals. The more observations you have, the more precise

and the more reliable things tend to be. I can show you each of these things graphically.
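The three influences on interval width can also be checked numerically. This is a sketch that assumes a z-based interval for a mean, with width 2 · z · sd / √n; the baseline values (sd of 4, n of 32) are arbitrary choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(level, sd, n):
    """Full width of a z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sd / sqrt(n)

for level in (0.50, 0.95, 0.999):
    print("level", level, round(ci_width(level, sd=4, n=32), 2))
for sd in (1, 4, 16):
    print("sd", sd, round(ci_width(0.95, sd=sd, n=32), 2))
for n in (2, 32, 512):
    print("n", n, round(ci_width(0.95, sd=4, n=n), 2))
```

Width grows with the confidence level and the standard deviation, and shrinks with the square root of the sample size.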

Here we have a bunch of Confidence Intervals, where I am simply changing the confidence

level from .50 at the low left side to .999 and as you can see, it gets much bigger as

we increase. Next one is Standard Deviation. As the sample standard deviation increases

from 1 to 16, you can see that the interval gets a lot bigger. And then we have sample

size going from just 2 up to 512; I'm doubling it at each point.

And you can see how the

interval gets more and more and more precise as we go through. And so, let's say this to

sum up our discussion of estimation. Confidence Intervals which are the most common version

of Estimation focus on the population parameter. And the variation in the data is explicitly

included in that Estimation. Also, you can argue that they are more informative, because

not only do they tell you whether the population value is likely, but they give you a sense

of the variability of the data itself, and that's one reason why people will argue that

confidence intervals should always be included in any statistical analysis. As we continue

our discussion on "Statistics and Data Science", we need to talk about some of the choices

you have to make, some of the tradeoffs and some of the effects that these things have.

We'll begin by talking about Estimators, that is different methods for estimating parameters.

I like to think of it as this, "What kind of measuring stick or standard are you going

to be using?" Now, we'll begin with the most common.

This is called OLS, which is actually

short for Ordinary Least Squares. This is a very common approach, it's used in a lot

of statistics and is based on what is called the sum of squared errors, and it's characterized

by an acronym called BLUE, which stands for Best Linear Unbiased Estimator. Let me show

you how that works. Let's take a scatterplot here of an association between two variables.

This is actually the speed of a car and the distance to stop from about the 1920s, I

think. We have a scatterplot and we can draw a straight regression line right through it.

Now, the line I've used is in fact the Best Linear Unbiased Estimate, but the way that

you can tell that is by getting what are called the Residuals. If you take each data point

and draw a perfectly vertical line up or down to the regression line, because the regression

line predicts what the value would be for that value on the X axis.

Those are the residuals.
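Here is a small sketch of residuals and the least-squares line for one predictor. The speed and distance values are made up, not the data in the chart, and the helper names `ols_line` and `sse` are hypothetical.

```python
from statistics import mean

def ols_line(xs, ys):
    """Slope and intercept that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

def sse(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up speed and stopping-distance values.
speed = [4, 7, 9, 12, 15, 18, 20, 24]
dist = [2, 10, 16, 24, 30, 52, 50, 90]

b, a = ols_line(speed, dist)
best = sse(speed, dist, b, a)
# Nudging the slope or the intercept never lowers the total.
print(best, sse(speed, dist, b + 0.1, a), sse(speed, dist, b, a - 1))
```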

Each of those individual, vertical lines is a Residual. You square those and you add them

up and this regression line, the gray angled line here will have the smallest sum of the

squared residuals of any possible straight line you can run through it. Now, another

approach is ML, which stands for Maximum Likelihood. And this is when you choose parameters that

make the observed data most likely. It sounds kind of weird, but I can demonstrate it, and

it's based on a kind of local search. It doesn't always find the best, I like to think of it

like the person here with a pair of binoculars, looking around them, trying hard to find something,

but you could theoretically miss something. Let me give a very simple example of how this

works. Let's assume that we're trying to find parameters that maximize the likelihood of

this dotted vertical line here at 55, and I've got three possibilities. I've got my

red distribution which is off to the left, blue which is a little more centered and green

which is far to the right.

And these are all identical, except they have different means,

and by changing the means, you see that the one that is highest where the dotted line

is the blue one. And so, if the only thing we are doing is changing the mean, and we

are looking at these three distributions, then the blue one is the one that has the

maximum likelihood for this particular parameter. On the other hand, we could give them all

the same mean right around 50, and vary their standard deviations instead and so they

spread out different amounts. In this case, the red distribution is highest at the dotted

vertical line and so it has the maximum value. Or if you want to, you can vary both the mean

and the standard deviations simultaneously. And here green gets the slight advantage.

Now this is really a caricature of the process because obviously you would just want to center

it on the 55 and be done with it. The question is when you have many variables in your dataset.

Then it's a very complex process of choosing values that can maximize the association between

all of them. But you get a feel for how it works with this.
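The caricature above can be written out directly: the likelihood of each candidate is the height of its curve at the observed value of 55. The means and standard deviations below are assumed numbers chosen to mimic the red, blue, and green curves, not values taken from the slides.

```python
from statistics import NormalDist

observed = 55

# Same spread, different means (assumed values).
by_mean = {
    "red": NormalDist(40, 10),
    "blue": NormalDist(52, 10),
    "green": NormalDist(75, 10),
}
# Same mean of 50, different spreads (assumed values).
by_sd = {
    "red": NormalDist(50, 5),
    "blue": NormalDist(50, 10),
    "green": NormalDist(50, 2),
}

def most_likely(candidates, x):
    """Name of the candidate whose density is highest at x."""
    return max(candidates, key=lambda name: candidates[name].pdf(x))

print(most_likely(by_mean, observed), most_likely(by_sd, observed))
```

Real maximum likelihood searches over a continuous range of parameter values rather than three fixed candidates, but the comparison being made is the same.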

The third approach which

is pretty common is MAP or map for Maximum A Posteriori. This is a Bayesian approach

to parameter estimation, and what it does is it adds the prior distribution and then it

goes through sort of an anchoring and adjusting process. What happens, by the way is stronger

prior estimates exert more influence on the estimate and that might mean for example larger

sample or more extreme values. And those have a greater influence on the posterior estimate

of the parameters. Now, what's interesting is that all three of these methods connect

with each other. Let me show you exactly how they connect. The ordinary least squares,

OLS, this is equivalent to maximum likelihood, when it has normally distributed error terms.

And maximum likelihood, ML is equivalent to Maximum A Posteriori or MAP, with a uniform

prior distribution.
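The link between ML and MAP can be sketched with a simple grid search for a mean. Everything here is assumed for illustration: the data, the fixed standard deviation of 10, the grid of candidate means, and the informative prior centered on 45.

```python
from math import log
from statistics import NormalDist

data = [51, 55, 58, 53, 60]                  # made-up observations
grid = [m / 10 for m in range(400, 701)]     # candidate means 40.0 to 70.0

def log_likelihood(mu, sd=10):
    """Log-likelihood of the data for a normal model with mean mu."""
    return sum(log(NormalDist(mu, sd).pdf(x)) for x in data)

def map_estimate(log_prior):
    """Grid value that maximizes log-likelihood plus log-prior."""
    return max(grid, key=lambda mu: log_likelihood(mu) + log_prior(mu))

def flat(mu):
    return 0.0                               # uniform prior over the grid

def informative(mu):
    return log(NormalDist(45, 3).pdf(mu))    # strong prior belief near 45

ml = max(grid, key=log_likelihood)
print(ml, map_estimate(flat), map_estimate(informative))
```

With the flat prior the MAP estimate lands on the ML estimate; the informative prior pulls the estimate away from the data and toward 45.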

If you want to put it another way, ordinary least squares or OLS is a special

case of Maximum Likelihood. And then maximum likelihood or ML, is a special case of Maximum

A Posteriori, and just in case you like it, we can put it into set notation. OLS is a

subset of ML is a subset of MAP, and so there are connections between these three methods

of estimating population parameters. Let me just sum it up briefly this way. The standards

that you use OLS, ML, MAP they affect your choices and they determine which parameters

best estimate what's happening in your data. Several methods exist and there's obviously

more than what I showed you right here, but many are closely related and under certain

circumstances they're all identical. And so it comes down to exactly what are your purposes

and what do you think is going to work best with the data that you have to give you the

insight that you need in your own project. The next step we want to consider in our "Statistics

and Data Science", are choices that we have to make.

This has to do with Measures of fit or

the correspondence between the data that we have and the model that you create. Now, turns

out there are a lot of different ways to measure this and one big question is how close is

close enough or how can you see the difference between the model and reality. Well, there's

a few really common approaches to this. The first one is what's called R². The longer

name for that is the coefficient of determination. There's a variation, adjusted

R², which takes into consideration the number of variables. Then there's minus 2LL, which

is based on the likelihood ratio and a couple of variations. The Akaike Information Criterion

or AIC and the Bayesian Information Criterion or BIC. Then there's also Chi-Squared, it's

actually a Greek χ, it looks like an x, but it's actually χ and it's chi-squared. And

so let's talk about each of these in turn. First off is R², this is the squared multiple

correlation or the coefficient of determination.

And what it does is it compares the variance

of Y, so if you have an outcome variable, it looks at the total variance of that and

compares it to the residuals on Y after you've made your prediction. The scores on R²

range from 0 to 1 and higher is better. The next is -2 Log-likelihood that's the likelihood

ratio or like I just said the -2 log likelihood. And what this does is compares the fit of

nested models, where one model uses a subset of the predictors in the larger one. This approach

is used a lot in logistic regression when you have a binary outcome. And in general,

smaller values are considered better fit. Now, as I mentioned there are some variations

of this. I like to think of them as variations of chocolate. For the -2 log likelihood, there's the

Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) and what

both of these do, they adjust for the number of predictors.
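The measures named so far can be computed side by side. This is a sketch under stated assumptions: a least-squares model with normally distributed errors, the common forms AIC = -2LL + 2p and BIC = -2LL + p·ln(n), and a parameter count p that includes the intercept and the error variance. The outcome values and predictions are made up.

```python
from math import log, pi
from statistics import mean

def fit_measures(y, y_hat, k):
    """Fit summaries for a least-squares model with k predictors and an intercept.

    Assumes normally distributed errors when computing the log-likelihood.
    """
    n = len(y)
    y_bar = mean(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    minus_2ll = n * (log(2 * pi) + log(sse / n) + 1)
    p = k + 2  # slopes + intercept + error variance
    return {
        "R2": r2,
        "adjusted R2": adj_r2,
        "-2LL": minus_2ll,
        "AIC": minus_2ll + 2 * p,
        "BIC": minus_2ll + p * log(n),
    }

y = [2, 10, 16, 24, 30, 52, 50, 90]                       # made-up outcomes
y_hat = [0.5, 11.0, 18.0, 28.5, 39.0, 49.5, 56.5, 70.5]   # made-up predictions
measures = fit_measures(y, y_hat, k=1)
print(measures)
```

Adjusted R² comes out below R², and AIC and BIC sit above -2LL, because each adds a penalty that grows with the number of parameters.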

Because obviously if you're going

to have a huge number of predictors, you're going to get a really good fit. But you're

probably going to have what is called overfitting, where your model is tailored too specifically to

to the data you currently have and that doesn't generalize well. These both attempt to reduce

the effect of overfitting. Then there's chi-squared again. It's actually a lower case Greek χ,

looks like an x and chi-squared is used for examining the deviations between two datasets.

Specifically between the observed dataset and the expected values or the model you create,

we expect this many frequencies in each category.
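The chi-squared comparison of observed and expected frequencies is short enough to write out. The counts below are made up, and the expected values assume a model of equal frequencies in each category.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 30, 30]   # made-up category counts
expected = [25, 25, 25, 25]   # model: equal frequencies in each category
print(chi_squared(observed, expected))
```

Larger values mean the observed counts sit further from what the model expects; a value of zero would mean a perfect match.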

Now, I'll just mention when I go into the

store there's a lot of other choices, but these are some of the most common standards,

particularly the R². And I just want to say, in sum, there are many different ways to assess

the fit or correspondence between a model and your data. And the choices affect the model,

you know especially are you getting penalized for throwing in too many variables relative

to your number of cases? Are you dealing with a quantitative or binary outcome? Those things

all matter, and so the most important thing as always, my standing advice is keep your

goals in mind and choose a method that seems to fit best with your analytical strategy

and the insight you're trying to get from your data. "Statistics and Data Science"

offers a lot of different choices. One of the most important is going to be feature

selection, or the choice of variables to include in your model. It's sort of like confronting

this enormous range of information and trying to choose what matters most.

Trying to get

the needle out of the haystack. The goal of feature selection is to select the best features

or variables and get rid of uninformative/noisy variables and simplify the statistical model

that you are creating because that helps avoid overfitting or getting a model that works

too well with the current data and works less well with other data. The major problem here

is Multicollinearity, a very long word. That has to do with the relationship between the

predictors and the model. I'm going to show it to you graphically here. Imagine here for

instance, we've got a big circle here to represent the variability in our outcome variable; we're

trying to predict it. And we've got a few predictors. So we've got Predictor # 1 over

here and you see it's got a lot of overlap, that's nice. Then we've got predictor #2 here,

it also has some overlap with the outcome, but it also overlaps with Predictor 1.

And

then finally down here, we've got Predictor 3, which overlaps with both of them. And the

problem arises from the overlap between the predictors and the outcome variable. Now, there's a few

ways of dealing with this, some of these are pretty common. So for instance, there's the

practice of looking at probability values and regression equations, there's standardized

coefficients and there's variations on sequential regression. There are also newer

procedures for dealing with the disentanglement of the association between the predictors.

There's something called Commonality analysis, there's Dominance Analysis, and there are

Relative Importance Weights. Of course there are many other choices in both the common

and the newer, but these are just a few that are worth taking a special look at. First,

is P values or probability values. This is the simplest method, because most statistical

packages will calculate probability values for each predictor and they will put little

asterisks next to it. And so what you're doing is you're looking at the p-values; the probabilities

for each predictor or more often the asterisks next to it, which sometimes give it the name

of Star Search.

You're just kind of cruising through a large output of data, just looking

for the stars or asterisks. This is fundamentally a problematic approach for a lot of reasons.

The problem here is you're looking at each one individually and it inflates false positives. Say you have

20 variables. Each is entered and tested with an alpha or a false positive rate of 5%. You end

up with about a 64% chance of at least one false positive in there. It's also distorted

by sample size, because with a large enough sample anything can become statistically significant.
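The inflation of false positives across 20 tests follows from a one-line calculation, assuming the tests are independent of one another (correlated predictors would change the exact figure).

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(round(chance_of_any_false_positive(20), 3))  # 0.642
```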

And so, relying on p-values can be a seriously problematic approach. A slightly better approach

is to use Betas or Standardized regression coefficients and this is where you put all

the variables on the same scale. So, usually they're centered on zero and then scaled either

from minus 1 to plus 1 or to a standard deviation of 1.

The trick is though, they're still in

the context of each other and you can't really separate them because those coefficients are

only valid when you take that group of predictors as a whole. So, one way to try and get around

that is to do what they call stepwise procedures. Where you look at the variables in sequence,

there's several versions of sequential regression that'll allow you to do that. You can put

the variables into groups or blocks and enter them in blocks and look at how the equation

changes overall. You can examine the change in fit in each step. The problem with a stepwise

procedure like this, is it dramatically increases the risk of overfitting which again is a bad

thing if you want to generalize your data. And so, to deal with this, there is a whole

collection of newer methods, a few of them include commonality analysis, which provides

separate estimates for the unique and shared contributions of each variable.

Well, that's

a neat statistical trick but the problem is, it just moves the problem of disentanglement

to the analyst, so you're really not better off than you were as far as I can tell. There's

dominance analysis, which compares every possible subset of Predictors. Again, sounds really

good, but you have the problem known as the combinatorial explosion. If you have 50 variables

that you could use, and there are some that have millions of variables, with 50 variables,

you have over 1 quadrillion possible combinations, you're not going to finish that in your lifetime.
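The count behind that figure is easy to verify: each of the 50 predictors is either in a subset or out of it. The fitting speed used for the time estimate is an assumption for illustration only.

```python
n_predictors = 50
n_subsets = 2 ** n_predictors        # each predictor is either in or out
print(f"{n_subsets:,}")              # 1,125,899,906,842,624

# Assumed speed for illustration: one million models fitted per second.
seconds = n_subsets / 1_000_000
print(round(seconds / (60 * 60 * 24 * 365.25), 1), "years")
```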

And it's also really hard to get things like standard errors and perform inferential statistics

with this kind of model. Then there's also something that's even more recent than these

others and that's called relative importance weights.

And what that does is creates a set

of orthogonal predictors, that is, uncorrelated with each other, basing them off of the originals

and then it predicts the scores and then it can predict the outcome without the multicollinearity

because these new predictors are uncorrelated. It then rescales the coefficients back to

the original variables, that's the back-transform. Then from that it assigns relative importance

or a percentage of explanatory power to each predictor variable. Now, despite this very

different approach, it tends to have results that resemble dominance analysis.

It's actually

really easy to do with a website, you just plug in your information and it does it for

you. And so that is yet another way of dealing with the problem of multicollinearity and trying

to disentangle the contribution of different variables. In sum, let's say this. What you're

trying to do here, is trying to choose the most useful variables to include into your

model. Make it simpler, be parsimonious. Also, reduce the noise and distractions in your

data. And in doing so, you're always going to have to confront the ever present problem

of multicollinearity, or the association between the predictors in your model with several

different ways of dealing with that. The next step in our discussion of "Statistics and

the Choices you have to Make", concerns common problems in modeling. And I like to think

of this as the situation where you're up against the rock and the hard place and this is where

the going gets very hard. Common problems include things like Non-Normality, Non-Linearity,

Multicollinearity and Missing Data. And I'll talk about each of these.

Let's begin with

Non-Normality. Most statistical procedures like to deal with nice symmetrical, unimodal

bell curves, they make life really easy. But sometimes you get really skewed distributions

or you get outliers. Skews and outliers, while they happen pretty often, they're a problem

because they distort measures; the mean, for example, gets thrown off tremendously when you have

outliers. And they throw off models because those assume the symmetry and the unimodal

nature of a normal distribution. Now, one way of dealing with this as I've mentioned

before is to try transforming the data, taking the logarithm, try something else. But another

problem may be that you have mixed distributions, if you have a bimodal distribution, maybe

what you really have here is two distributions that got mixed together and you may need to

disentangle them through exploring your data a little bit more. Next is Non-Linearity.

The gray line here is the regression line, we like to put straight lines through things

because it makes the description a lot easier. But sometimes the data is curved and here

you have a perfectly curved relationship, but a straight line doesn't work with

that.

Linearity is a very common assumption of many procedures especially regression.

To deal with this, you can try transforming one or both of the variables in the equation

and sometimes that manages to straighten out the relationship between the two of them.
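A small sketch of a transformation straightening a curve. The exponential relationship and the choice of a log transform are assumptions for illustration; which transformation helps depends on the shape of your data. The helper `r_squared` is a hypothetical name.

```python
from math import exp, log, sqrt
from statistics import mean

def r_squared(xs, ys):
    """Squared Pearson correlation: how well a straight line describes y from x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return (sxy / sqrt(sxx * syy)) ** 2

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [exp(0.8 * v) for v in x]                 # a perfectly curved relationship

raw_fit = r_squared(x, y)                     # straight line on the raw values
log_fit = r_squared(x, [log(v) for v in y])   # straight line after log(y)
print(round(raw_fit, 3), round(log_fit, 3))
```

The straight line describes the raw values only moderately well, while after the transformation the fit is essentially perfect.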

Also, using Polynomials. Things that specifically include curvature like squares and cubed values,

that can help as well. Then there's the issue of multicollinearity, which I've mentioned

previously. This is when you have correlated predictors, or rather the predictors themselves

are associated with each other. The problem is, this can distort the coefficients you

get in the overall model. Some procedures, it turns out are less affected by this than

others, but one overall way of dealing with this might be to simply try and use fewer variables.

If they're really correlated maybe you don't need all of them.

And there are empirical

ways to deal with this, but truthfully, it's perfectly legitimate to use your own domain

expertise and your own insight to the problem. To use your theory to choose among the variables

that would be the most informative. Part of the problem we have here, is something called

the Combinatorial Explosion. This is where combinations of variables or categories grow

too fast for analysis. Now, I've mentioned something about this before. If you have 4

variables and each variable has two categories, then you have 16 combinations, fine you can

try things 16 different ways. That's perfectly doable.
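The counts in this passage come from raising the number of categories to the power of the number of variables. A quick check, with made-up category labels:

```python
from itertools import product

# 4 variables with 2 categories each: small enough to list every combination.
combos = list(product(("a", "b"), repeat=4))
print(len(combos))        # 16

# 20 variables with 5 categories each: count only, listing is impractical.
print(f"{5 ** 20:,}")     # 95,367,431,640,625
```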

If you have 20 variables with five

categories; again that's not too unlikely, you have 95 trillion combinations, that's

a whole other ball game, even with your fast computer. A couple of ways of dealing with

this, #1 is with theory. Use your theory and your own understanding of the domain to choose

the variables or categories with the greatest potential to inform. You know what you're

dealing with, rely on that information. Second is, there are data driven approaches. You

can use something called a Markov chain Monte Carlo model to explore the range of possibilities

without having to examine each and every single one of your 95 trillion

combinations. Closely related to the combinatorial explosion is the curse of dimensionality.

This is when you have phenomena, you've got things that may only occur in higher dimensions

or variable sets.

Things that don't show up until you have these unusual combinations.

That may be true of a lot of how reality works, but the project of analysis is simplification.

And so you've got to try to do one or two different things. You can try to reduce. Mostly

that means reducing the dimensionality of your data. Reduce the number of dimensions

or variables before you analyze. You're actually trying to project the data onto a lower dimensional

space, the same way you try to get a shadow of a 3D object. There's a lot of different

ways to do that. There's also data driven methods. And the same method here, a Markov

chain Monte Carlo model, can be used to explore a wide range of possibilities.

Finally, there

is the problem of Missing Data and this is a big problem. Missing data tends to distort

analysis and creates bias if it's a particular group that's missing. And so when you're dealing

with this, what you have to do is actually check for patterns in missingness: you create

new variables that indicate whether or not a variable is missing and then you see if

that is associated with any of your other variables. If there are no strong patterns,

then you can impute missing values. You can put in the mean or the median, you can do

Regression Imputation, something called Multiple Imputation, a lot of different choices. And

those are all technical topics, which we will have to talk about in a more technically oriented

series.
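The steps just described can be sketched briefly: flag the missing cases, check whether missingness relates to another variable, and only then impute. The data and variable names are made up, and median imputation is just the simplest of the options mentioned.

```python
from statistics import mean, median

ages = [34, None, 29, 41, None, 38, 52]   # made-up values; None = missing
income = [48, 57, 40, 61, 52, 55, 70]     # a second, fully observed variable

# Step 1: indicator variable for missingness.
age_missing = [a is None for a in ages]

# Step 2: is missingness associated with another variable?
income_when_missing = mean(i for i, m in zip(income, age_missing) if m)
income_when_present = mean(i for i, m in zip(income, age_missing) if not m)
print(income_when_missing, income_when_present)

# Step 3 (only if no strong pattern shows up): simple median imputation.
fill = median(a for a in ages if a is not None)
ages_imputed = [fill if a is None else a for a in ages]
print(ages_imputed)
```

In this made-up data the two income averages are nearly the same, so nothing suggests that a particular group is the one missing.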

But for right now, in terms of the problems that can come up during modeling,

I can summarize it this way. #1, check your assumptions at every step. Make sure that

the data have the distribution that you need, check for the effects of outliers, check for

ambiguity and bias. See if you can interpret what you have and use your analysis, use data

driven methods but also your knowledge of the theory and the meaning of things in your

domain to inform your analysis and find ways of dealing with these problems. As we continue

our discussion of "Statistics and the Choices that are Made", one important consideration

is Model Validation. And the idea here is that as you are doing your analysis, are you

on target? More specifically, the model that you create through regression or whatever

you do, your model fits the sample beautifully, you've optimized it there. But, will it work

well with other data? Fundamentally, this is the question of Generalizability, also

sometimes called Scalability.

Because you are trying to apply in other situations, and

you don't want to get too specific or it won't work in other situations. Now, there are a

few general ways of dealing with this and trying to get some sort of generalizability.

#1 is Bayes; a Bayesian approach. Then there's Replication. Then there's something called

Holdout Validation, then there is Cross-Validation. I'll discuss each one of these very briefly

in conceptual terms. The first one is Bayes and the idea here is you want to get what

are called Posterior Probabilities. Most analyses give you the probability value for the data

given the hypothesis, so you have to start with an assumption about the hypothesis. But

instead, it's possible to flip that around by combining it with a special kind of data

to get the probability of the hypothesis given the data. And that is the purpose of Bayes'

theorem, which I've talked about elsewhere. Another way of finding out how well things

are going to work is through Replication.

That is, do the study again. It's considered

the gold standard in many different fields. The question is whether you need an exact

replication or a conceptual one that is similar in certain respects. You can argue

for both ways, but one thing you do want to do is when you do a replication then you actually

want to combine the results. And what's interesting is the first study can serve as the Bayesian

prior probability for the second study. So you can actually use meta-analysis or Bayesian

methods for combining the data from the two of them. Then there's hold out validation.

This is where you build your statistical model on one part of the data and you test it on

the other.
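A minimal sketch of holdout validation, assuming simulated data, a straight-line model, and a 70/30 split; none of these choices come from the lecture, and the helper names are hypothetical.

```python
import random
from statistics import mean

def fit_line(points):
    """Least-squares slope and intercept for (x, y) pairs."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x, _ in points
    )
    return slope, y_bar - slope * x_bar

def mean_squared_error(points, slope, intercept):
    """Average squared prediction error of the line on the given points."""
    return mean((y - (intercept + slope * x)) ** 2 for x, y in points)

rng = random.Random(0)
# Simulated data: y = 3x + 5 plus noise (assumed purely for illustration).
data = [(x, 3 * x + 5 + rng.gauss(0, 4)) for x in range(100)]
rng.shuffle(data)

train, test = data[:70], data[70:]        # the 70/30 split is an arbitrary choice
slope, intercept = fit_line(train)        # build on one part ...
print(mean_squared_error(train, slope, intercept))
print(mean_squared_error(test, slope, intercept))   # ... test on the other
```

The second number is the one that matters for generalizability, since the model never saw those cases while it was being built.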

I like to think of it as the eggs in separate baskets. The trick is that you

need a large sample in order to have enough to do these two steps separately. On the other

hand, it's also used very often in data science competitions, as a way of having a sort of

gold standard for assessing the validity of a model. Finally, I'll mention just one more

and that's Cross-Validation. Where you use the same data for training and for testing

or validating. There's several different versions of it, and the idea is that you're not using

all the data at once, but you're kind of cycling through and weaving the results together.

There's Leave-one-out, where you leave out one case at a time, also called LOO. There's

Leave-p-out, where you leave out a certain number at each point.

There's k-fold where

you split the data into say for instance 10 groups and you leave out one and you develop

it on the other nine, then you cycle through. And there's repeated random subsampling, where

you use a random process at each point. Any of those can be used to develop the model

on one part of the data, test it on another, and then cycle through to see how well it

holds up under different circumstances. And so in sum, I can say this about validation. You

want to make your analysis count by testing how well your model holds up from the data

you developed it on, to other situations. Because that is what you are really trying

to accomplish. This allows you to check the validity of your analysis and your reasoning

and it allows you to build confidence in the utility of your results.
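The cross-validation schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the model is a straight-line fit, and the function names (`k_fold_indices`, `fit_line`, `mse`, `cross_validate`) are hypothetical helpers written for this example. It shows k-fold with 10 groups and treats leave-one-out as the special case where the number of groups equals the number of cases; leave-p-out and repeated random subsampling are not shown.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every case is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, points):
    """Average squared prediction error of the fitted line."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)


def cross_validate(points, k, seed=0):
    """Develop the model on k - 1 groups, test on the one left out, and cycle through."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(points), k, seed):
        model = fit_line([points[j] for j in train_idx])
        scores.append(mse(model, [points[j] for j in test_idx]))
    return scores


# Simulated data (an assumption for this sketch): y = 1 + 0.5x plus noise.
rng = random.Random(7)
points = [(i / 5, 1.0 + 0.5 * (i / 5) + rng.gauss(0, 0.5)) for i in range(100)]

ten_fold = cross_validate(points, k=10)          # 10 groups, leave one out each time
loo = cross_validate(points, k=len(points))      # leave-one-out: one case per group

print(f"10-fold mean MSE: {sum(ten_fold) / len(ten_fold):.3f}")
print(f"LOO mean MSE:     {sum(loo) / len(loo):.3f}")
```

Averaging the per-group errors is one simple way of weaving the results together; how much they vary from group to group gives a sense of how well the model holds up under different circumstances.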

To finish up our

discussion of "Statistics and Data Science" and the choices that are involved, I want

to mention something that really isn't a choice, but more an attitude. And that's DIY, that's

Do it yourself. The idea here is that, really, you just need to get started. Remember

data is democratic. It's there for everyone, everybody has data. Everybody works with data

either explicitly or implicitly. Data is democratic, so is Data Science. And really, my overall

message is You can do it! You know, a lot of people think you have to be this cutting

edge, virtual reality sort of thing. And it's true, there's a lot of active development

going on in data science, there's always new stuff. The trick however is, the software

you can use to implement those things often lags. It'll show up first in programs like

R and Python, but as for showing up in a point-and-click program, that could be years.

What's funny though, is often these cutting edge developments don't really make much of

a difference in the results or the interpretation.

They may in certain edge cases, but usually

not a huge difference. So I'm just going to say analyst beware. You don't necessarily have to

use them, it's pretty easy to do them wrong, and so you don't have to wait for the cutting

edge. Now, that being said, I do want you to pay attention to what you are doing. One

thing I have said repeatedly is "Know your goal".

Why are you doing this study?

Why are you analyzing the data, what are you hoping to get out of it? Try to match your

methods to your goal, be goal directed. Focus on usability: will you get something out

of this that people can actually do something with? Then, as I've mentioned with that Bayesian

thing, don't get confused with probabilities. Remember that priors and posteriors are different

things just so you can interpret things accurately. Now, I want to mention something that's really

important to me personally. And that is, beware the trolls. You will encounter critics, people

who are very vocal and who can be harsh and grumpy and really just intimidating. And they

can really make you feel like you shouldn't do stuff because you're going to do it wrong.

But the important thing to remember is that the critics can be wrong.

Yes, you'll make

mistakes, everybody does. You know, I can't tell you how many times I have to write my

code more than once to get it to do what I want it to do. But in analysis, nothing is

completely wasted if you pay close attention. I've mentioned this before, everything signifies.

Or in other words, everything has meaning. The trick is that meaning might not be what

you expected it to be. So you're going to have to listen carefully and I just want to

reemphasize, all data has value. So make sure you're listening carefully. In sum, let's say

this: no analysis is perfect. The real question is not whether your analysis is perfect, but whether you can

add value. And I'm sure that you can. And fundamentally, data is democratic. So, I'm

going to finish with one more picture here and that is just jump right in and get started.

You'll be glad you did. To wrap up our course "Statistics and Data Science", I want to give

you a short conclusion and some next steps. Mostly I want to give a little piece of advice

I learned from a professional saxophonist, Kirk Whalum.

And he says "There's

Always Something To Work On", there's always something you can do to try things differently

to get better. It works when practicing music, it also works when you're dealing with data.

Now, there are additional courses here at datalab.cc that you might want to look at.

They are conceptual courses, additional high-level overviews on things like machine learning,

data visualization and other topics. And I encourage you to take a look at those as well,

to round out your general understanding of the field. There are also, however, many practical

courses. These are hands-on tutorials on the statistical procedures I've covered, where you

learn how to do them in R, Python, SPSS, and other programs. But whatever you're doing,

keep this other little piece of advice from writers in mind, and that is "Write what you

know". And I'm going to say it this way.

Explore and analyze and delve into what you know.

Remember when we talked about data science and the Venn Diagram? We've talked about the

coding and the stats. But don't forget this part on the bottom. Domain expertise is just

as important to good data science as the ability to work with computer coding and the ability

to work with the numbers and quantitative skills. But also, remember this. You don't

have to know everything, your work doesn't have to be perfect. The most important thing

is to just get started; you'll be glad you did. Thanks for joining me and good luck!