Programming Isn't Math, It's Linguistics
767 segments
[Music]
Huh?
[Music]
[Music]
[Applause]
[Music]
What you just saw is a quirk in every
single language. This is actually a
subset of issues called linguistic
ambiguity. If I'd had more context, I
might have understood the original
intent. But you know we don't always
have those kinds of luxuries.
Interestingly enough, programming
languages face very similar challenges.
We as programmers instead of
communicating personto person are now
communicating person to compiler.
Whether you're coding in C++, Java, Rust
or any programming language, compilers
being a human-made invention are also
imperfect at interpretation. This is why
since they can face very similar issues,
programming can also fundamentally be
considered a linguistics issue. First of
all, I have a question for you. I've got
a very small program over here. What do
you think the output of this program is
going to be? We've got a few options
here. First would be it prints out
partially true. Second would be it
prints out totally false. Or I guess a
third option could be it prints out
absolutely nothing. So, pause for a
second and give me your best guess.
might not be exactly what you would
expect, which you probably got because
I'm sitting here asking that as a
question.
So, let's go over. I'm going to compile
my program. We're going to ignore our
warning. I'm going to do /hm.
We've got our answer totally false. Now,
you might be thinking, looking back at
the code, this really doesn't make any
sense. I would probably guess that the
output would be absolutely nothing. But
instead, we're getting this totally
false statement. And that's because this
is part of a language ambiguity in the
syntax called a dangling else. And we're
going to get into a real world example
of this a little bit later. Nearly every
distinct programming language issue can
be traced back to a very similar human
language issue. In all these examples,
I'm going to talk about the English
language quirk and then I'm going to
talk about the very equivalent
programming language quirk and I'm going
to kind of sort them into different
buckets of the three distinct
compilation phases which I'll also talk
about in a second. Now, English in all
languages for that matter can have a lot
of different ambiguities and one of my
favorite examples of this is actually a
book from 1967 called Beyond Language:
Adventures in Word and Thought. And this
explores recreational linguistics. And
actually, I think that would be a really
good name for coming up with insults,
like an official name. I'm not insulting
you. I'm just practicing my recreational
linguistics. One of the best sentences
from the book, I think, is buffalo
buffalo buffalo buffalo buffalo buffalo
buffalo buffalo. Yes, that is a real
sentence. Don't even question me. Just
go look it up. But that's because the
word buffalo can act as both a noun and
a verb. But relating this back to
programming languages with compilers,
these kinds of ambiguities are even more
serious and important to clarify. If you
think about it, compilers need to be
correct 100% of the time, but they also
don't have the advantage of being able
to ask for more context when they need
it. Let's take a minute to think about
how compilers actually parse the syntax
of a program. Now, compilers have it a
little bit worse off than humans because
they have to do this in multiple
different phases of parsing. And you can
kind of break this up into three
distinct stages. The first stage is
going to be the lexical analysis. The
second is going to be the parsing stage.
And then the final stage is going to be
the semantic analysis stage. So during
the lexical analysis, it's going through
and it's breaking up the tokens of your
program. So like the variables, the
operators, and it's parsing those out
into distinct tokens. And then the next
phase is going to take all of those
tokens in your program and turn those
into an abstract syntax tree or an a.
And during that second parsing phase,
it's this is going to be the part where
it's finding all those syntax errors
that are so irritating in your code,
like if you forgot to add a semicolon or
something in your C++ program. Side
note, when I was in school, I remember I
came up with the ten commandments of
computing and the first one was remember
the semicolon to keep it holy.
Don't forget guys, during the final
stage, the compiler uses that a and
parses it to look for semantic
correctness in your code. For example,
it does type checking or maybe checks
for scope resolution inside of your
code. So if you have any kind of logical
issues with your code, this is where the
compiler discovers those. Like for
example, if you're trying to use a
variable that's out of scope. Now, if
you're looking for a very understandable
example of how a compiler works, I
highly encourage you to take a look at
the local C compiler, which is available
open source on GitHub, and this is going
to be a lot more readable than trying to
understand a really big compiler like
GCC or clang for example, that might be
spread across like literally hundreds of
different files. So, this is really well
documented and easy to understand. I'm
going to go over to the LCC GitHub
repository. And just side note on here,
this has to be the oldest commit I've
ever seen in my life from 23 years ago.
That's absolutely insane. That just
blows my mind. I'm going to go over to
the source code and you can see we have
all of the different files that kind of
make sense to put it into the three
three different phases of the compiler
analysis. You can see like the tree
parsing it into an abstract syntax tree
and you can see the process where it's
doing both semantic and syntactic
analysis. If you look at tree, I think
it's very descriptive over here. You can
see it actually going through and
parsing the different op codes inside of
this and then constructing that a so it
can do the remaining analysis on it.
Going back over to our linguistics
challenges, let's think about how these
ambiguities can come into existence.
Now, older languages are going to face a
lot of these challenges because if you
think about it, you have an old existing
language and you're constantly adding
new additions on top of that. So, it
kind of makes sense how these issues can
happen. But this can also happen a lot
inside of younger languages as well,
which are also not perfect. I think a
really good example of this is the word
literally. Now, if you think about it in
old original English, literally
literally meant not figuratively. So it
actually happened. But as English has
progressed, literally has started to
change meaning to actually mean
figuratively.
And a lot of people use it now as kind
of a synonym for very, which drives me
literally nuts.
Now, now going back over to some C++
examples, let's take a look at them. And
we're going to use C++ since that's
literally the best programming language.
Remember those three different compiler
phrases I talked about previously? Let's
take a look at each one of those
individually. So for our first bucket,
those would be syntactic quirks. And in
linguistics terms, you could call this a
grammar ambiguity, aka what happens if
the same token fits into multiple
different parse trees. And C++ has a
really good example of this one, and
it's very fittingly called the most
vexing parse. Think about this English
sentence for a second. Visiting
relatives can be boring. Now, if you
look at the word visiting, is visiting
acting as a verb here, like the act of
visiting your relatives, or is visiting
acting as an adjective describing a
noun, as in you're visiting relatives,
can be boring. If you see, the word
visiting can have multiple different
meanings, completely changing the
structure of the sentence. Now, keeping
that English example in your mind, let's
go over to a quick C++ example. So for
this example, I have a very short
program right here. I have a couple
declarations up top. We have a simple
function definition that's going to be
declaring a function named m that's
going to be returning a vector of
integers. And then we have our
definition for that function down here.
Keeping that in mind for the next few
seconds or or minutes pro probably
seconds, I hope seconds. Let's go down
to our main function and instead let's
say I want to declare a variable. And
we're going to name our variable n. And
this is of type vector. And if I want to
default initialize this with a few
values, I can initialize this with let's
say 10 values all initialized to the
integer five. And if I go down here,
adding a couple of extra values just
because why not. And I'm going to be
printing out the size of that vector as
well as all of the numbers inside of it.
So let's just run this for sanity and
make sure this works.
So we do slash vex. And you can see
we've just got our vector printed out as
well as the size just like we would
expect. But now take a second and think,
what if we didn't want to have default
initialized values inside of our vector?
What if we just wanted to have an empty
n vector? You might think we can just
get rid of those values. And let's do
something like
this for example. This looks right,
right? But why am I getting some syntax
errors over here? Let me save this and
run this. We'll compile
and you see doesn't compile. What's
going on here? So if I go up here, I can
see invalid range expression of type
vector int. No viable begin function
available. Well, that's very
interesting. See what's happening. If
you look at this closely, there is no
difference in the declaration of this
vector n as well as this vector m over
here. So the compiler is basically
saying, I don't know whether this is a
function definition or a function
declaration or if this is declaring a
new variable for a vector of integers.
This is an example of a syntax
ambiguity. Now the way to fix this,
let's get rid of this. It looks it looks
really right. So if you're a beginner,
you would look at this and you would
think, huh, what's wrong? And it's
probably why it got the name the most
vexing parse. But the way to fix this in
old C++ was like this where you have to
set your new variable equal to a vector
of integers. But this is just like the
longest syntax in the world. But it does
indeed get rid of our syntax errors. But
if you go over to C++ 11, luckily in
modern times we have this declaration
option available. we can just use curly
braces and it default initializes our
vector to nothing. No default values,
just calls the default constructor. And
if we want to make sure that this does
actually compile, got my dummy values
down here. So we should have our vector
size two. So I'm going to recompile.
Let's run our vex. And here we go.
Worked just fine. Now we can
unambiguously declare a vector of
integers as well as a function returning
a vector of integers. Confused?
Hopefully not. Our next syntactic quirk
is going to be the dangling else, which
we talked about briefly at the beginning
of this video. So, let's think about
this sentence for a second. If you see
Sam, if he's busy, leave a note.
Otherwise, call me. Now, this is kind of
ambiguous, but it's mostly related to a
punctuation problem here. Like what if I
don't see Sam at all? Am I supposed to
call you or not? So if we rearrange the
punctuation and the structure a little
bit and we get this, if you see Sam,
leave a note if he's busy, otherwise
call me. The meaning becomes a lot
clearer. So this kind of ambiguity of
association happens a lot in programming
as well. Going over to our C++ example
for our dangling else, I've got a really
simple program here. Now, if you got the
question wrong in the beginning of this
video, now you have a second chance. If
you can't remember what the question
was, you should rewind and come back
because you'll probably get this one
right. So, we have in our simple
program, this is kind of simulating a
real world scenarioish
of logging into a machine. So, we have a
couple statements here basically saying
like login was successful, login wasn't
successful, and let's go to our main
function. I've set this to is logged in
is true. So, we're going to successfully
log in and we're not an admin. So, we
should not be able to log in as an
admin. And then this is our nice little
login check over here. And then at the
end after we've logged in, we're going
to get like a nice little goodbye
message. So, let's take a closer look at
this if statement. And again, tell me
what you think is going to print here. I
should have kind of a a boolean and over
here. It should be if is logged in and
is admin, then we're going to log in as
an admin. Otherwise, we're just going to
completely fail. So, if I'm logged in,
but I'm not admin, I should really just
be able to log in and then get the
goodbye message. But let's go and
execute this. So, I'm going to compile
this.
Ignore our warning. It's going to be the
the phrase of the day, ignore all
warnings.
But what you see here, even though we
should be logged in, we're getting a
failed to log in. Get out. And then we
get our thanks for logging in with us
today. I didn't code that part. Great.
So, what's happening? If we look at this
if else, it looks like it should be
working. Right now, what's happening is
this else is ambiguous. It doesn't know
whether to associate this else with this
if statement or with this if statement.
Kind of a side statement here. I
honestly, personal pet peeve. I cannot
stand when you don't use the curly
braces. like always use the curly braces
because what space are you trying to
save? You're you're you're saving like a
bite of space. It's just absolutely not
worth it and it makes your code so much
less reason readable. So, let's go over
to the fix of this. So, I'm going to log
in or get rid of this login right here.
And if we fix this with some beautiful
curly braces, we can make this so much
less ambiguous. Now we know if logged in
and is admin. We have our login admin
right here. Otherwise, we're going to
fail that login if our login was
unsuccessful. So if I save this and
recompile,
you can see we've managed to fix all of
our challenges with the ambiguous else
association. All right, drum roll.
It's time to look at our next bucket.
And this is a whole different can of
worms.
It's not worms, guys. It's just monster.
It's not even monster worms.
So, for our next phase, remember the
first phase of the compiler parsing was
that lexical analysis phase. So, that's
taking in all of your characters and
parsing them into their respective
tokens. But this parsing can be pretty
tricky if you don't have the proper
context to know what token some
characters are going to map out to.
Think about this sentence for a second.
He said, she said hi. Now, we have the
auditory advantage of being able to hear
like my tone and inflection, so you can
probably understand what meaning I'm
trying to convey. But let's say you read
this sentence instead inside of a book
or something and you're looking at all
of the different punctuation inside of
it. In English, we have this concept of
nested generics. So, this is really
convenient for conveying meaning because
you can have your single quotes inside
of your double quotes to show what
you're actually trying to say. Now,
programming languages, especially
earlier ones, faced these same kind of
trickiness and challenges. Now, relating
this back over to our C++ example, we
can also have this kind of idea of
nested generics inside of the C++
programming language as well. So if I
took my previous vector of integers n
and I instead decided to nest this with
an additional type and make this a
vector of vector of integers then I can
put this one type inside of another type
and nest this. Now this looks perfectly
valid and it's going through I'm adding
some different vectors to this printing
this out once again. But let's see what
happens when we compile this. And I'm
going to go back in time for a second.
And I'm going to go compile this with
C++ 98. So let's do this.
Traveling back in time. And something
interesting is happening. So we're
actually getting an error here. And
let's read that error message cuz it's
really interesting. So error. A space is
required between consecutive right angle
brackets. What does that mean? Basically
that means that these two right angle
brackets are being recognized as a shift
operation. So this is going to be a
shift write operation and the compiler
during the syntax parsing process is
basically saying I don't know whether
this is a nested type or if this is a
shift operation. So it basically can't
tell what these tokens are supposed to
be mapped to. And as our C++ 98 compiler
is so kindly telling us, the old fix for
this was instead of doing this, you had
to actually add a space inside of here
so that it could tell it wasn't supposed
to be this shift write operation. And
you can do this and recompile with C++
98, but it won't accept any of my nice
pretty syntax over here in my for loops.
and you have to do this awful
iterator syntax. So, I'm not going to
put you through that. Instead, we're
going to travel forward in time and go
to C++ 20 and compile this. So, I'm
going to have my original version cuz
they figured out how to parse this
properly eventually. And instead of
doing C++ 98,
jumping forward to 20. And here we go.
We are we have successfully compiled
this and we can execute this. perfectly
fine. Wow, fascinating vectors. All
right, let's move on to our next bucket.
And this is probably going to be our
worst bucket, and this is semantics.
Now, context, context, context is so
important here, but I think English is
probably one of the worst languages for
this since we just have so many
overloaded terms. So, take a second and
think about this sentence. Polish
ambassadors are rare. Imagine you were
reading this and you couldn't hear how I
was pronouncing the word Polish. Well,
since I have the capital P at the
beginning, do I mean Polish or Polish? I
would say Polish ambassadors are
probably pretty rare. So rare they might
even be non-existent. Now, through
pronunciation and context, you can tell
what I'm trying to say, but when it's
written down, the only subtle
differentiator that's not always
available to us as a differentiator is
the capital P. So consider this sentence
instead. Meeting Polish ambassadors is
rare. Now we've changed this so we're
able to explicitly type cast the the
capital P to the word Polish. So we can
say for sure we do mean a Polish
ambassador. I think I'd enjoy being a
Polish ambassador. When I was a kid and
I was picking chores, I'd always pick
polishing as a chore. It's very, very
satisfying. I don't think I'm qualified
to be a Polish ambassador
unless you guys will take me. Now, a
great programmatic example of this is
dependent type names. So, if I go over
to my C++ program, you can see I have a
simple function template right here,
passing in a container, getting an
iterator for that container, and then
this allows me to print out the first
element inside of that. If I go down to
my main function, you can see I've got
two different vectors. One is a vector
of integers, and one is a vector of
strings. And I'm able to call the same
function and pass in those different
types of containers. Now, if I go up
here, let's focus on this line just a
little bit. I want to try to compile
this code and let's see what happens.
So, I'm going to press enter. Oh, and
we're getting an error. And the error
message is missing type name prior to
dependent type name. So, it's really
complaining about this particular line.
Now I want to copy this and let's go
over to our CPP reference and look at
our specific container that I'm using in
this case which is a vector. And if I go
down
I can see part of our member types we
have this const iterator available. Now
if I go back over to my code you can see
I'm expecting this to be a type const
iterator. The problem is this right
here. These two colons can also be
referring to potentially let's say a
static member variable. So the compiler
is saying I don't know which one you're
trying to do. And how do we fix that? We
can be explicit about it just like we
were inside of the Polish Polish example
where we used the capital P in the
sentence to distinguish that we meant
Polish. All we need to do is simply
specify explicitly that this is a type
name that we're expecting. So save this
and now if we compile our code comp
compilation works perfectly and we can
run this and see we can either print out
the first element as 42 the integer or
42 the string. Sometimes the smallest
tokens make the biggest difference about
how a statement is parsed. Really good
example of this is the humble colon
operator. Consider this sentence. pack
the following snacks, water, maps versus
pack the following snacks, water, maps.
Very big difference between the two
interpretations. So the colon here is
kind of like a whoop whoop to your brain
of telling you what you're supposed to
expect. Otherwise, in the second
sentence, I kind of feel like I'm about
to start eating water or maps. Not
really not really the most tasty meal
I've ever had. If you like C++, don't we
all? then you'll know that the template
feature is one of the most powerful
parts of the C++ programming language.
And much like the colon, the keyword
template in C++ is an extremely
important marker for the language to
understand. If you use the word
template, it completely changes the
interpretation of the C++ compiler. Now,
let's move on to our last C++ example.
And yes, it includes templates. Going to
go over to my program. I have a strct
tool right here with a function template
inside called perform action. And if I
go down here, I have another function
template. It's going to be our run
processor that's going to trigger our
perform action over here. And then I
have a processor strct that contains my
tool strct declared inside. And my main
function is simply going to trigger this
entire run processor process. How many
times can I say the word process in a
sentence?
So, if I go back up, let's find the
problematic line right here inside of
our run processor function template when
we're calling this perform action. Pay
attention to this particular syntax
right here. I'm going to compile this
and see what happens. Now,
interestingly, I'm getting an error
here. It says use template keyword to
treat perform action as a dependent
template name. Basically, what's
happening here is the compiler does not
realize that this is actually a
template. Now, it's looking at this
particular symbol right here, our less
than sign, and it's thinking that it is
actually a less than sign, because it's
not taking in the entire context of
this, that we're simply performing an
action on an integer. Now, how do we fix
this? So, what we need to do is we need
to be explicit with the compiler and
tell this you need to treat this as a
template. So simply enough, we'll just
say that this is a template and then we
can trigger our perform action template
with our integer and run our processor
on our tool ID 42. Makes a lot of sense,
guys. Trust me.
So I'm going to recompile this. Here we
go. Worked perfectly fine. If I run
this, perform action on tool 42
successfully. Being explicit is very
helpful when you're programming. Now,
I've been throwing a lot of shade at
compilers in this video, but compilers
are extremely complicated and super
difficult to write. So, it makes sense
why they would have so many ambiguities
and syntax problems. I actually have an
entire video that talks about like the
history of compilers and how they can be
self-hosted that you should definitely
take a look at. But, if I'm going to be
talking so much smack about compilers, I
should probably try writing my own kind
of compiler, which you can do inside of
C++. So I have created my own custom
parser for my extremely sophisticated
language and this is called
Remmbercript.
So if I go over to my main.cp, this is
going to run my Remmber program. So it's
going to take in the input file that
it's going to execute. And if I go over
to my parser.y, you can see my extremely
sophisticated parser. What this is going
to do, this has a vector of strings that
has really good song lyrics inside of
it. And here are my tokens. I have my
Rember token and my semicolon token. So,
you remember that tokenizing portion of
the compilation process. Here I have my
own custom tokens inside of this. It's
actually really cool that you can do
this. I think it's really fun. Now, if I
go down here, here's the logic for my
interpreter. You can see every time I am
encountering the Remmber semicolon
tokens together I am going to print a
random line of this song. Now let's go
through and let's execute rememberers
script. Now I have my repository here
locally. I've already made my parser so
I can run dot /erscript. It makes me
laugh every time.
And then I need to pass in my custom
remember code. Well first of all let me
show you. Let's cat one of these files.
Let's do test remembermber.
Every time you can see I've got my uh
tokens here. Remember semicolon remember
semicolon. So each one of these is going
to print a random song lyric. So let's
do dot / remembermber script
and let's pass in test.ber.
Here we go. We have our song lyrics that
are printing randomly to the console.
And if we want to get really crazy, let
me show you. We can even do inline
remember script.
So we can do remember remember all in
the same line.
So let's execute this dot / remembermber
and then we'll pass in our test inline
remember.
Here we go. Now we've got our three
random song lyrics both in line and on
separate lines. Tell me you haven't seen
that kind of sophistication before. I
really got to stop making these examples
really late at night with no sleep and
multiple monsters.
I think this was only funny in my head
when I did it. I don't I don't remember
why.
Anyway, all jokes aside, I highly
encourage you to try to write your own
custom parser in C++ because I think
it's really cool to be able to do. It
kind of shows you the challenges of it
and it's just very powerful when you can
write your own custom syntax. I think
that's really nice to be able to declare
your own tokens. Do you remember 21st
day December
always remember happy day? The real
question is what can we do about all
these different ambiguities and
complexities in languages? And actually
formal language theory is a really
fascinating field to study. Now,
although there aren't really many true
examples of contextfree languages, we
can get pretty close. And some languages
are going to be better or worse than
other languages at reducing these kinds
of lexical complexities. A really neat
example of this is ASD STE 100, also
known as simplified technical English.
And this was created by the European
airline industry as a standard form of
English for writing technical manuals
in. And this was created to be a very
easily understandable language for
non-English speakers. You can actually
find this language online. It's super
simple and really neat. It's only
limited to a total of about 900 words.
As for programming, outside of toy and
experimental languages, lisp and lisp
like programming languages are kind of
the clear winners here. Now, a lot of
people really don't like the multitude
of parentheses inside of lisp, but it
makes it really clear and super
unambiguous for the compiler. I remember
my programming teacher in college
talking about when she was learning to
program and she was learning in lisp and
she had to count. She was told by the
teacher to count on her fingers opening
and closing parentheses so that you
could keep track of how many parentheses
you needed to close at like the very end
of a statement which is absolutely
crazy. But other than lisp, ADA and
hasll are pretty good about this. And
while C++ might be my favorite
programming language, historically it's
not great at this. It's had a lot of
lexical confusion in the past. So what
do you think? In this video, I mostly
stuck to C++ and English examples, but
I'm sure there's a ton of other
interesting lexical ambiguities in many
different languages. So, feel free to
leave yours in the comments section
below. In any case, this is a super
tricky field, but it's also very
interesting. So, as always, thank you so
much for watching everyone. I hope you
enjoyed this video. to where he wired
out.
Ask follow-up questions or revisit key timestamps.
This video explores the concept of linguistic ambiguity and how it relates to programming languages. It highlights that just as human languages can be ambiguous, programming languages, despite being created by humans, also present ambiguities that compilers struggle to interpret. The video discusses three main categories of programming language ambiguities: syntactic quirks, semantic issues, and lexical analysis challenges. It provides examples from C++ to illustrate these concepts, such as the dangling else problem, the most vexing parse, nested generics, dependent type names, and the use of the 'template' keyword. The presenter also touches upon the history of compilers and even demonstrates creating a custom parser for a simple language called 'Remmbercript'. Finally, it discusses approaches to mitigate ambiguity in languages, mentioning simplified technical English and Lisp as examples of more unambiguous systems.
Videos recently processed by our community