So, my name is David Sankel.
I'm from Bloomberg, and the topic of this talk is
"So You Inherited a Large Code Base."
So let's talk about the problem.
So this is you.
You've read all the books.
You are a master of writing the code.
You can make a new API.
You can make a beautiful API, and it is sweet.
You go to the conferences, watch all the best talks.
Basically, when you want to write some code,
it's like an open field.
You start a text editor, it's completely empty,
and you build a beautiful API, right?
Something, template metaprogramming,
really good STL-style code, whatever it is,
you've got this thing down, right?
You're one of the top five percent,
the five percent that can write
a really high quality API, and it is sweet.
I don't know how many of you guys have written an API
and watched that sucker take off and get a ton of users.
It is awesome, super fun.
So, you're a code ninja.
And you've got a new assignment,
and this is to make an existing codebase
into a best-in-class codebase,
and this codebase is big.
This is a different problem, right.
All that stuff that we learned
about how to make a nice API from scratch,
most of it doesn't really apply.
So this is not a talk about how to create
high-quality, maintainable software from scratch, no.
There's another talk I gave about building software capital
that's about that, but that's not what this talk is about.
This is actually a much harder problem.
So let's look at some characteristics of large projects
and see if we can get some hints
as to what we can do about it.
So large projects are usually successful.
There's a reason that they're large.
Someone invested a ton of money into it and it's successful.
Widely in use, ton of users.
Many past contributors.
Usually one person doesn't make a really large codebase,
and there's gonna be varying degrees
of quality and maintainability.
Now, raise your hand if you've worked with
a project that's measured in the millions of lines of code,
and it's been uniform, really amazing quality?
All right, I've seen that codebase.
There are some amazing qualities.
Now, frequently there's mixed styles in the code.
One person isn't doing the same style everywhere.
Hidden use cases, this is the real,
this is the real pain in the neck.
There are people using your code
in ways which you do not know about,
and it is critical that that kind of functionality,
which you don't even know exists in your codebase,
stays working that way,
because if you change it, it'll break something.
Hidden use cases, they're a pain.
And partial refactors.
This one's ugly.
How many of you have worked with a codebase where
someone has started the process of changing something
and it never got complete?
Looks like everybody, right, this is really common,
and if you have a large codebase,
this has happened multiple times
over a large period of time, 10, 20 years,
and there's a lot of that stuff in there,
so these are the characteristics of large projects.
What are the unique problems that a large codebase faces?
Those were the characteristics,
what are the unique problems?
Let's talk about toilets.
Toilets are awesome devices when you think about it.
I would venture to guess that
every single person in this room
has used this device at least once in the past 24 hours,
or at least in the last 48 hours, I hope.
But anyway, there was a new innovation to these devices
that came out in the past, I don't know, 10, 20 years,
and it's pretty cool.
What you can do is
when you close the toilet seat when you're done,
instead of having to gently close it,
you give it a little tap,
and it just goes (hissing), closes on its own.
You've seen these things, right?
So nice, you're done with the bathroom,
you just tap that thing and you're out of there.
Well, wash your hands, right, and then you're out of there.
Well, if you get one of these things
installed in your house, it's wonderful, you know,
you just psst do that, and then the first thing that happens
when you go to someone else's house, bam, right?
It's like an earthquake when these things shut.
You may not like shatter the thing,
but it's gonna be seriously dented, okay?
And this leads to the first major problem
with large codebases, and that is,
assumption is the mother of all mess ups.
You get into a large codebase
and you start making assumptions
based on what you've seen previously,
and those assumptions don't work, and bad things happen.
You cannot make assumptions with a large codebase.
What are some dangerous assumptions that folks can make?
You assume that tests cover all the cases, okay.
Can't assume that most of the time.
You assume the documentation is accurate.
Most of the time, you would hope.
You assume that there are certain use cases.
You assume what the cost of a migration
to something new would be.
And you assume some kind of semantics.
These are just dangerous assumptions to make.
When you make an assumption and there's a mistake,
with a large codebase, the mistake has big ramifications.
If you've only got one user, and you make a mistake,
well, the user tells you off or whatever,
and you say, Sally, I'm sorry, you know,
and you do something new, and it's fine.
But if you've got a thousand users,
or even if you have a few users but it's extremely critical,
the ramifications are big,
because you don't gain wide adoption on a codebase
unless it sat over a period of time
and it's grown for a certain amount of time,
so mistakes are a big deal.
That means no cowboys.
Cowboys, absolutely not.
Do not let cowboys into your large codebase
because they will screw something up,
but just to be clear, ninjas are fine.
So another common mistake with codebases.
So let's say you're working on your house.
Maybe you're installing a lamp in the ceiling
or something like that, and then you go to the garage,
'cause you need to get a screwdriver, and you see this,
and you think, hey, maybe I should clean this up,
and then, you end up spending hours,
and hours, and hours cleaning up the garage
and you never actually get to the point to where
you install the new light fixture in the ceiling.
This happens in a codebase the exact same way.
Now let's say you've got a bunch of components
and that one in the yellow there, that's the gold one,
that's the one you need to make a modification for
because there's some kind of core business value or whatever
and you're modifying this component,
and you see all the things that it depends on, and you think,
boy, there's a lot of cleanup that can be done around here,
and maybe you want to do that.
Of course, you've got a huge codebase,
and it's been sitting there for years.
Of course, technology has progressed.
Of course, you have progressed,
and you're gonna see a ton of things that can be improved,
but it's easy to get distracted
and start working on the things that you see, right.
Is that really the highest priority thing?
This is another really key
common pitfall with large codebases.
You can end up doing a refactor
that has a questionable benefit-cost ratio.
You might do a drive-by fix that has
huge implications without realizing it.
Something to keep in mind.
What is this thing?
I don't know, but it looks cool.
It is shiny, and I think I want some of that for my project.
Shiny new things.
This is a word I want you all
to insert into your vocabulary.
Neophilia, it's a real word.
An attraction for things that are new, love of the novel.
We all have it, right, as human beings,
new things are new, and they're cool.
However, when you're trying to prioritize
what you're gonna do with your large scale project,
neophilia should be accounted for in a negative way.
It should not come into the decision process
as to what you're going to do with your large project,
what you're going to change to,
but you gotta recognize that you got it, so neophilia.
Maybe there's competing technologies.
Someone says, hey, this thing's new.
It came out from this fancy company
and we'd like to adopt this everywhere
instead of the other thing.
Hey, that happens all the time.
Maybe a new fancy build system, new tools.
Coding styles and standards, idioms and techniques.
Now I'm not saying something is bad because it's new.
That would be neophobia.
What I'm saying is that you have to be aware
of a natural tendency to want things that are new
just because they're new.
You have to weigh the factors, pro and con,
for adopting a new technology,
especially with a large codebase
because adopting anything costs a lot.
All right, final thing, baseball.
How many of you have played baseball?
Okay, many of you.
Maybe it's not so common outside of the United States
but it's a really cool game,
and when you play baseball or softball and you get a coach,
they always tell you something when you're at bat,
because when you swing the bat,
what people have a tendency to do is they swing it
almost all the way, and then they ting the ball,
and it doesn't go as far as you want,
so what does the coach always tell you to do?
Follow through, right, all the way through.
Finish the thing that you started,
and lack of follow-through is a big deal in large codebases
so let's do a fictional story of follow-through.
In 2000, a developer decides
to switch test frameworks, fine.
2002, the transition is 90% done.
That's awesome, 90% done, but priorities are shifted.
The developer goes on to do something else.
2004, a different developer
decides to switch test frameworks.
And in 2006, the transition of the 2002-style tests
is 90% done, but priorities shift.
Then in 2008, a developer decides to switch test frameworks.
You've got the idea.
If you don't follow through,
how many replicated technologies
are you gonna have in your codebase?
It'll be a function of time,
and this is a really big deal.
You've got to be able to follow through, or give up.
So follow-through, inability to follow through
leads to multiple, redundant technologies
that increase in number over time,
and what you end up with
is something like a Frankenstein's monster,
but even worse than Frankenstein's monster
because it'd be like a Frankenstein's monster
where one arm was like from one person,
another one's another person.
They don't even like line up right.
Bad, and a partial change is destruction of value,
and this is something so key to keep in mind.
It's just like if you have a kitchen, you're remodeling it.
You gotta take out stuff from the old kitchen
and you can't use it for a while, right,
it's destruction of value.
You don't get to a higher value place
until the new kitchen is put in
and everything is all functional.
So anytime you do some kind of code change
with your codebase,
and you take something and you're gonna
replace it with something else,
that intermediate stage is a destruction of value,
and it's important to recognize
and keep in mind that that's exactly what it is.
So here we've got the four pitfalls of large projects:
dangerous assumptions, distraction,
neophilia, and lack of follow-through.
All right, enough about the problem.
How do we deal with this?
So day one, establish your ground rules.
New code must be software capital.
Unit tests, peer-review, contracts,
documentation, et cetera.
That has to be the rule,
and I'm not gonna go into the detail
of what software capital looks like.
There's another talk about that,
and many other talks at the conference.
This is something we already know how to do,
develop really high quality stuff, from the get-go.
Make a rule, tech debt may not be
introduced into the code base.
It's very easy to enforce: in a given sprint,
you make the decision that tech debt
cannot be introduced into the code base.
You look at the end of the sprint.
Is there tech debt being introduced?
If yes, then revert it.
Do not allow it to happen.
And total quality.
Total quality, now this is somewhat controversial,
but what it means is the same amount of rigor
that you apply to your most commonly used API,
in terms of unit tests and code reviews and whatever,
applies to the most minimal, tiny little thing
that doesn't matter at this point, and the reason why
is because you don't know whether the code you write today
is gonna be heavily reused in the future.
You just don't know, so it's better not to take the risk
and write it all to the same quality.
I even have unit tests for some of my build code.
That's what I mean by that.
And then the second thing to do is
to create an infrastructure team.
If you've got a large code base,
which is measured in the millions of lines of code,
you need to have somebody or some group
responsible for that entire code base,
and establish continuous integration,
and coding standards, and tools, and so on and so forth.
And the thing to keep in mind is that your general approach,
whatever you decide is your main operation
is gonna define what the code is gonna look like
in the next five years.
If you allow a little bit of
tech debt in every once in a while,
that's gonna add up over a period of time
and you'll get a bunch of tech debt at the end of it.
Are you destroying value, or are you creating value?
These are things to keep in mind.
It's like the butterfly effect.
A little decision that you make that you do habitually
is gonna have a big ballooning effect
in terms of what the code is gonna look like
years down the line, so keep all that in mind.
Automation is essential.
Day 1.5, clang-format your code base.
I like this because it's a good, early success.
This is an operation you can apply to your entire code base.
It's low risk.
I've never heard of a clang-format run
which actually changed semantics.
Now if you're putting stuff in your comments,
or something like that, and it has semantic meaning,
please stop doing that,
and don't run clang-format on your code base
until you've fixed that first.
And this sets the stage for future refactors,
because if you're going to be making
wide changes to your code base, you need to have
some kind of automated formatting, otherwise,
the code's gonna end up looking really weird.
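As a sketch of what that first pass might look like, here's a small Python script that runs clang-format over every C++ file in a tree. The `-i` flag is clang-format's real in-place option, but the suffix list is an assumption; adjust it to whatever conventions your code base uses.

```python
import pathlib
import subprocess

CPP_SUFFIXES = {".h", ".hpp", ".cpp", ".cc", ".cxx"}

def cpp_files(root):
    """Yield every C++ source and header file under `root`,
    sorted so runs are reproducible."""
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.suffix in CPP_SUFFIXES:
            yield path

def format_tree(root):
    """Run clang-format in place over the whole tree."""
    for path in cpp_files(root):
        subprocess.run(["clang-format", "-i", str(path)], check=True)
```

Committing the result as a single, purely mechanical change makes it easy to review and easy to revert.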
So, I want to show you what it looks like
to do an automated code change to your code base.
There's been a lot of talk about using clang
to do automated refactors
but I want to show you how simple it is.
It really is not hard.
What we're doing here is we're using
the Python framework that's distributed with clang.
You just install clang, and then you start Python,
and you can import these modules,
and here's an example where we have
a find child function which is being defined.
It takes in a parent, which is a node in your parse tree.
It takes in a kind, which would be like
a function declaration or a switch statement,
or something along those lines,
and a spelling, which is a string,
and what it does is it gets the children of the node.
If the kind is the node's kind
and the spelling is the node's spelling,
it returns the node.
Pretty basic stuff, right?
This is not hard to implement.
Let's say you want to use this function we just defined
to find the abstract syntax tree node for main.
This one takes in a translation unit, okay.
This is something that clang will provide us,
and we call find_child on the cursor
associated with the translation unit.
This is the top level cursor in your translation unit.
The kind is a function declaration.
Main is a function declaration,
and the spelling of that function is main, that's it.
You call that, and now you have a cursor to main.
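A minimal version of that find_child helper might look like this. It's written against the duck-typed cursor interface (get_children(), .kind, .spelling) that clang's Python bindings expose, so the same function works on a real clang cursor.

```python
def find_child(parent, kind, spelling):
    """Return the first child of `parent` whose kind and spelling
    both match, or None if there isn't one.  `parent` is any
    cursor-like node exposing get_children(), .kind, and .spelling,
    which is exactly the interface clang's Python bindings provide."""
    for node in parent.get_children():
        if node.kind == kind and node.spelling == spelling:
            return node
    return None
```

With the real bindings, finding main is then `find_child(tu.cursor, CursorKind.FUNCTION_DECL, "main")`, where `tu` is the translation unit and `CursorKind` comes from `clang.cindex`.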
If you want to take a cursor
and you want to dump its output, here's a way to do that.
You just get the corresponding tokens and print it.
We've got some code that'll, when it does a refactor,
it'll just dump the tokens, one per line.
It's ugly as can be, right?
It's just a huge set of tokens.
Run clang-format on that thing.
Boom, you're as good as new.
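The token-dumping step is just as small. Here's a sketch: with clang's bindings, a cursor's get_tokens() yields token objects with a .spelling attribute, so collecting one spelling per line is all there is to it.

```python
def dump_tokens(cursor):
    """Return the source tokens under `cursor` as a list of spelling
    strings, in source order.  Printing these one per line gives the
    ugly-but-correct output you then hand to clang-format."""
    return [token.spelling for token in cursor.get_tokens()]
```

Piping `"\n".join(dump_tokens(cursor))` through clang-format turns the token soup back into readable code.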
This stuff is easy, and we've been able to do
refactoring of tests, like switch complete test frameworks
by running one of these scripts.
The script itself was, I think it was only like 500 lines.
You just gotta try doing this the first time.
Do it the first time, get the expertise,
and then you'll get more comfortable with
what kind of changes you can apply to your large code base.
And this here is just the boilerplate
you need to do to put it all together.
How do you get a translation unit?
You can pass in a compilation database directory.
If you're using CMake,
CMake is gonna generate a compilation database.
You just point it to your CMake directory.
What this compilation database does is
it provides you with the command line arguments
that you would use to compile something
for every single piece of source code
that's being built there.
If you don't want to use the compilation database,
you can just specify what the command line should be
for clang to interpret this file.
So in this example anyway,
you're looking at the compilation database,
you're getting the commands and the arguments,
and you call parse, get a translation unit
and you can get diagnostics.
These are like the warnings or the errors or whatever,
and you can print those out and return the translation unit.
That's it, this ties it all together.
It doesn't have to be a big, scary thing.
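A sketch of that boilerplate, assuming clang's Python bindings (`clang.cindex`) are installed. In a typical CMake-generated database, each entry's argument list starts with the compiler binary and ends with the source file name, so both are stripped before handing the flags to the parser; check your own database before relying on that.

```python
def extract_flags(commands):
    """Pull compiler flags out of the first compilation-database
    entry.  Each entry's `arguments` typically begins with the
    compiler binary and ends with the source file, so both are
    dropped."""
    for command in commands:
        args = list(command.arguments)
        return args[1:-1]
    return []

def get_translation_unit(build_dir, source_file):
    """Parse `source_file` with the flags recorded in the compilation
    database under `build_dir` (e.g. a CMake build tree), print any
    diagnostics, and return the translation unit."""
    from clang import cindex  # deferred so extract_flags works without libclang
    cdb = cindex.CompilationDatabase.fromDirectory(build_dir)
    flags = extract_flags(cdb.getCompileCommands(source_file))
    tu = cindex.Index.create().parse(source_file, args=flags)
    for diag in tu.diagnostics:  # warnings and errors from the parse
        print(diag)
    return tu
```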
Now I know that there's also a C++ level API for clang,
and it has some more powerful things,
but that requires compiling clang
or using a clang library and
getting all that whole build set up.
I think the Python's a lot easier
to get started with anyway, even if it doesn't have
all the same features, it has enough.
So general infrastructure automation.
Here's some things to automate.
So, refactors, these are what
we just talked about using clang.
If you change your coding standards, you want to be able to
apply the change to your entire code base,
and you can do that with these kinds of automation.
Coding conventions, documentation.
If you have some boilerplate that you have to do
for your documentation in your large scale code base,
you should automate that.
That way, you get it right, and it's not such a burden
for developers to work with.
One of the things I have seen is that if you put
a lot of constraints on what it takes to make a new file,
then people will have a tendency to be lazy
and want to add stuff to the old file.
And now of course, you could just say,
well, they should not be lazy and they should just do it,
but I would like to make it so
it's easier for them to make a new file,
like lower the burden for writing really good quality code.
Any kind of manual developer tasks that people are doing,
try to automate those, and finally, indexing the code base.
So Kythe is a really promising technology.
It can index your entire code base
using the clang parser and that kind of functionality,
and you can ask, who's calling this function,
and get an answer to that question.
That's very useful when you're trying to decide
how you're going to refactor a large scale code base
because you don't generally have
information like that at your fingertips.
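If you don't have Kythe set up yet, a rough per-translation-unit version of "who calls this function" can be pulled out of the same clang Python bindings. This is a sketch, not Kythe: `walk_preorder()`, `.referenced`, and `CursorKind.CALL_EXPR` are the bindings' real interface, and the duck typing keeps the function testable without libclang.

```python
def find_call_sites(root, function_name, call_kind):
    """Collect every node under `root` that is a call expression
    (node.kind == call_kind) resolving to `function_name`.  With the
    real bindings, `root` is tu.cursor and `call_kind` is
    CursorKind.CALL_EXPR."""
    sites = []
    for node in root.walk_preorder():
        referenced = getattr(node, "referenced", None)
        if (node.kind == call_kind and referenced is not None
                and referenced.spelling == function_name):
            sites.append(node)
    return sites
```

Unlike a real index, this only sees one translation unit at a time; a whole-code-base answer means running it over every unit in the compilation database.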
So this is my daughter, Carmen.
She just recently turned this many years,
and what she's standing next to is a pile of books.
Does anyone want to like throw out a guess
as to how many lines of code that pile of books is?
I guess we don't think about this enough, right?
That's one million lines of code, right there.
About as tall as my daughter,
and that's a lot of information.
Can any of you retain that much information
in your head at a time,
especially like technical information about code?
Of course not.
Can't do that.
To be an expert in that amount of information,
you're only gonna be an expert in a slice of it,
so when you're working with a large scale code base,
you can't be expected to know everything.
That makes it all the more important
to organize your code so you can think about it
from a higher-level perspective, like indexing.
So reasoning about your code base,
and navigating your code base.
If you're using an editor that doesn't allow you
to take a function that's being used
and go to its declaration, like don't talk to me,
like you have to be able to have the basic tools
on how to navigate your code base.
Now if you're using Vim, you're using the right editor.
The package-level and group-level documentation.
I know it's like pulling teeth sometimes
to try to get someone to document a function.
You need to pull those teeth even harder.
That class needs to be documented.
That component needs to be documented.
That library needs to be documented,
and if you have collections of libraries
like we do at Bloomberg called package groups,
that thing needs to be documented too.
Otherwise, it's gonna be hopeless
trying to figure out what you have in your code base,
so this is really important.
And the final thing here is levelization.
This is a way for you to understand
the dependencies of your code base.
So this is a levelization example
coming from the BDE libraries.
I'll explain what this means.
All of these things are part of a package group called BSL.
That's Bloomberg's implementation of the standard library,
and so each of these packages are in this package group,
and at the very bottom level, level one,
bslfwd and bsls, what these packages do is
they don't depend on anything else in this package group.
They're at the very lowest level,
and they don't depend on each other.
At the next level, bslscm,
this only depends on the stuff beneath it,
and so on and so forth, so if you're looking at a code base,
and you have a levelization like this and you just say,
okay, I really want to understand this thing.
What is the reading order?
Well you start at the lowest level,
read the documentation for bsls and bslfwd,
then you can look at the stuff at level two,
level three, level four, and so on and so forth.
You could even have documentation at the group level
which says, you know what?
You really, if you're looking to use this thing as an API,
you don't have to look at level five and below.
Just look at the stuff at six and above,
and those are good places to start.
This kind of documentation,
this kind of indexing of your code base,
it needs to start somewhere, so I highly recommend
if you're working with a large scale code base
that doesn't have this kind of information to add it,
and then you'll be able to work with it a little bit better.
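Computing a levelization like this is a small exercise once you have a dependency map. Here's a sketch; the package names and dependencies in the usage below are illustrative, not BSL's real dependency graph.

```python
def levelize(deps):
    """Assign levels: a package with no in-group dependencies is
    level 1; otherwise its level is one more than the highest level
    among its dependencies.  `deps` maps each package to the set of
    packages it depends on within the group (assumed acyclic)."""
    levels = {}
    def level_of(pkg):
        if pkg not in levels:
            below = deps.get(pkg, set())
            levels[pkg] = 1 if not below else 1 + max(level_of(d) for d in below)
        return levels[pkg]
    for pkg in deps:
        level_of(pkg)
    return levels
```

Reading order then falls out for free: sort the packages by level and read upward from level one.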
And by the way, Beezle.
We have, at Bloomberg, an open source thing
like a Beezle that's called BDE,
and if you want to see what really high standards look like,
check out BDE, and if you want to see,
nevermind, just compare it to Beezle,
and you'll see what I mean.
These are really fun.
So there's a story, an ancient story
about a tower called Babel,
and these people tried to make
like this tower that goes all the way to the heavens,
and God got upset 'cause they thought
they were so cool to build something like this,
and he made them all speak different languages.
They couldn't speak to each other,
and the tower eventually fell apart
and the people dispersed, and that's where we got languages.
That's how the story goes anyway.
Well these people did not have mnemonic methods.
If they had mnemonic methods, this would not have happened.
So I'm gonna explain what mnemonic methods are.
Here's an example of a factual statement about a code base.
I'll read it, quickly.
processControlResponse passes the request
to GatewayManager's processControlResponse,
which then calls addActiveRoute.
If d_authorizations_p doesn't have the request,
it is added to d_pendingRoutes and
control is returned to processControlResponse.
processControlResponse then checks pending routes,
and if necessary, updates d_authorizations_p
once this is done.
processControlResponse, and thus
addActiveRoute, is called again, but this time,
d_pendingRoutes is set and the task is accomplished.
Look, you can't retain this stuff in your head.
Like my head just wants to push it out
as soon as it comes in.
So mnemonic methods gives us a way to comprehend this stuff.
So I'm gonna tell you a story about a king.
There's a king, and he needs to get some kind of errand done
so he goes to the butler, and he says,
Butler, I've got to do an errand.
Butler goes to the manservant, says,
Manservant, go do the errand.
Manservant goes to the garage and alas, the car isn't there,
so the manservant goes back to the butler,
and he tells the butler what's going on,
and then the butler goes to the bathroom,
and he writes a note on the wall
saying, the van isn't in the garage.
Then the butler goes and sits down.
The king saw the butler sit down,
so he went to the bathroom and read the note on the wall,
and saw that the van wasn't in the garage,
so he got the van to go in the garage,
then the king went back, sat down,
and told the butler, go do the errand.
The butler then told the manservant to go do the errand,
and the manservant saw that the car was in the garage,
and did the errand, got it?
So who wrote the note on the wall in the bathroom?
The butler, right?
I mean, it's a crazy, ridiculous story, but we all get it.
You only hear it once and you remember it.
This is the way our brains work, right,
we were very good at understanding stories,
and we're very good at retaining stories
that have crazy information, and let me tell you,
it's hilarious watching people walk by
when you're having a heated discussion
about who's gonna write the note
on the bathroom wall.
But anyway, this allows us to be able to
use our human brains to be able to comprehend things
which are technically really complex, mnemonic methods.
So mnemonic methods facilitate comprehension
and communication of complex interaction for humans.
It works very well for humans.
The stranger the story, the easier it is to be retained,
and it must have a concrete mapping to code to be useful,
so for example, the king, the manservant, the butler,
they all correspond to very specific pieces of code
in that statement that I showed you earlier,
and then you can actually have
a decent discussion about these things.
You can actually find bugs this way.
So another important thing to do is to gauge difficulty.
You have to realize that when you're working with
a large code base and you make some kind of change,
that there are easy problems and there are hard problems.
What are the characteristics of the easy problems?
These are the ones that we can solve.
Strictly additive, if you just need to add a new class,
inherit from something else and specialize it, easy.
Pure functions, you know,
pure functions don't deal with global state,
no side effects, those are easy to work with.
Few interactions, simple semantics,
small components, and uniform dependents.
What I mean by that is everybody who's using your code
is using it in the same way.
That's great because that means
if you need to make some kind of change in interface,
it's very easy for you to go
and refactor all those users of it.
Hard problems involve reworking existing components,
things that have globals and side-effects,
many interactions, complex semantics.
By many interactions, I mean interactions between components
like you can't really understand this one
without seeing how it interacts with this one
and there's a big discussion about
how they all talk with each other.
Large components are hard to work with,
and diverse dependents.
This is the big deal.
If you have a piece of code and it's being used
in many different ways, it's much harder
to be able to refactor all of your users
to use a new interface, so just keep it in mind,
there's a spectrum of challenges,
and when someone proposes
some kind of change to the code base,
you see what kind of characteristics does it have,
because the stuff on the far, hard side,
we don't know how to do these things,
and the stuff on the easy side, we know for sure,
and there's all this range in between,
something to keep in mind.
Now measure instead of assuming.
Assumption is the mother of all mess ups,
so we've got to measure as a way to mitigate that problem.
What are the clients doing with your code?
Don't guess, look, measure, figure it out.
You've got the tools to do this, do it, be informed.
What's the impact of the code change gonna be?
This is something you can measure.
Instrument the code.
Like, a lot of times, you can just
figure out what the clients are doing
by looking at how they're calling your function,
but you don't know how often they're calling it.
You don't know what kind of constraints they have
in terms of like what's the load
on that particular function.
You can instrument your code
to answer these kinds of questions,
and then you can make more informed decisions.
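The talk's code base is C++, but the idea of instrumenting to measure rather than assume is easy to sketch in Python: a decorator that tallies how often each function is actually called, so "what's the load on this function?" gets a measured answer instead of a guess. The `parse_request` function is purely hypothetical.

```python
import collections
import functools

call_counts = collections.Counter()

def instrumented(fn):
    """Wrap `fn` so every call is tallied in `call_counts`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[fn.__qualname__] += 1
        return fn(*args, **kwargs)
    return wrapper

@instrumented
def parse_request(payload):
    """A stand-in for some heavily-used API entry point."""
    return payload.strip()
```

Dumping call_counts periodically, or on shutdown, gives you real usage data to prioritize with.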
And absolutely, choose your priorities wisely.
Gauge the difficulty of the problem,
and start with easy problems to build confidence.
If you're gonna start with some
major refactoring thing, don't do that.
Start with something easy.
Start with something to build your ability
to be able to make changes to a large code base
in a successful way.
What is not a success is starting something
and stopping halfway through and never restarting it again.
And of course, you've got to consider the business value
before embarking on a project.
Let's say there's some newfangled library
that everybody wants to use.
What's the cost of that gonna be to switch to that library?
And what business value does it create?
These are questions to be answered,
and of course, invest in the future.
Anything that you can do which is going to make
day-to-day life easier on a project, just do it,
especially for a large scale project,
if it's lasted this long, it'll probably last twice as long.
And I will say, and this is really important,
it's okay to give up and undo a started attempt.
If you start to do something, and it's,
you realize at some point,
it's gonna take longer than you thought,
or it's more complex than you thought,
it's okay to give up, but then, undo what you did,
and it's okay to do that even if
the cost of doing so is significant.
It's worth it in the scheme of things,
because having a halfway done thing
causes an awful lot more damage than starting something,
not being able to finish it,
and then going back to the starting point.
So, for people on my team,
they see me ask this question all the time,
any time something is suggested.
What's the migration path for that?
We tend not to ask these questions, right,
because we're developing new APIs from scratch.
There's no, like, migration path.
You build a new thing and then,
all the users come and switch to it, right.
But in reality, with the large projects
you have to have migration paths,
and you've gotta be able to answer that question.
So most stalled efforts have flawed migration paths.
Most of the time, nobody really thought about it.
A nice characteristic for a migration path to have
is to migrate a small change everywhere,
so if you have some grand vision of where you want to end up
if you can start with a small change,
migrate all your clients and everything to use it,
make another small change,
migrate all your clients to use it,
that process seems to have a lot more success,
because if you have to stop somewhere in the middle,
you've already added value to that point,
whereas if you build a new thing in isolation,
and then you slowly have all the clients
adapt to this really strangely new thing,
man that'll fail, and it fails very frequently.
I haven't seen many successes
with that kind of migration plan.
And one thing to keep in mind is
that your code transformations,
when you write these clang refactoring tools
that are specific to your project, you have to prove that
the semantics of the old meet the semantics of the new
because you can't really depend on everybody,
you know, running all your unit tests
to make sure there's not a bug in your code,
but you can assume they're gonna test
to make sure there's not a bug in their own code,
but maybe don't even assume that,
but you can mathematically prove your code transformations,
and if you do that, you're good to go.
And don't be afraid, I mean, I did use the word Math there.
Reasoning about code is something we do all the time.
You can take a look at a piece of code transformation
and you can show that this is indeed
equivalent to the original.
It's not that big of a deal.
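As a toy illustration of that kind of reasoning (hypothetical snippets, not from the talk): a transformation that rewrites a raw accumulation loop into std::accumulate can be argued equivalent step by step, since both perform the same left-to-right additions from the same initial value, and spot-checked besides.

```cpp
#include <numeric>
#include <vector>

// Old form: the code as it existed before the transformation.
int sumOld(const std::vector<int>& v)
{
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// New form: what the refactoring tool rewrites it to.
// Equivalent because std::accumulate performs the same
// left-to-right additions from the same initial value, 0.
int sumNew(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0);
}
```

The argument is short and mechanical, which is exactly what makes it practical to do for every transformation a tool applies.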
And of course, consider timeline
and changing business priorities.
Something that management says
is a really big deal right now,
maybe a two-year project:
"No, this is really big.
Are you committed to this?"
"Oh yeah, we're committed."
When you get one and a half years down the line,
all of a sudden there's a new big deal.
That's a risk, and if you're one and a half years
down a partial refactor, you've got a problem.
You're probably better off with the original
and not having started in the first place,
so this is something to keep in mind
when you're prioritizing: management's priorities will change,
so prefer things that you can do on a smaller timetable,
and that'll minimize the risk to a certain extent.
And here are some beautiful words:
I don't know how to make a migration path for that yet.
I like this because this introduces
a bit of humility into the mix,
and it also introduces hope.
Maybe there's some new technology.
Maybe there's some great idea that someone has
for making a big change in your large scale code base,
but they don't have a migration path.
You can just say, I don't know
how to make a migration path for that,
but maybe at some point in the future
we'll be able to figure it out,
but until then, we're not gonna do it, right.
Now, replacing a piece.
Whenever you refactor code, you're taking some piece
that existed and you're replacing it with something else.
How do we do this?
First thing, draw the borders.
What is your piece?
Is it a function that you're replacing?
Is it a component?
Is it several components?
Is it some combination of components?
You gotta figure out what the border is
around the thing that you're replacing.
Once you figure that out,
now you can fully and precisely comprehend
the semantics of the piece that you are replacing.
Using mnemonic methods really helps.
Build some kind of a mental model as to what this does.
Then you take this piece and you surround it with tests
verifying its existing functionality,
verifying your understanding of this code
and what it does, what the use cases are,
and these are unit tests,
integration tests, functional tests.
It's like the scientific method: form a model, then verify it.
and then you implement the replacement,
and wrap it with the old interface,
so all the code which is using the old thing doesn't change.
It's just using the new thing
wrapped with the old interface.
Then you can, finally, safely and completely,
adapt the old code to the new interface
and remove the old piece.
If you get to the end, you're done.
It was a success.
If you stopped or halted somewhere in the middle,
it was a failure, but that's how,
generally, you replace a piece.
It's really not rocket science.
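Here is a minimal sketch of the wrap step, with hypothetical names: the replacement is written against the interface you actually want, and the old interface becomes a thin wrapper over it until the callers are adapted.

```cpp
#include <optional>
#include <string>

// The replacement piece, with the interface we actually want.
// (parsePort and parse_port_legacy are hypothetical names.)
std::optional<int> parsePort(const std::string& text)
{
    if (text.empty()) return std::nullopt;
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return std::nullopt;
        value = value * 10 + (c - '0');
    }
    return value;
}

// The old interface, preserved so no caller has to change yet.
// It is now just a wrapper; -1 was its legacy error code.
int parse_port_legacy(const char* text)
{
    std::optional<int> result = parsePort(text ? text : "");
    return result ? *result : -1;
}
```

With the wrapper in place, callers can be moved to parsePort one at a time, and parse_port_legacy is deleted at the end.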
So what kind of code lends itself to
10, 20, 30-plus years of evolution?
You know, well, I'm just curious.
How many in the audience are working with a large code base
which they themself did not develop?
I should have asked the reverse question.
I don't believe you.
This is only gonna increase, right,
as the amount of C++ code grows,
and the amount of core infrastructure
this world is living on gets solidified,
most software developers are gonna be working on
large scale code bases which they did not develop.
We've got to get good at this, and we gotta write code
that's gonna be nice to the next person
who takes over this code base that you're working with.
So what kind of code lends itself to good evolution?
Consistent code does.
If you have coding standards,
and to the best extent you can, avoid abnormalities,
weird things in the code, clever little ideas.
These things hurt if you're trying to have
a long-term code base.
Organized and well-documented code
lends itself to large scale evolution.
Stuff that's cataloged.
If you use contracts consistently.
Code which is readable, so you're writing the code,
I mean, you know these rules.
You write the code for the reader, right,
not for the compiler and not for you.
The next person that's gonna read your code,
that's your audience.
Test code with good unit tests,
and integration tests, and functional tests
is a pleasure to refactor because you change something,
you rerun the unit tests and you figure out what broke.
That's extremely helpful.
Code which uses the right abstractions.
Now this is something that's interesting
because the computer scientist in us
wants to make the most generic thing possible,
but that may be more generic than what we really want,
so you want it to be generic enough to meet
all the use cases, but not so generic
that it encourages diversity.
You don't want to have a million different ways
to use your code if only a few will suffice.
It sounds good in the theoretical sense
to make it really generic, but in a practical sense
with a large-scale code base
that you want to evolve over time,
you don't want to have diversity of uses,
so ask yourself the question: how hard will it be to fix
the unforeseen design mistakes
which you are making right now?
You want the answer to be, not too hard.
Another important thing to do is to follow the pack.
Use industry best practices,
standard libraries and tools.
You know, we all want to innovate.
Innovation is fun,
but if you're working with a large-scale code base,
what's the value of innovating something?
That's a question you gotta ask.
Let's say you find that indenting three characters
is somehow vastly superior in some way
to indenting four characters, or two characters,
whatever happens to be the standard.
Is it really worth it for you to go against
the entire industry and do something strange?
I think not.
It's better to just forget about those little things
and if you're gonna innovate,
innovate in things that matter for your business use case.
So we gotta weigh those long-term costs.
And collaboration between companies is nice
and that's like what the standardization process is about.
If I'm using standard vector,
Google's using standard vector,
Facebook's using standard vector,
we all have vested interests in evolving this
to the future in the same way,
whereas if we each have our own little world that's isolated
we're not gonna be able to pool our resources.
So part four of this talk is about an observation,
a really simple observation, and that is
that C++, as a programming language, as a library,
has the same properties as a large piece of C++ software.
And I'm on the ISO committee, so this is pertinent.
The C++ standard has approximately 81,000 lines.
That's dinky, it's not extremely large,
although it is dense,
but there are many diverse dependents
of the C++ language and library, right,
so evolution's a big deal
when you have that many dependents.
So it has all the same characteristics of a large code base.
Successful, widely in use, many past contributors,
varying degrees of quality and maintainability,
mixed styles, hidden use-cases.
These always bite us every time we do a release, right,
when someone's like, hey,
I didn't realize this was an ABI breaking change,
and partial refactors, although I will say,
partial refactors, generally, on the C++ committee,
we're pretty good about not allowing
something half done into the working paper.
That part, I think we got down.
(From the audience, I heard "futures.")
I stand corrected, I stand corrected.
The same pitfalls apply.
Assumption, it happens all the time on the committee.
Someone assumes that everybody is using C++
the same way that they're using it.
Distraction, this is a big deal.
It's a large language.
There's a lot of things that could be tweaked and fixed,
but are we spending our time
on the things that are really important?
You know, are we spending our time
on things like modules, which are really important?
Neophilia, there are so many cool things going on
with the other languages out there.
Man, it's so new and cool.
Is it really worth it for us to be
incorporating these features into C++?
We gotta ask ourselves, without letting ourselves
get carried away with neophilia.
And lack of follow-through.
Yeah, this one,
this one happens but it doesn't really affect too much,
except for that when someone assumes that,
oh, this person is working on this other proposal.
I'm gonna assume that that thing's gonna get done,
and I'm gonna depend on that one.
That can sometimes bite us.
So my key questions for the C++ committee are these,
and these are bigger things.
How can we automate changes?
We change something in the C++ standard,
is there a way that we can automatically
update all of the users to the new thing?
We haven't been doing this, but I think we should.
How can we make the standard more approachable?
There's only like a handful of people that can understand
that crazy language they use in the standard.
I think this isn't good.
Like, it's good that it's precise,
but if it's not accessible, I think that hurts us.
What are the easy problems and what are the hard problems?
We gotta ask ourselves those questions.
How do we measure cost?
How can we prioritize appropriately?
And we've been doing some things
to try to help us prioritize better.
And what does the complete migration path look like?
We don't ask ourselves that very much.
And fostering industry collaboration,
and this actually does happen.
It's great to see companies get together
and work jointly on proposals.
So in conclusion,
when it comes to working with large code bases,
preexisting large code bases,
ugliness and everything in there,
I feel like we're really at the beginning.
We're not at the end.
There's way more that we don't know than what we do know,
and I think it's extremely important
that we as a community start talking about this,
and start trying to figure out ways
that we can migrate these old code bases.
We gotta stop pretending, like,
computer science is all about writing
a new, spiffy API from scratch.
It doesn't cut it now, and it's gonna cut it
even less in the future.
That concludes my talk.
I'll take questions now.
- [Questioner] Hey David, great talk again.
I have two questions for you.
First question is about replacing a module or a component.
You say that I should, like, keep an interface first
that's the same as before, and then trash it
because I can make whatever I want,
but what if the old interface makes no sense,
like completely bizarre behavior,
that I certainly do not want to reproduce in the new one?
So, some problems are harder than others.
You define that as a very hard problem.
I don't know.
I don't have like a magic answer to that problem.
You have to look at the cases,
and maybe the answer is trying a different approach,
but you just gotta figure it out.
- [Tony] Can I just throw in there?
Like, you were saying that,
you call the old, you mock the old API,
like call the old API, that's just temporary.
That's just to see that it's working,
and then you can carry it over and tear it apart.
Yeah, Tony says that you can mock the old API,
which is just temporary and then you can tear it apart,
but I think that the question was raised is
the old API is so ridiculous that
you wouldn't be able to make a new thing
and adapt it to the old API, so, yep.
- [Questioner] Second question.
I like that a lot of techniques are like
something I can approach on my own on the components I own,
but I think the first two or three slides
and first ideas or thing I should start with
are company-wide, like, deciding
for the standards or for the formatting, all that,
so would you agree to lend John to my company
for a few weeks so that we do that, or how can I do it?
Did you ask me if I could lend John
to your company for a few weeks?
- [Questioner] Yeah, I did.
Once he finishes his book.
- [Questioner] But on my level,
if I can make that decision alone.
So you have your realm of influence, right?
And that's just a reality of being a software developer.
You can only influence what you have,
but what influence you do have,
that's your area and that's where you can do things.
If you want to improve things company-wide,
you're gonna have to get more power,
or collaborate with other people
who have similar ideas and get it done.
I think you can make the business case that
this is good for the organization in the long term,
and that case has to be made and understood by managers.
If they're too short-sighted about these things,
then I mean, what's gonna happen?
You're gonna have a code base
which becomes unmaintainable at some point.
- [Questioner] Yeah, I got it, thank you.
- [Questioner2] Hey David, again, an awesome talk.
Is it possible that we can get the slides?
Absolutely, the slides will be put online
and will be accessible,
and there'll be a recording of the talk as well.
- [Questioner2] Awesome, so,
so you touched on like if we have a piece
and we want to modify the piece
or replace it with a different piece,
what if you have hundreds of pieces
and they're like spaghetti,
like they're tangled with each other
which probably happens in large code base,
how do we proceed?
Like, if you want to replace all that spaghetti
with brand new code, but we have partners
using that spaghetti code right now,
what should we do as a first step, for example?
So when you got spaghetti code,
you have to be able to understand it
to be able to work with it,
and that's where mnemonic methods really help.
It's a hard problem.
The best I can say is, try to use the techniques
to the best that you can to apply it,
but maybe we don't know yet how to solve
a certain level of difficulty in terms of problems.
Try something easier first.
- [Questioner2] Okay, thanks.
- [Questioner3] Thanks, really great talk, like always.
How would you address the counterargument
to reformatting code, which is that it breaks git blame?
That's what I keep running into.
Okay, well, a counterargument to that is
you can put a list of the commits
that you did your formatting with, and then
git blame can ignore those particular commits,
and then it's almost like it didn't happen.
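For reference, git supports this directly. The hash below is a placeholder, and the demo sets up a throwaway repository; in practice you would run this at the root of your real one:

```shell
# Throwaway demo repository (in practice: your actual repo root).
cd "$(mktemp -d)" && git init -q .

# List the full hashes of the bulk-reformatting commits, one per
# line.  The hash below is a placeholder, not a real commit.
printf '%s\n' 0123456789abcdef0123456789abcdef01234567 \
    > .git-blame-ignore-revs

# Make every `git blame` in this clone skip those commits:
git config blame.ignoreRevsFile .git-blame-ignore-revs

# Per-invocation alternative:
#   git blame --ignore-revs-file .git-blame-ignore-revs file.cpp
```

Committing the ignore-revs file alongside the reformat means every teammate gets blame-friendly history with one config line.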
- [Questioner4] I was curious about
kind of the large-scale reformatting refactoring,
how do you deal with that when you have a company that's
basically working on multiple branches at one time,
and they have to be able to merge
back and forth between them?
Okay, so you've got multiple branches
going on at the same time. I don't know
what kind of version control tool you guys use,
but one thing that works with git is
you can apply the patch to the branch and to master,
and then, when they come together, it'll be just fine.
There's not a real issue there.
In particular, when you clang-format everywhere,
something like that, if you have a branch,
you can apply the clang-format
on every single commit.
You're basically rebasing it,
but applying it to every single commit along the way,
and then, it just kind of works cleanly that way,
so that'd be one suggestion I'd give you.
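A sketch of that suggestion with git, using `rebase --exec` to rerun the formatter after each commit is replayed. The branch names are placeholders, and a no-op stands in for clang-format so the demo is self-contained:

```shell
# Throwaway demo repository (in practice: your real repo).
cd "$(mktemp -d)" && git init -q -b main .
git config user.email demo@example.com
git config user.name demo
printf 'int x;\n' > a.cpp && git add a.cpp && git commit -qm base
git checkout -qb feature
printf 'int y;\n' >> a.cpp && git commit -qam change

# Replay every commit of the branch, rerunning the formatter and
# amending each one.  The real formatter command would be, e.g.:
#   git ls-files "*.cpp" "*.h" | xargs clang-format -i
# Here `true` stands in for it so the demo needs no clang-format.
git rebase --force-rebase main \
    --exec 'true && git commit -a --amend --no-edit'
```

After this, every commit on the branch carries the same formatting as the reformatted mainline, so merges between them stay clean.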
- [Questioner5] Hi, could you give an example of
the refactoring that you did using the Python client API
because the last time I looked at using clang
to do refactoring on, the AST matchers weren't really there
as well on the Python API, and I'm kind of wondering
what sort of power you were able to get out of it?
Oh, okay, so the AST matchers, as far as I know,
are not in the Python API,
so we implemented our own similar kind of thing.
It's not that hard.
As long as you have access to the AST,
and you can navigate the nodes,
that pretty much gives you the base
of what you need to be able to do anything,
so if you start with a small refactor
and you write software capital,
you'll end up with some tools that you can use
for bigger refactors later, and eventually,
you'll have a code base which builds on top of clang
that you can use to do more complex
and sophisticated kind of refactors,
and you may end up implementing something like
that matcher stuff in Python on top of the
lower level Python API you got.
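As a sketch of what such a homegrown matcher layer might look like (all names here are hypothetical; real clang.cindex cursors expose kind, spelling, and get_children() similarly), a matcher is just a predicate, and combinators plus a recursive walk give you most of the power:

```python
# Toy stand-in for a clang.cindex cursor: real cursors expose
# .kind, .spelling, and .get_children() in much the same shape.
class Node:
    def __init__(self, kind, spelling="", children=()):
        self.kind = kind
        self.spelling = spelling
        self.children = list(children)

    def get_children(self):
        return self.children

# Matchers are predicates; combinators build bigger ones.
def kind_is(kind):
    return lambda node: node.kind == kind

def has_child(matcher):
    return lambda node: any(matcher(c) for c in node.get_children())

def all_of(*matchers):
    return lambda node: all(m(node) for m in matchers)

def find_all(root, matcher):
    """Walk the tree preorder, collecting nodes that match."""
    found = []

    def walk(node):
        if matcher(node):
            found.append(node)
        for child in node.get_children():
            walk(child)

    walk(root)
    return found
```

For example, `find_all(tu_root, all_of(kind_is("class_decl"), has_child(kind_is("method"))))` would collect classes that declare at least one method; swapping the toy Node for real cursors is mostly a matter of matching on cursor kinds.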
- [Questioner5] All right, thanks.
And then, open source it, as Tony says.
- [Questioner6] So some of us have code bases
which are subject to review by regulatory agencies,
and so, doing the incremental approach,
although it's the sane way to do it,
the cost overhead of going to a particular point
and then saying, we're going to go to
whatever government agency, please validate this
so that we can actually,
so that we could actually release it
is not really a tenable approach.
What would you recommend in terms of
having those factors in line with being able to get to
an improved code base, shall we say.
Okay, so if I understand it correctly,
you have like a library which you're developing,
and you can't validate it.
It's like very expensive and hard to validate it.
Is that right?
- [Questioner6] That would be the effect, yes.
Off the top of my head,
I can't think of something that would be
particularly helpful there.
I don't know how to solve that problem
except for politically, you know, to try to get,
oh, I see hands coming up,
so I'm gonna let the experts answer this.
- [John] This happens with Bloomberg all the time.
And so, what we have to do is
we have to batch up our changes
and then it has to be evaluated,
but you do have to have it evaluated,
so you simply do the refactoring
and the forward-moving work at the same time.
You hold the changes off to the side,
and you're gonna have them evaluated,
so you do the refactor evaluation
and the forward-moving evaluation
at the same time, that's it.
So John is saying that we do the refactoring evaluation
and the forward moving evaluation at the same time,
and kind of like get these things done in parallel.
- [John] Right, you have to follow the regulatory timeframe.
- [Questioner7] Okay, a comment:
you were talking about large code bases,
but the wisdom in the beginning is not new,
and it applies also to smaller code bases,
like the broken windows effect.
I'm not sure if you used that term,
but that was in The Pragmatic Programmer,
the broken windows effect, and exactly that happens
if you have small negligences
or you end up with your garage.
- [Questioner7] And for those who are not aware
of that book, get it, appreciate it.
If you are new to test automation and stuff like that,
get the Pragmatic Starter Kit and appreciate it,
even if you are just doing small stuff.
- [Questioner8] So on the previous point,
I was gonna add to that that
I think the major problem there is
you have some external force which is
making the cost of iteration high,
and that might be a government regulatory body,
it might be a platform regulation thing,
it might be that your code is on
millions of machines all across the world,
and you have to find a way to update them,
and not everyone wants to patch everyday,
so that's a kind of a major problem
in building capital, I guess.
Would you say anything about that?
All right, well if there's no more comments then we're done.