
So I'll be talking about arrays, and particularly about going beyond NumPy arrays. Where I'm coming from in this: I'm part of an NSF program to improve software used by high-energy physicists, so that's my charge in all of this; that's where it's coming from and what motivates it. But I think that what we'll be talking about here is of broader interest than that. However, since it is the motivation for the problem, I should at least start by talking about what it is that we're doing.

You've probably seen pictures like this - some people might have one as a laptop background or on a poster or something. It's a photo of subatomic particles traveling through a detector, and this one is from 1964.
A group from Berkeley scanned through a hundred thousand photos to find this one.
This one is special - it's the discovery of the Omega baryon, which was a
pivotal point in discovering quarks. So then, look at that - do you see it? Go through a hundred thousand photos that look more or less like this in order to find it. Well, it helps if you draw the lines on top. Each one of these pictures has to be interpreted: you see all those tracks on there, and most of them don't have anything to do with the event of interest, so it's a matter of finding the ones that do and interpreting them. And the interpretation is itself pretty complicated: this is an Omega decaying to a Xi, decaying to a Lambda, to two photons, and pions and protons - and you have to go through a hundred thousand photos to find these. So, in a nutshell, this is what particle physicists do even today: we collide particles, produce new ones, and take pictures of the decays so we can interpret them - what have we produced? Unfortunately, these
pictures come unlabeled. So here's a modern one - you look at that and say, oh yeah, I see something, maybe. Since the old days, things have basically just scaled up: it's the same kind of problem, but going from photographs to digital readout, the number of events is now at the trillion level; obviously it's no longer manual or semi-automated but fully automatic, even machine learning is involved, and the number of people involved is huge.
So how do you algorithmically find a decay such as that chain we started with? Well, it involves loops: you loop over all pairs of particle tracks, tentatively call them pi-plus and pi-minus - just assume that they are - and calculate the special-relativity mass they would have if they came from a single particle decay. If that mass is close to a known mass, they pile up: the ones that truly are that decay peak at the true mass, and the ones that aren't are just random junk, spread everywhere else. So that's it - you just do that all the way down the chain, such as Higgs to ZZ, Z to ee or Z to mu mu for the Higgs discovery, mmm, six years ago, and it shows up as a peak, a brand-new peak.
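As a rough sketch of that calculation (not the experiments' actual code; pt, eta, phi and the masses are the usual particle-physics coordinates, and the values here are made up), the special-relativity mass of a candidate pair looks like this:

    import numpy as np

    def invariant_mass(pt1, eta1, phi1, m1, pt2, eta2, phi2, m2):
        # convert (pt, eta, phi, mass) into Cartesian momentum and energy
        px1, py1, pz1 = pt1 * np.cos(phi1), pt1 * np.sin(phi1), pt1 * np.sinh(eta1)
        px2, py2, pz2 = pt2 * np.cos(phi2), pt2 * np.sin(phi2), pt2 * np.sinh(eta2)
        e1 = np.sqrt(px1**2 + py1**2 + pz1**2 + m1**2)
        e2 = np.sqrt(px2**2 + py2**2 + pz2**2 + m2**2)
        # mass of the hypothetical parent particle: m^2 = E^2 - |p|^2 (natural units)
        return np.sqrt((e1 + e2)**2 - (px1 + px2)**2
                       - (py1 + py2)**2 - (pz1 + pz2)**2)

    # pairs that really come from the same decay pile up near the parent's known mass;
    # random combinations spread out everywhere else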
Okay, so from a computational point of view - that was the motivation, now let's think about the computation. One of the great advantages we have is that all the events are independent: no particle from one decay will live long enough to be in the next decay, or at least it won't stay in the detector long enough - it flies away.
The detectors produce collections of different kinds of signals - tracks like we've seen, but also energy depositions, timing - and when we partially analyze these we put them in different buckets: these are the electrons, these are the muons, these are something else. Each of these is a variable-length collection - variable length per event. If we were to do this with SQL (and we don't), this is the mindset it would be: the decay candidates can be thought of as a self join if we're combining two things from the same collection, or as a cross join if we're combining things from different collections. But here's the kicker: it's always ON eventID1 = eventID2 - it's always within the same event - and that's basically why we don't use SQL; we'd be saying that with every single join. Then, after we've found candidates, the good candidates are identified by filtering - throwing away the bad - or by some sort of reduction, in the functional-programming sense, to find the best of what remains, like the maximum or the sum, something like that.
So here's sort of a diagram of how it goes: first we have some sort of per-event join, then we map some calculation across all of those candidates, we filter to find the OK ones, and then we reduce to find the best one, independently for each event.
Another key point - I can't say this enough - is that there's a large number of events, and each event contains an arbitrary number of record-like structures. So we'll have, say, a muon record that has values of various attributes - pt, eta, phi - and we'll have several of them, and this is not reducible to a single flat table without some sort of padding or loss. If we tried to turn this into a single table, then we would have to say: mu1 pt, mu2 pt - and if there isn't a mu2, well, you've filled it with NA, and if there is a mu3, well, you've lost it. So we don't want to do that lossy transformation until maybe the very last stage of some particular analysis.
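A tiny made-up illustration of that jaggedness (the numbers and field names are invented, not real data):

    # each event holds a variable-length list of muon records
    events = [
        [{"pt": 54.3, "eta": 0.2, "phi": 1.1}, {"pt": 31.0, "eta": -1.4, "phi": 2.9}],
        [],                                          # an event with no muons at all
        [{"pt": 17.8, "eta": 0.9, "phi": -0.5},
         {"pt": 12.1, "eta": 2.0, "phi": 0.3},
         {"pt": 8.6, "eta": -0.7, "phi": 1.7}],
    ]
    # a flat table with columns mu1_pt, mu2_pt would pad the second event with NA
    # and silently drop the third muon of the last event - padding or loss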
And then the programming landscape is, I would say, rather conservative: it wasn't until the late 90s / early 2000s that we switched from Fortran to C++, and only right now are we starting to use Python seriously. In fact, you can see this transition happening right now. I did some GitHub statistics: I selected users who are definitely physicists, because they forked a particular repository, and then I looked at the languages of choice for their other repositories - the ones they didn't fork. Physicists started using GitHub around 2013, and to a large degree it's slowly been transitioning from primarily C and C++ projects to more Python and Jupyter, and this year is the turnover point.
So in our field, conventional analysis means writing loops over objects allocated on the heap, in C++. This is what finding the two pions to find the kaon looks like - it's just C++ loops, filling std::vector and such - and a direct translation to Python would be a performance disaster. In many cases this is exactly what people do, and it is a performance disaster, because the Python objects are doing a lot of string comparisons and such.
So for high-performance computing in Python, usually the first thing you think of is: well, we should use NumPy. NumPy, if you're not familiar, is a library - it's almost a standard library; almost everything uses it, and if you've ever used or heard of Pandas, Pandas uses it heavily. The objects in the Python world are really just pointers to big buffers, and the Python objects interpret those buffers as arrays. NumPy comes with a suite of operations that apply to whole arrays. So if you want to compute something like pz = pt * sinh(eta) - just something you might want to compute - you can put individual numbers in there, one pt and one eta, one number each, and it'll compute that and give you one number back; or you can give it a whole array of pts and a whole array of etas, and it'll give you a whole array of pz back. And that's where you get the speed in Python: you're not executing very many Python statements and you're not using very many Python objects - most of the work is a compiled, and probably vectorized, loop in C.
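A minimal sketch of that (array names and values invented for illustration):

    import numpy as np

    pt = np.array([54.3, 31.0, 17.8])    # transverse momenta (made-up values)
    eta = np.array([0.2, -1.4, 0.9])     # pseudorapidities (made-up values)

    pz = pt * np.sinh(eta)               # one whole-array expression: no Python loop,
                                         # the work happens in compiled, vectorized C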
Another neat thing about NumPy is that the object that interprets the array can change, leaving the array in place - without changing the array at all. Just by changing the shape and strides parameters, you can do all sorts of crazy slices of these rectangular arrays without actually touching the data - the big data, at least.
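For example (a small sketch, values arbitrary):

    import numpy as np

    a = np.arange(12)            # one buffer of 12 numbers
    b = a.reshape(3, 4)          # same buffer, interpreted as a 3x4 matrix
    c = b[:, ::2]                # every other column: same buffer, different strides
    d = b.T                      # transpose: again only shape/strides change

    # none of these copied the data; writing through one view is visible in the others
    c[0, 0] = 999
    print(a[0])                  # 999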
But NumPy doesn't have anything for unequal-length lists, and this is fundamental to our field - it goes all the way back to the 60s, so we can't get away from it. We need something: we don't have anything in NumPy that could take a whole lot of events, each with different-sized collections, and do something like find all the pairs. There are libraries that represent arrays of unequal-length lists, and this is becoming more and more popular - I've noticed that this used to be just one library and now there are more for dealing with it, so that's a good sign.
And if you're thinking in terms of traditional data structures, where you have a C++ class or struct and then std::vectors of them, it can be surprising to find that you can take those objects and actually slice them the other way: instead of making a bunch of objects this way, you somehow slice them that way, if that makes any sense. Looking at the diagram here, we can slice them in such a way that we make an array consisting of just one field - all the values for that one field, all the way across - and the same thing for the next field, and the next. This has lost all of the structure - which of those values belongs to which particle in which event - so then you can have another array to represent that structure.
For instance, the simplest thing to do would be to just have the counts, and the counts say: the first three of those values belong to the first event, the next one value belongs to the next event, the next one value belongs to the next event, and then two values in the last event. That's nice, except it's not random access - so you can cumulatively sum it to get offsets, and then you can jump to any event you want. Furthermore, those offsets have folded together the starting positions and the stopping positions, and sometimes it's useful to separate those out into two separate arrays. And sometimes it's useful to have, instead of pointers down from the parents to the children, pointers up from the children to the parents - which, if you're in the SQL world, is like a normalized foreign key. That's useful for some operations, and others of these are useful for other operations. So a lot of what these libraries do - and a lot of what I just showed you - is just representing the jagged data in a columnar way.
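A small sketch of those representations in plain NumPy (the counts match the little example above; everything else is derived from them):

    import numpy as np

    content = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7])   # all values, across all events
    counts  = np.array([3, 1, 1, 2])                           # how many belong to each event

    offsets = np.concatenate([[0], np.cumsum(counts)])         # [0 3 4 5 7] - random access
    starts, stops = offsets[:-1], offsets[1:]                  # the same thing, split in two

    parents = np.repeat(np.arange(len(counts)), counts)        # [0 0 0 1 2 3 3] - child -> event
    # event i is content[starts[i]:stops[i]]; e.g. event 0 is [1.1, 2.2, 3.3]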
But now let's talk about manipulations. We'll start with a simple one, just to illustrate the kind of power we now have that we didn't have with non-columnar data. Let's say the thing you want to do is remove the first muon from each event - remove the first particle from each event; you just want to drop it and keep the rest. Well, if you have traditional data structures, that means you have to rewrite those lists. But if you have a columnar data structure - particularly if it's starts and stops - you see that you don't have to touch any of the content: for this particular operation, you can just change where each list begins. Other operations do other tricks, but the point is that we've got a whole bunch of new tricks available if we have the data represented this way.
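Continuing the NumPy sketch above, dropping the first particle of every event only needs the starts to move (a sketch of the idea, not the library's actual implementation):

    # advance each list's start by one; the content array is untouched
    new_starts = np.minimum(starts + 1, stops)
    new_stops  = stops
    # event 0 is now content[1:3] = [2.2, 3.3]; one-particle events become empty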
And a really nice thing about this: what if, of all the attributes corresponding to my particle - let's say I have a hundred of them - a given physics analysis is going to touch at most ten, and another analyzer is going to touch a different ten? Well, if you additionally make those content arrays lazy-loading, then you can do all sorts of manipulations on your particles without even reading them from disk. So, properties of columnar data - I didn't prove it, but these are fully composable: you could, for instance, say that a jagged array's content is not just flat numbers but another jagged array, and you can keep doing that all the way down, in arbitrarily complex ways. In addition, you can do more than just jaggedness: non-trivial data types like heterogeneous lists - you might be thinking of a tagged union - nullable types, pointers. The one thing it has to be is statically typeable: it can't be completely dynamic data; you have to know ahead of time what the type is. And most operations, beyond just the slicing example I showed, can modify the structure - the counts, the offsets, the starts and stops, the parents, the things I showed you in red - without touching the content.
So Awkward Array is a library that does this: it provides operations on columnar data with a NumPy-like interface. You could take, for instance, that monstrosity - that data looks terrible: it's got all sorts of Nones in it, it's got the 4.4 and then the bracketed 5.5 breaking the pattern, it has some tables in it that I'm representing with JSON-like dictionaries - and you can ingest it into a columnar array: following some recursive rules, it's turned into columns. And then you can do NumPy-like things on it. If you're familiar with NumPy syntax, the way you would slice the second but not the first dimension of an array is to put a colon for the first dimension and then, say, -2: to mean the last two of the next dimension, and it'll just do that - take the last two of each inner list. And you can apply array-at-a-time operations: you could take the sinh of everything, or add a hundred to everything, or any of the NumPy ufuncs - they just pass down through these nested structures and apply at the lowest level. So here I'm showing the original data structure with everything plus 100.
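A sketch of that kind of usage with the present-day Awkward Array API (the names differ from the version shown in the talk, and the data here is made up, not the slide's example):

    import awkward as ak

    data = ak.from_iter([
        [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
        [],
        [{"x": 3.3, "y": [1, 2, 3]}, None, {"x": 4.4, "y": [1, 2, 3, 4]}],
    ])

    print(data[:, -2:])      # last two elements of each inner list, NumPy-style slicing
    print(data["x"] + 100)   # an array-at-a-time operation applied at the lowest level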
The layout of this - what is it actually, in memory? It's a bunch of Python classes that contain NumPy arrays. You can see the depth of this structure: a jagged array, an indexed masked array for the masking, a union array for the heterogeneous lists - all these different kinds of structures - but at the leaves, at the bottom of all of this, it's just NumPy arrays, and in that sense it's columnar.
Data types: we want to be able to describe data types, and the way we're actually describing them is to say, well, array indexing is like a function - you have square brackets instead of round brackets, but you can put something in there, and if it's a number you get an element, and if it's a string you get a column. So then you can talk about the data type like a functional data type - this is intentionally made to look like a Haskell function type, with this arrow, that arrow, that arrow - to show what happens if you put in a number here, what kind of structure you get; put in a number there, what kind of structure you get. You get a tree, in general.
And here's a point that is really an aside, but I want to take advantage of my time at Strange Loop with people who know these things. We've found that arrays are functions, in general, and array indexing is function composition - and in fact it's associative, just like function composition - and that's been really useful in our work. If anyone can tell me more afterward, like whether there's any literature on this, I'd really like to know; this is just something we discovered by tinkering, and I bet there's a mathematical theory of it that I want to learn. Anyway, that's an aside - it's way too much detail.
So let's get back to the main point: what kinds of things can you use this for? Let's look at some real data - and some non-particle-physics data first, a complete change, far afield, somewhere way out there - let's consider space data. Look at the exoplanets dataset. I like this dataset because it's naturally jagged: a star can have more than one planet. We ingest it from a JSON file, because a JSON file is a rather natural way to represent these structures, and let's look at just one of them - or two of them, I guess. You can see that among the fields in this data, one of the fields is "planets", which is a list of more objects. So this is what the data means, but the data is in a columnar representation. This data type is easier to read than the monstrosity I showed earlier: if you put in a number less than 2900-something, you get a star, and a star has all of those fields; if you put in the field name for right ascension or distance or mass, you get the star's right ascension or distance or mass, and those are numbers; or if you put in the string "planets", you get a list of objects that have their own properties. So in well-behaved cases - not ridiculous cases - this is kind of readable.
cases this is kind of readable some of the numpy ash things you might want to
do you might want to just select all the star masses because you're going to
compute something you might want to select all the planet masses now look at
this that's that's a jagged or it has a nested structure some of these like the
the second-to-last has three items in it because there are three planets in the
second-last in this data set and we have sort of projected through the planets
structure to pick out just the mass so we look at the the data types for this
the star type is rather simple you put an integer you get a nullable
float some of them are none you know some of them missing you put in stars of
planets of mass and you get this jagged array of numbers you put in a number
less than this out of them you put in a number who knows because it's jagged and
you get a null will float so this projection slices through rows instead
of slit sorry slices through columns instead of slicing through rows which is
a very natural thing if it's rectangular tabular data but you can also do it in a
jacket case and another nice mathematical property of this is that
Another nice mathematical property of this is that putting strings in square brackets and putting numbers in square brackets commute - you can switch the order of the strings and the numbers. You can't switch the order of a number with a number, or a string with a string, but this property is really natural: if you had a rectangular table, you could specify the column before the row or the row before the column and expect to get the same thing in each case, and it generalizes to the jagged case.
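A sketch of that projection and of the commutation property, again with the present-day API and made-up star records:

    import awkward as ak

    stars = ak.from_iter([
        {"mass": 1.0, "planets": [{"mass": 0.01}, {"mass": 0.02}]},
        {"mass": 2.1, "planets": [{"mass": 0.05}]},
    ])

    print(stars["planets"]["mass"])     # jagged planet masses: [[0.01, 0.02], [0.05]]
    # string and integer indexing commute:
    print(stars["planets"][1]["mass"])  # [0.05]
    print(stars[1]["planets"]["mass"])  # the same thing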
You can represent this data with Pandas. Maybe somebody who's really into data science and has been using Pandas for everything would say, well, I could probably represent this exoplanets dataset in Pandas - and yes, that's true: you could take the jaggedness I had been representing with the square brackets and put it into the index. You can see, once we've turned it into a Pandas DataFrame, that the second-to-last star has three planets, because there are entries in a sub-entry column at the left. In fact it has some nice properties: jaggedness in the awkward arrays goes to MultiIndex rows, and nested tables in the awkward array go to MultiIndex columns. That's kind of neat.
So why don't we just use Pandas? Well, we often have to deal with differently jagged collections: we'll have two electrons and one muon, or fifteen jets - different types of particles in the same event - and if we have differently jagged things (my mouse is here), we can project out and get a table for either one of them, but we can't get a table for both of them, because a DataFrame can have only one index. So if we tried to live in that world, we'd find that we would be merging all the time - we'd be doing nothing but merging. There are other packages, as I said earlier, that can represent jagged arrays, and in some cases, like Apache Arrow and xnd, they have very complete type systems - xnd even has pointers, which is nice. In fact our own software, the ROOT package, goes back to the late 90s, when we were making this C++ transition, and it represents all of the data as jagged columns on disk - so it was like we were almost there - but when it loads the data into memory, it turns it into C++ objects. Here we're just going one step further and leaving it in that jagged form while it's in memory.
What I'm trying to say here is that, with the Awkward Array package, we're not trying to replace those other packages; in fact we want to interact with them very well - we want to transform back and forth to all these different formats. Our focus is not input/output, or even numerical math as it is in the case of xnd; it's the structure manipulation that we have to deal with all the time. We couldn't, in fact, use any of those libraries as-is. So, some examples.
This was the example I started with: say you want to compute masses of particles, like A goes to x y, or B goes to x x. Traditionally you have to write loops, and the structure of those loops depends on whether you're picking from the same collection - where you don't want duplicates - or not. As an array manipulation, we now have primitives for this, because we're doing it all the time: we want to be able to say dataset x cross dataset y to get the cross join per event, or dataset x choose two to pick two without replacement. So I've just replaced those loops with function calls. Those function calls are themselves implemented with a bunch of NumPy tricks - not to study this deeply, but at no point in here is there a for loop; it's all just NumPy, NumPy, NumPy. And even the choose operation - I'm flabbergasted that there's a vectorized, pure-NumPy solution to it, and in fact it's limited by the fact that algebraically you can't solve the quintic, which I think is hilarious; it's my favorite software limitation. If we wanted to do a choose with size bigger than five, we'd have to write loops.
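A sketch with the present-day API, in which the per-event cross join and "choose two" are spelled ak.cartesian and ak.combinations (the arrays here are made up):

    import awkward as ak

    xs = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    ys = ak.Array([[9.0], [8.0, 7.0], [6.0]])

    print(ak.cartesian([xs, ys]))     # per-event cross join: all (x, y) pairs
    print(ak.combinations(xs, 2))     # per-event "choose 2": pairs without replacement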
Other examples: filtering - I said that was the next step you want to do. Let's say we want to filter these particles. Well, now you have a choice: do you want to remove particles from events, or do you want to remove events? There have to be different ways of expressing that, and with loops there are different ways you'd write it out longhand. Just as a byproduct of generalizing NumPy's rules, we get these cute little expressions like this, where you put a mask in square brackets, and that mask is made by "greater than cut" or "max greater than cut" - the first one cutting particles, the second one cutting events. If you're familiar with NumPy, you've probably made arrays of booleans and passed them inside the square brackets to select elements of the array. We can do that here, even if the elements are variable-length things, and we can also do it if we have a jagged mask - it has the natural extension. And in the case here where we said "max greater than cut", that max is a reduction: it reduces the jagged array to a flat array, which is then a boolean per event - it comes out in a nice way.
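Roughly, with the present-day API (the array and the cut value are invented):

    import awkward as ak

    a = ak.Array([[1.1, 2.2, 3.3], [0.5], [4.4, 5.5]])
    cut = 3.0

    print(a[a > cut])                     # jagged mask: drops particles below the cut
    print(a[ak.max(a, axis=1) > cut])     # flat mask from a reduction: drops whole events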
Similarly - I think I'm going to hurry along now - the reduction part of the problem is simplified in the same way, by the fact that we're just generalizing what NumPy already does to the jagged case. Physicists come up with much harder problems than these - these examples are deliberately simplified - but here's an example problem statement, and looking at a statement like this (I know you're not going to read it all) makes us think that maybe we want a domain-specific language. So here's an attempt at an SQL-like domain-specific language for trying to express some of these really complicated selections. On the other hand, when you do it in loops, the way people are doing it now, it's pages - and that's what currently makes research hard.
This is not a hypothetical package used by only a couple of people; it's been out in the wild for a year now. These are our pip download statistics, where we track how people are using it, with the selection that they're on Scientific Linux systems - which means that they are physicists. So, picking out just the physicists: Awkward is this purple line; it was made a dependency of this red line, so people are bringing in that package just to read their files and getting the awkward arrays along for the ride. But within this world - within this small world - there are as many downloads of this as Matplotlib or Pandas.
We now have a lot of these composable pieces, and we just compose them to get all sorts of different features. So what do we want to do now, now that it's successful? We want to rewrite it all. We've gotten a lot of feedback from the users, and there are some design changes we want; we also want to fix some design mistakes that we're learning about from that feedback. We want a C++ version of this so it can interface with existing code, a lot of which is written in C++. And we want to be able to use Numba, because it would be fantastic if people could write Python expressions, just compile them, and have those compiled functions be aware of these structures. But if we're imagining writing three versions of this - one NumPy, one C++, one Numba - that's beyond our abilities. So we have an idea of building it in layers, with a substrate written in C, where the actual algorithms go, then C++ or Numba, where you have the memory management, and then pybind11 into Python. These are our plans; this is what we're working on now.
Finally: is this only for particle physics? I would say no - otherwise I wouldn't even come here and tell you about it. The existence of Arrow and xnd, and the momentum behind those projects, suggests that it's not. That's for the data structures - but how could the workflow be domain-specific? We have found another example, at least in the sciences, where this join-map-filter-reduce workflow is highly relevant: genetics. We found this out because we found that Zarr - however you pronounce that - the Zarr library has ragged arrays, and the reason they need them is for genetics.
And then finally, I've been looking at this simdjson package, which is awesome - it parses JSON super fast. If we could couple this with fromiter and turn the JSON into columnar data - how many JSON datasets are out there, maybe log data or something like that, semi-regular, that you'd want to ingest into a much smaller form and perform much faster operations on? So I leave that out to you, if anyone has any ideas.
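A rough sketch of that ingestion idea, with the standard-library json parser standing in for simdjson (the file name is made up):

    import json
    import awkward as ak

    # parse semi-structured JSON records and store them as columnar, jagged arrays
    with open("logs.json") as f:
        records = [json.loads(line) for line in f]   # one JSON object per line

    data = ak.from_iter(records)   # recursively discovers the structure and builds columns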
And finally, a lot of people are contributing to this: we're getting functions written by users, and I've had a lot of great conversations with the Arrow and xnd teams and the Zarr team, so I want to acknowledge them. That's it - thanks.