00:00:00.060
Good evening, welcome to my talk about parsing Icecast streams. Before I get
00:00:08.280
into that, I want to give you some context about why this work was done.
00:00:14.599
This is mx3, which is a website for Swiss musicians. It's the
00:00:21.210
meeting point between artists, radios, fans, record labels, and so on. And one
00:00:31.619
reason radio stations use it is that they count on it to
00:00:37.050
discover new talent, local talent, national talent. In fact, several of the
00:00:43.469
radios were co-founders of the platform back in 2006,
00:00:50.399
and they're still partners. You can see here that this specific track has been played on the radio.
00:01:03.170
Employees of the radio stations have an account and can come to the platform and manually enter that something has
00:01:09.979
been played by them, but of course this is a tedious process, so most of it is
00:01:15.210
automated. And this automation was done
00:01:20.490
for a very long time by a third-party partner. They ran a system
00:01:30.240
closer to the broadcasting software. So, as far as I know, something involving a
00:01:36.290
Windows file share, where the broadcasting system wrote text
00:01:42.090
files with the current metadata; then cron jobs running every second, firing
00:01:48.360
Perl scripts that were, I think, writing to the database; then another Perl script
00:01:53.549
kind of matching the data; like, a whole thing. Needless to say, at some point they did not want to
00:02:00.479
maintain it anymore and wanted to get rid of it at the end of last year, which kind of
00:02:06.960
triggered my research.
00:02:12.290
And it was also limiting before, because
00:02:18.540
it was kind of out of our control. This system would get the artist and
00:02:23.940
track data from mx3, from the API, then would do the matching there on its
00:02:30.720
own system, then push the on-air events back to an extra mx3 API.
00:02:35.750
But yeah, this can be simplified. So I tried to look for replacements. I think
00:02:42.989
there's a commercial API that provides real-time data for these kinds
00:02:48.269
of things; it was not viable for our use case. And you might be thinking, of
00:02:56.130
course, that the first approach was: oh yeah, why don't we
00:03:03.840
put something on their servers too, something like the system that was there before? Yeah, it's not possible; it's too hard to
00:03:10.980
provision, to deploy, to do anything. Then I realized: every radio also has an
00:03:17.280
internet stream, right? You can go to a website, click the button, and you hear it. And they're mostly using one of
00:03:25.859
those solutions: SHOUTcast is the commercial solution that was launched
00:03:30.870
sometime in '98, '99, something like this, and Icecast is an open-source
00:03:37.290
implementation. And most radio stations are using Icecast, especially, as you might
00:03:43.109
imagine, the smaller ones, because they don't have to pay for commercial licensing and so on. And yeah, Icecast is
00:03:54.810
kind of like the hidden API for getting metadata about currently playing stuff on the radio. I looked at libraries:
00:04:03.319
there's one for Ruby that I checked out. It
00:04:09.090
was, well, for me, not production ready: there are some problems with the encoding of the metadata, it
00:04:17.010
uses threads in a weird way; overall it seemed like it was written by someone who's not
00:04:23.440
familiar with Ruby. And if I have to fix it myself anyway, I thought, yeah, might as well
00:04:31.290
research other solutions, or write it myself from scratch. I looked at
00:04:38.290
JavaScript libraries; there are a few that helped me a bit to understand the protocol, but in the end, reusing a
00:04:47.530
JavaScript library was also a no-go because of deployment things. mx3 is a Ruby on Rails application, so everything was set up in
00:04:54.940
a different way, and introducing this complexity was not warranted. So I did
00:05:01.450
the next best thing, which is rewrite it from scratch. This is the git history of
00:05:06.760
my prototype, and this prototype went into production, as all prototypes do
00:05:13.900
eventually, even though they shouldn't. But you can see that at some point I decided:
00:05:20.110
oh yeah, this is good enough, maybe we should have tests and handle edge cases and so on. And then, yeah, after, you
00:05:27.760
know, trying to commit to something... So this was a separate repository, right? Then I kind of embedded this into the
00:05:34.630
Rails application, and it's been running in production mostly unmodified for about
00:05:39.640
six to eight months, so it was a pretty good prototype. I'll show you the code; some of it is, you know, not
00:05:45.910
production ready, but proof-of-concept ready. And yeah, ship early and look
00:05:52.900
for errors, right? So, one aspect, before we get into it, one aspect
00:05:59.650
was how to get test data. Of course, you can run the thing against live radio
00:06:05.500
streams, but if you have automated tests, as this library has, you want to get
00:06:12.700
something stable, something that doesn't depend on an internet connection, that you can put on CI and so on. So what does this
00:06:21.370
script do? It just sends an HTTP request to a radio stream, and then we get the first
00:06:28.270
however many kilobytes we want and put them in a file.
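That capture script might look something like this sketch (the method names and the fixture path are my own, not necessarily the talk's code):

```ruby
require "net/http"
require "uri"

# Accumulate at most `limit` bytes from an enumerable of chunks.
def take_bytes(chunks, limit)
  buffer = +"".b
  chunks.each do |chunk|
    buffer << chunk
    break if buffer.bytesize >= limit
  end
  buffer.byteslice(0, limit)
end

# Request a live stream and store its first bytes as a stable test fixture.
def capture_fixture(url, bytes: 20_000, path: "test/fixtures/stream.bin")
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    request = Net::HTTP::Get.new(uri)
    request["Icy-MetaData"] = "1"  # ask the server to interleave metadata
    http.request(request) do |response|
      File.binwrite(path, take_bytes(response.enum_for(:read_body), bytes))
    end
  end
end
```

You'd run something like `capture_fixture("http://radio.example/stream")` once (a hypothetical URL) and commit the resulting file.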
00:06:33.650
You can even add it to your repository; you know, 20 KB is not that big of a file size, right? And this allows you to get
00:06:40.340
feedback quickly. You know, the test suite, yes, there are not that many tests, but
00:06:45.560
they are higher-level tests, and they gave me enough confidence to refactor and, yeah, handle edge
00:06:53.449
cases and so on. And the good thing with writing this as a separate library,
00:07:00.530
a separate repository outside of the application, was that it was completely decoupled from Rails. So once I put it back
00:07:09.169
into Rails, well, for one, the test suite continued to be really fast, but also
00:07:15.020
it could be open sourced; I have considered open sourcing this, and I could, but I
00:07:21.440
don't really want to maintain it. I'm using it for my own purpose; if someone finds it valuable, why not. But the point
00:07:30.770
is to have a solid system for my use case. It looks like this, because
00:07:39.229
I don't see this in many codebases, I just wanted to give an example of
00:07:44.930
how, you know, to not load Rails in tests. Usually people have
00:07:50.360
something like a Rails test helper, but you can have a test helper that is suitable for plain unit tests. So you load
00:07:57.080
your test framework, whichever it is, and maybe have some way of requiring
00:08:04.780
files relative to the project root in an easy way. And then you
00:08:10.789
require your test helper, the library file, and the file under test.
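A minimal Rails-free helper in that spirit could look like this (the file layout and the `require_from_root` name are assumptions, and Minitest stands in for "whichever test framework"):

```ruby
# test/test_helper.rb: a sketch of a helper for plain unit tests that
# never loads Rails.
require "minitest/autorun"

# The project root, resolved relative to this helper file.
PROJECT_ROOT = File.expand_path("..", __dir__)

# Require a file relative to the project root, e.g.:
#   require_from_root "lib/icecast/client"   (hypothetical path)
def require_from_root(path)
  require File.join(PROJECT_ROOT, path)
end
```

Each unit-test file then requires only this helper plus the file under test, so nothing from Rails ever gets loaded.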
00:08:17.659
And this is it, and it's really fast. For example, I have "run the current test"
00:08:23.630
hooked up to a key in my editor, and I get, you know, a response in some hundred
00:08:31.610
milliseconds, which is cool. Right, let's get into the protocol
00:08:38.219
itself. So Icecast is a client-server
00:08:45.589
kind of protocol: you make one request to the server, and by default you get only
00:08:51.839
audio data. But you can send it an HTTP header, this magical Icy-MetaData
00:08:57.720
header, set to one. It's called the ICY protocol, I think, for Icecast or
00:09:04.139
something. And then in response you get some headers as well, so for example the bitrate, some other metadata
00:09:13.829
that I'm actually not using, and then this is the important one: the
00:09:20.279
metadata interval. I'll get to it shortly. So these are the headers, and
00:09:26.550
then in the response body you get binary data. The binary data you get is a block
00:09:34.439
of audio data, then a byte which says how much metadata is about to follow,
00:09:42.019
then you get the metadata, which looks something like this: you always
00:09:47.129
have StreamTitle, equals, quotes, and then the name of whatever is passing
00:09:53.430
through. But then afterwards you get new
00:09:59.000
audio data, and so on and so forth. One thing to note is that the metadata
00:10:05.639
is not necessarily there, so this length byte could just contain the value 0,
00:10:11.519
meaning that it's not every frame, not every interval, that you get metadata; usually you get some on
00:10:19.439
the change of a song, so every three minutes or so.
00:10:26.759
So I looked at how to get the stream in Ruby. I am not one
00:10:31.920
for including, you know, extra gems, libraries, stuff like that. I don't like
00:10:37.680
projects where you have six different networking libraries, two that have slightly improved interfaces on top of
00:10:45.269
that, but they're all used to do the same thing. So I wanted to keep it lean, and looked at what was in Ruby's standard
00:10:50.320
library, and the API is not great, but it works. In this example, though, it
00:10:57.130
seems you are doing a request and you get
00:11:02.560
a finite response, and we don't want this: with an internet radio stream you get a, you know, always-running
00:11:10.209
request, or sorry, response, unless the network is down or something. So you can do
00:11:19.029
streaming, you can handle streaming responses with this read_body method, and
00:11:29.470
you get chunks. What I did not like about this is that you don't really have
00:11:35.250
control over how much you read directly. So I used a socket instead, because
00:11:52.290
the TCPSocket interface in Ruby provides the IO API, which allows you to
00:12:00.250
read and write exact numbers of bytes. And yes, there's a bit more bookkeeping going on for some
00:12:07.870
things, but for these precise things in the protocol, where you want to read this number of bytes, then read this byte, and
00:12:14.080
so on, it made things a bit easier. And again, as I said, I used this in production
00:12:19.270
for six to eight months. So, the first thing to do: you send some headers.
00:12:26.230
This is, of course... this Connection class is just a small
00:12:34.209
abstraction over the socket to append newlines.
00:12:41.730
So, I'm sure you're familiar with HTTP: you request a certain path,
00:12:48.459
with the host, then you send this header, the Icy-MetaData header set to one, then I
00:12:57.129
guess two empty lines. Well, this is one empty line, and then because the write_line
00:13:03.069
method adds another, you end your request. Kind of; yeah, maybe I'm
00:13:09.370
getting this wrong. Anyhow.
00:13:18.550
Then you wait for a response, and you get the HTTP status line in response, which says, you know, HTTP version whatever, 200
00:13:24.850
OK. So this is from the very first commit, so here I did not even
00:13:30.399
bother checking for: am I getting a bad HTTP status, which version, this kind of stuff. I just assumed, you know, if
00:13:37.319
this public radio has an internet stream, they're probably doing it right. I wanted
00:13:44.319
to get the whole program done first. So then you read some
00:13:49.839
headers, split by colons, so you can put them in a hash and read the one
00:13:57.069
you're interested in, which is the icy-metaint header.
00:14:05.110
And then you're ready to read the stream itself. And again, the first trial was really hacky: just a
00:14:15.399
regular expression against whatever you're getting, to fetch the stream title and
00:14:20.579
see if you recognize some strings in the raw output data. This thing with
00:14:27.250
unpack1 I found a bit interesting. So there's a way to unpack
00:14:34.600
binary strings in Ruby. There are two methods: one is unpack, and one is unpack1.
00:14:39.850
unpack returns an array, and unpack1 returns just the first element of the array. As a parameter, you
00:14:49.509
give a directive for the format, and the format can be a number of things:
00:14:56.529
floats and integers and so on. For this protocol we only need the 8-bit
00:15:02.620
unsigned integer and a string. The directive can be followed by a count,
00:15:09.100
and if you put a star, it just, you know, repeats until the end of the input.
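For example, the two directives this protocol needs behave like this:

```ruby
# "C" decodes one unsigned 8-bit integer.
length = "\x05".unpack1("C")  # => 5
values = "\x05".unpack("C")   # => [5], unpack returns an array instead

# "Z*" decodes a string and drops the trailing NUL padding;
# the star repeats the directive until the end of the input.
title = "StreamTitle='Fame';\x00\x00\x00".unpack1("Z*")
# => "StreamTitle='Fame';"
```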
00:15:14.399
So for example, here I decode one character from this string, and here I
00:15:20.290
decode everything; it removes trailing null
00:15:27.000
terminators and spaces, essentially. So yeah, I gave it a try, and this works. This is
00:15:36.459
not very interesting data, but it works. And here, oops, we have some trailing garbage, and this is because we read too
00:15:43.480
much data. So, how to read just enough data? We can actually use the protocol as
00:15:52.089
intended, which is to read this length
00:15:57.540
byte for the metadata. Sorry, what the length byte
00:16:04.180
for the metadata is: it's a number, and there's a fixed block size
00:16:10.060
in the protocol, which I think is 16, so whatever length you
00:16:15.250
read, you multiply it by 16 and you get the size of your metadata. As I said, you
00:16:20.560
don't get metadata at each iteration, so sometimes you have to just continue
00:16:25.660
reading audio data. By the way, you see here we completely discard the audio data. So, same thing as before: you
00:16:35.290
read your metadata. And this is one of the fixes I brought, well, one of the things I had
00:16:48.309
to fix compared to the original library, which was encoding. Most streams I
00:16:57.009
encountered are actually Latin-1, you know, ISO 8859-1; some
00:17:02.970
of them are UTF-8 encoded, and there's no reliable way to tell.
00:17:09.810
You could try to guess the encoding, but it doesn't really work. I have a
00:17:18.030
fixed set of radio streams, and I can just parse each one for a bit and try
00:17:24.720
different encodings, and then put it in a configuration somewhere saying, okay, this radio stream has UTF-8
00:17:35.820
encoding, and this one is Latin-1. So here we can transcode as we wish.
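That per-stream configuration and transcoding step could be sketched like this (the station keys and the ENCODINGS hash are illustrative; the real configuration lives elsewhere in the application):

```ruby
ENCODINGS = {
  "radio_a" => Encoding::ISO_8859_1,  # most streams are Latin-1
  "radio_b" => Encoding::UTF_8,
}

# Metadata arrives as raw bytes; tag it with the configured encoding
# for its station, then transcode to UTF-8 for storage.
def transcode(raw, station)
  encoding = ENCODINGS.fetch(station, Encoding::UTF_8)
  raw.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end
```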
00:17:35.820
Then I moved some of the metadata parsing into a separate method. And then, because we
00:17:41.190
actually want to continue reading data, we just yield the title we got to
00:17:47.010
whoever is calling, invoking, this method, and we continue on. And
00:17:54.140
the point is that on the other side, you should also try to be as quick as
00:18:00.120
possible to write this somewhere else and not block the main process.
00:18:08.840
Yeah, so that was a bit better, and this is what I actually
00:18:14.100
shipped in production. And I thought about
00:18:20.250
some of the choices I'd made. So, one was that it didn't really support HTTPS, redirects,
00:18:26.790
stuff like that. You could, you could pull
00:18:33.150
OpenSSL in there, but then in the end you're basically reimplementing Net::HTTP; it's not worth it. And also, there
00:18:41.100
was just... in the end I was not comfortable with doing so much HTTP in this small
00:18:49.200
library. So I went back to actually reusing Net::HTTP, and I had to find a way
00:18:54.540
to deal with the reading of bytes in
00:19:02.100
a controlled manner. I also get some
00:19:07.170
other things for free; I'll mention that in a bit. So, the beginning is the same: we
00:19:13.920
get back the response headers, and then
00:19:19.660
I introduced this abstraction, ChunkedIO. It's a modified version of
00:19:24.880
something I saw in a library called Down; it's a Ruby gem. And I massively
00:19:31.890
simplified it, because I didn't want the whole power of having this IO interface;
00:19:38.650
I just wanted one method, basically, this read method. You can see
00:19:45.610
here, as an intermediate step, I'm transforming the stream body into an
00:19:55.180
enumerator, and then I can get chunks on demand. So, diving a bit deeper into the
00:20:04.870
code side, this is what read does. So this is the length we want to read, and we start with
00:20:13.300
an empty buffer, and this is the thing that can get us the chunks. We read
00:20:21.520
from the source as many times as necessary until we get the right length,
00:20:29.250
and you have to realize that this readpartial reads at most this number
00:20:37.060
of bytes; it could read fewer. And I'll show you the implementation right away. So, if the buffer is nil, we get a
00:20:48.040
chunk from the HTTP response.
00:21:01.090
If we did read more than the limit number of bytes, we store the rest in the
00:21:08.140
buffer, in an instance variable, so this is actually a stateful thing. And whenever readpartial gets
00:21:15.910
called again, the buffer won't be empty, so we don't retrieve a chunk yet; we try to
00:21:22.680
push through the rest of what was in the buffer first.
00:21:28.770
So, the original library I took this from did a bit more to, I guess, make it
00:21:38.440
more efficient. I don't care if I have an extra buffer copy or two, so this was enough. And
00:21:44.290
an interesting thing I discovered was this exception, StopIteration, which is something that gets raised when you want
00:21:51.250
to stop an iteration from a Ruby Enumerator and such. And of course I want
00:21:56.980
to translate it into something more tangible for the library: since we are getting network data,
00:22:02.770
an EOFError. I mean, I know, end of file, but Ruby and UNIX legacy and all
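A simplified ChunkedIO along the lines described, including the StopIteration-to-EOFError translation, might look like this (a sketch, not the talk's exact code):

```ruby
# Wrap an Enumerator of chunks and expose an exact-length read.
class ChunkedIO
  def initialize(chunks)
    @chunks = chunks  # an Enumerator yielding strings
    @buffer = nil
  end

  # Read exactly `length` bytes (fewer only at end of stream, nil if empty).
  def read(length)
    data = +"".b
    data << readpartial(length - data.bytesize) while data.bytesize < length
    data
  rescue EOFError
    data.empty? ? nil : data
  end

  private

  # Read at most `limit` bytes: from the leftover buffer if present,
  # otherwise from the next chunk; keep any excess for the next call.
  def readpartial(limit)
    chunk = @buffer || @chunks.next
    @buffer = nil
    if chunk.bytesize > limit
      @buffer = chunk.byteslice(limit..)
      chunk = chunk.byteslice(0, limit)
    end
    chunk
  rescue StopIteration
    raise EOFError, "end of stream"  # translate Enumerator exhaustion
  end
end
```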
00:22:08.740
of those things, right? Okay, so this is
00:22:14.530
the happy path. What about other stuff? There are redirects; I think I've encountered
00:22:19.660
redirects, or server errors. Maybe the broadcasting
00:22:27.310
system is down or something; I mean, sometimes you get a 500. And maybe you also made a mistake yourself. So what
00:22:36.430
I want to do is handle those cases, and some other things like timeouts and, you know, interruptions,
00:22:44.260
socket errors, whatever, in an integrated way. And the idea is that some things
00:22:53.500
should crash the process: for example, if I have a client
00:22:59.050
error, so if I made a mistake in the code, I want to see this in my bug tracker. But
00:23:05.080
for the rest of the errors, I just want it to retry; sure, I can log the errors and so on, but I want it to
00:23:10.690
recover. I introduced an abstraction called Attempt, and this code is a bit
00:23:21.670
awkward because I had to make it fit onto a slide, but this retries method
00:23:29.580
defines which exceptions I want to rescue, how many retries I
00:23:36.100
want to allow, and what delay I have between retries, which
00:23:41.700
takes a lambda, because I usually want to implement exponential back-off between retries, and
00:23:48.840
then extra things you want to do after each retry: extra logging and so on. So,
00:24:02.519
this code is within the connect method of this Icecast parsing client,
00:24:09.630
and it takes care of the retry logic. And
00:24:16.139
then there's one method which is completely oblivious to what's happening
00:24:22.860
outside of it, which is the connect-without-retries one. And this thing here, the
00:24:28.769
validation, may be over-engineering, but what I wanted is: if the block
00:24:38.730
crashes nine times, and then it works, and then it works for another three hours, and then it crashes again, I did not
00:24:45.690
want to crash the surrounding process. So I said, okay, you know what, once it succeeds, once it's good again,
00:24:52.710
we reset the counter of retries. And here's the implementation:
00:24:59.940
we increment the number of attempts, we run whatever code we want to run, and,
00:25:07.580
I've removed some things here, but we log an error, hopefully with a nice
00:25:13.409
message, with the stack trace and, yeah, stuff like that. I also log a
00:25:21.169
longer message here, about: okay, this is retry number six, we're retrying in 10 minutes,
00:25:27.860
and so on. And finally, we retry the block.
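The retries idea might be sketched like this (the names and exact signature are assumptions, and the counter-reset-on-success behavior from the talk is simplified away; the sleeper is injectable so the back-off can be tested):

```ruby
# Retry a block on selected exceptions with exponential back-off.
class Attempt
  def initialize(retry_on:, tries: 10, delay: ->(n) { 2**n }, sleeper: method(:sleep))
    @retry_on, @tries, @delay, @sleeper = retry_on, tries, delay, sleeper
  end

  def run
    attempts = 0
    begin
      attempts += 1
      yield
    rescue *@retry_on => error
      raise if attempts >= @tries            # give up: crash the process
      warn "retry ##{attempts} after #{error.class}: #{error.message}"
      @sleeper.call(@delay.call(attempts))   # back off before trying again
      retry
    end
  end
end
```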
00:25:33.960
Right, so this added a
00:25:39.299
bit of stability to the thing. We don't have only one stream;
00:25:46.070
we have like five, ten, at this point probably 15 radios. And this
00:25:55.820
is in the context of a Rails application. So even though the process doing this
00:26:03.800
reading is a continuous background process, supervised and everything,
00:26:09.650
it still has to write to a database. And if
00:26:15.650
you want to write stuff for 15 different radio streams, which could each write at
00:26:22.720
random times, more or less, you have to be careful with
00:26:27.740
how you connect to the database, because you could exhaust your connection pool.
00:26:33.440
So I adopted the classical, you know, concurrency pattern of multiple producers,
00:26:40.400
one consumer, where they communicate with a queue. This is from the thread
00:26:48.410
library in Ruby. Then we have a consumer, so it waits for something to get into
00:26:58.040
the queue; it blocks here. Once we get an element, we store the information
00:27:06.890
in the database, and then we fire a background job to handle it accordingly. On the producer side, we
00:27:19.100
instantiate our client, we do the parsing of the metadata, and this block is called
00:27:27.560
whenever there is a new title. So whenever there is a new title, we put it into the queue.
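The producers/consumer wiring could be sketched like this (the station names are made up, and the `results` array stands in for the database write plus the background job):

```ruby
# Multiple producers, one consumer, communicating via Ruby's thread-safe Queue.
queue = Queue.new
results = []

producers = ["radio_a", "radio_b"].map do |station|
  Thread.new do
    # In the real system, this is the Icecast client yielding each new title.
    2.times { |i| queue << [station, "Title #{i}"] }
  end
end

consumer = Thread.new do
  loop do
    station, title = queue.pop       # blocks until an element arrives
    break if station == :done
    results << [station, title]      # the single place that touches the DB
  end
end

producers.each(&:join)
queue << [:done, nil]                # sentinel to stop the consumer
consumer.join
```

Because only the consumer thread ever writes, one database connection is enough no matter how many streams are being read.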
00:27:36.820
There are a few subtleties. One is, I'm using Sidekiq for this. Sidekiq
00:27:45.160
guarantees that the job runs, but it
00:27:50.990
could run more than once, so you have to be careful to not process things twice.
00:27:57.030
Sometimes you just get bad metadata from some radios; I don't know, for one
00:28:02.230
day maybe their, like, currently-playing system is down, so they
00:28:08.140
send you audio but not the description of it. You get jingles as well; what I
00:28:13.210
call jingles is the telephone number or Facebook page of the radio,
00:28:19.350
stuff like that, advertising basically. And you also get partial matches, of course. What I mean by partial
00:28:26.230
matches is that we have tracks in the mx3 database, right? So even if an artist is in
00:28:35.980
the database, they may not have uploaded all their tracks, and we want to notify this artist if the ones they did upload are
00:28:43.290
being played on the radio.
00:28:50.110
So yeah, how do you do that? I approached it with Postgres. Postgres has a module called
00:28:56.350
fuzzystrmatch, and this one has a function for calculating the
00:29:06.190
Levenshtein distance, which is the distance between two strings: the number of differing characters. So let's start
00:29:15.480
with the band. Let's assume that the track called
00:29:23.020
Fame by this artist is being played. So this is what we received; we parse the metadata and
00:29:28.900
normalize the whole thing, making it as easy as possible to match against what we have in the database. What I do is, I
00:29:37.930
want to get the closest match, so I limit to one result, and we order by distance, with
00:29:44.830
the closest distance first. So hopefully we get a band ID, along
00:29:50.320
with its distance from the thing we want to match. Then we do something
00:29:59.020
similar with the tracks of the same band. By the way, this levenshtein function
00:30:06.610
has a faster alternative, I think levenshtein_less_equal, which is made for
00:30:11.799
small distances. What it does is, it
00:30:17.710
has a threshold. So if you say, for example, the threshold is 4: if the
00:30:22.870
distance is lower than 4, it will compute the exact distance; otherwise it just
00:30:29.620
gives you 4. I did not use it, because if the distance is greater than 3, in this
00:30:38.379
case, I actually don't want this to be recorded as a partial match; for me, it's too far away. I mean, this is
00:30:45.610
more something I have observed with this thing;
00:30:50.769
the exact number can be fine-tuned, but there's a threshold beyond which it's just noise: you know, if two artists have
00:30:57.309
the same number of characters in their names, of course they will be within some kind of distance; it makes no sense to
00:31:02.889
try too hard. So, in the end, I
00:31:08.559
translated this into ActiveRecord code, with sanitization and so on.
00:31:17.799
I get the band and its track, and if both
00:31:24.159
the band and the track are found, I return the track along with the
00:31:30.730
combined score, the sum of the two distances. And then, if this is zero, if the
00:31:37.570
distance is zero, then it's an exact match; if it's not an exact match, then you have a partial match.
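The matching and classification logic can be sketched in plain Ruby like this (in production the distance comes from Postgres' levenshtein() in fuzzystrmatch; the local implementation, threshold, and in-memory candidate list here are just for illustration):

```ruby
# Classic dynamic-programming edit distance between two strings.
def levenshtein(a, b)
  rows = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (0..b.length).each { |j| rows[0][j] = j }
  a.each_char.with_index(1) do |ca, i|
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      rows[i][j] = [rows[i - 1][j] + 1, rows[i][j - 1] + 1, rows[i - 1][j - 1] + cost].min
    end
  end
  rows[a.length][b.length]
end

THRESHOLD = 3  # beyond this, it's noise rather than a partial match

# Find the closest candidate and classify it as in the talk:
# distance 0 is exact, 1..THRESHOLD is partial, above that no match.
def classify(received, candidates)
  best = candidates.min_by { |name| levenshtein(received, name) }
  distance = levenshtein(received, best)
  return [:none, nil] if distance > THRESHOLD
  [distance.zero? ? :exact : :partial, best]
end
```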
00:31:43.899
And this can be presented to an admin, who can then accept or reject it. And yeah, this
00:31:52.450
can be fine-tuned, but basically this is it. Thank you for listening.