И35: A. I. Korobeynikov | Compilers | LLVM | MLIR | Clang | Bioinformatics

Transcript

0:00

- Anton, hello! Thank you for coming. - Hello.

0:05

- We know you as an expert in LLVM, first and foremost.

0:09

But I think we will not only talk about that today, but also about bioinformatics.

0:14

It's something I don't understand at all, but it's very interesting.

0:18

But first, a question about LLVM. Can you explain to our listeners who are not familiar with it, what it is?

0:23

- That's actually a very interesting question. About 5 or 10 years ago, we looked it up on Wikipedia.

0:34

It proudly stated that LLVM is a low-level virtual machine and everything else.

0:41

But as with all abbreviations, it is neither low-level nor a virtual machine.

0:49

Yes, that might have been some kind of initial idea, but to put it simply, it is a framework,

0:57

as they say in modern terms, for building compilers, for building tools,

1:04

for code analysis, code transformation. Over the last 25 years, it has grown quite a bit,

1:11

and it has covered, in principle, more or less all the tasks we might have around compilers,

1:19

around the relevant tools, code analysis technologies.

1:23

So it's a working tool, a workhorse, which, in principle, can be used,

1:29

and which is actually used by a lot of people, all over the world, in many products.

1:35

- I understand that it is primarily an intermediate language within the compiler?

1:40

- Well, that's part of LLVM, what is called LLVM IR, that is, the internal representation of LLVM,

1:47

but it is precisely that representation of code that we work with internally.

1:52

But you can't just work with the code as it is.

1:55

You need some tools for that, you need something that allows you to modify this code,

2:02

you need something that allows you to analyze this code, even in a basic way.

2:08

So we still won't work with the code in some natural representation.

2:14

We need to have a representation in memory, and we need to have a representation

2:20

in a serialized form, so that we can save it to a file somewhere and so on.

2:26

So, in principle, that's part of it.

2:28

But, of course, on one hand, it is a fairly central part of all LLVM,

2:35

because it is this part that largely dictates the decisions that have been made,

2:43

the design that is built around all of this.

2:46

But it is not the only part of LLVM.

2:49

The innermost representation is quite, one could say, useless,

2:56

without the tools that can be used for its analysis and transformation.

3:01

So if I write a compiler for my new language, it is beneficial for me to use LLVM

3:07

as a framework that I rely on in my compilation methods and further code optimization.

3:16

This is one approach.

3:18

So it is clear that there is no silver bullet.

3:21

There are no universal solutions that work in all cases of life.

3:26

Plus, LLVM is quite low-level.

3:31

But if you need to quickly support a certain number of standard optimizations

3:40

or get working code generation right away for a bunch of platforms,

3:46

then LLVM is probably a pretty good default option,

3:51

where you can just take it, plug it in, and get something.

3:54

Again, there are things that LLVM does not handle very well.

3:59

This is, in principle, as always.

4:01

- How did you get into this story?

4:03

How did you get lucky, or maybe, on the contrary, was it some kind of misfortune?

4:08

Why did you get into this development?

4:11

- This is actually quite an interesting story, I guess.

4:15

So, it is already 2025 now.

4:21

Well, I think it was 20 years ago. So it was around 2005 or 2006.

4:27

And it turned out that at that time there was a certain project,

4:30

for which code transformation needed to be done.

4:34

And for that, we actually needed some tool.

4:40

At the same time, various source-to-source transformation options,

4:44

that is, at the source code level, would definitely not work,

4:49

because the source language was C/C++.

4:52

Well, that means you actually need to bring up a full-fledged C/C++ compiler.

4:57

So, naturally, that is not what we generally want for a semi-sequential project.

5:04

And then I started looking at what options we had.

5:07

What, where, how. We tried to do this at the GCC level, but it didn't work out.

5:14

There were some academic options, but again, there were problems with that too.

5:18

And then I stumbled upon LLVM, we looked at it, tried it, and it seemed like we were getting somewhere.

5:26

But there was an interesting problem.

5:28

The fact is that support for Windows was needed as a target platform at that time.

5:33

And LLVM's support for that was mediocre, to say the least.

5:40

So, primarily, my work around LLVM was about adding Windows support as a platform.

5:53

In fact, support for Windows as a host platform, where the compiler itself was running, was relatively okay.

6:00

In particular, the code was written quite platform-independent, all C++, and in that sense, there were no problems.

6:07

The problem was specifically that there was no support for Windows as a target platform.

6:12

That is, so that the generated code would work on Windows.

6:16

This includes calling conventions, all sorts of our favorite things around calling or importing functions from DLLs.

6:27

Well, in general, that's how it all started.

6:30

At that time, I even spent about five years, maybe a little less, making binary packages for Windows.

6:39

Well, then it went on with some things around Linux, LLVM, basically a bit of everything everywhere.

6:46

- And at that time, was it a product of Apple?

6:50

- No, that was before Apple; Chris wasn't at Apple yet.

6:54

At that time, it was purely a research project at the University of Illinois at Urbana-Champaign.

7:04

- So it was a little-known open-source product at that time?

7:07

- You could probably say that.

7:09

But given that it was made quite well, and I must give credit,

7:14

since then I have seen quite a lot of academic developments.

7:16

So compared to other academic developments, it was probably a head or two above them on several points:

7:22

design, usability.

7:24

Well, and still, despite the fact that it was an academic development,

7:27

there was already a group working on it, led by Vikram Adve.

7:32

He was Chris's advisor.

7:37

But it was not a project of one person or one student or one graduate student.

7:42

There were, for example, about five or six people.

7:45

Everyone understands that Vikram had a whole grant for this.

7:49

So in that sense, there were some resources.

7:53

And not small ones.

7:55

- I understand that they were trying to compete with GCC at that time?

8:00

Or was that not the main goal?

8:03

- No. There was no competition goal.

8:05

Moreover, at that time, Chris came to the GCC developers' mailing list.

8:10

He said, guys, let's be friends.

8:12

Here we have LLVM, let's try it as a representation.

8:17

The GCC developers were quite cool toward it.

8:22

Partly due to licensing considerations.

8:26

Because at that time, GCC was not very well disposed towards any code

8:32

that was not under the GPL license in principle.

8:36

And plus, the architecture of GCC at that time did not really allow for any embedding.

8:43

On the other hand, it is impossible to build a framework for analysis, transformation,

8:50

and optimization of program code, and for code generation,

8:53

well, without having a front-end.

8:55

Such a framework is absolutely useless to us

8:58

if we cannot use it to compile any mainstream languages.

9:03

Then creating a front-end for some niche projects is great.

9:08

But still, in order to move further and take us to the next level,

9:14

we need a front-end for some

9:17

mainstream industrial language.

9:20

And it is quite clear that developing a front-end even for C

9:29

is still a fairly non-trivial task.

9:32

The specifics of the language and those million little details that need to be taken into account.

9:41

And which are carefully specified or not specified in the standard.

9:44

Plus, it is clear that there is a thick layer of everything else spread on top of this,

9:51

due to the various extensions that exist in different compilers and so on.

9:55

Therefore, the natural step is not to try to do everything from scratch.

10:02

To invent our own five-wheeled bicycles with square wheels.

10:07

But rather to simply reuse.

10:12

And thus, what we then called llvm-gcc came into being.

10:17

That is, we take the GCC front-end and try to attach LLVM as the middle-end

10:23

and back-end. And in a sense, a synergistic effect was achieved because, on one hand,

10:29

LLVM as a project received a full-fledged front-end for complete languages. Enthusiasts quickly emerged

10:35

who needed, for example, Ada as a front-end, and it seems that

10:42

according to rumors, it actually worked quite well. Some needed Fortran, and, in principle,

10:50

GFortran also worked somehow. I even fixed some things around that. Then it turned out that,

10:59

since llvm-gcc already existed, it allowed, on one hand, the LLVM community to get

11:09

something ready-made on which they could start building further. On the other hand, yes, the GCC community received

11:20

a kind of obvious competitor that also spurred development in one way or another. - And in the end, it won,

11:25

as I understand it? - "Won" is putting it too strongly. I would say it is roughly like comparing apples and oranges.

11:34

If we talk about various proprietary things, then yes, I would probably say that LLVM is used

11:43

more often, and the reason is indeed the license. On the other hand, GCC supports a much larger number of platforms,

11:49

including quite exotic ones. And they live there. Many things in GCC are quite well

11:59

preserved and change very rarely. Therefore, you can create a back-end for some rare platform,

12:09

and it will live almost unchanged for 5-10 years within the codebase. Plus, historically, GCC has had

12:18

front-ends for Fortran, for Ada, and some others, which, in principle, did not exist in LLVM,

12:27

but that was not the goal. That is, a Fortran front-end is only now being built, on the third or fourth attempt at the task.

12:36

So I would probably say that this is not a competition where one side has to win, but rather coexistence.

12:43

Plus, again, many well-known open-source projects have actively relied on, well, probably not GCC extensions as such,

12:52

but rather the specifics of how it works. Starting from the Linux kernel and even including the standard library, glibc.

13:06

- What was your main contribution after you helped with Windows?

13:14

- Well, historically, after that I was quite involved in supporting the ELF platforms.

13:22

So, on one hand, it seemed like small things, but on the other hand, it couldn't have been done without them.

13:29

That is, all sorts of things like visibility types, various weak symbols, linkage details.

13:49

At some point, I was working on support for DWARF, which is the debug information format for ELF platforms.

14:02

We use DWARF for exception support as well. So, support was needed from both the code generation side and the runtime side.

14:13

Probably the next thing was this. Along the way, there were a bunch of different small tasks.

14:21

At some point, I did something for fun, but also as a small documentation effort.

14:34

That is, as always, we have the problem of large open-source projects that are actively developing and moving forward, which naturally leads to gaps in documentation.

14:44

And while, for example, documentation for the IR does get written, because it is, in fact, a specification, and you cannot do without it,

14:53

well, plus the IR as such changes very little,

14:56

for a particular part such as the backend, keeping documentation up to date is quite painful and cumbersome,

15:09

if you don't have a dedicated technical writer who constantly tracks, updates, writes, and so on.

15:16

And it so happened that there was practically no documentation regarding the backend. Or it resembled that well-known meme about drawing an owl.

15:26

We start drawing an owl, here is how, and then right away: "Hey guys, look, you have, I don't know, 250 thousand lines of code, this is the x86 backend."

15:37

And there was nothing in between. I even gave a couple of talks back then, in the series "How to build a backend in 24 hours."

15:48

As an example, I created a backend for MSP430, a humorous microcontroller that is still very much alive because it costs just a few cents and everything else.

16:03

That's why it is still being produced and sold, practically by weight. Plus, it has a very regular architecture, with 20-something instructions, which is basically nothing.

16:20

But at the same time, everything is there. There is memory, there is arithmetic, a thousand other things. So, I probably spent a few evenings on it. 24 hours, of course, was just for effect, but in principle, I managed to create a backend in a week of evenings, and it is still very much alive; I am still considered its maintainer, and I still support it, refine it, and review it for regressions.

16:47

And as far as I know, people are still using it, especially thanks to the Rust community.

16:56

- How do you assess the complexity of the LLVM project?

17:01

- The complexity at the moment is actually quite considerable. I seriously doubt there is anyone who understands everything.

17:07

It's important to understand that LLVM has evolved into an infrastructure over 25 years. Or, as we can say now, into an ecosystem.

17:18

In reality, we have completely different components, each of which does different things, because, in fact, Clang is part of LLVM,

17:28

MLIR is part of LLVM, the LLDB debugger is part of LLVM, LLD is part of LLVM. So, all the components are, of course, an integral part if we are talking about building a full-fledged toolchain for something, but for someone to understand all the details and nuances, no.

17:53

On the other hand, and in my opinion this is quite important, the project has fortunately managed to keep its codebase transparent. So you can sit down and quickly figure out what is happening at a conceptual level.

18:09

The main thing is not to be afraid to dig wide rather than deep, probably. So, the project itself is quite large. Plus, some time ago, development of a standard library began.

18:24

That is, at first it was the standard library for C++, since it is impossible to imagine a C++ compiler without one. But now, in addition, development of a standard library for C is also ongoing.

18:42

This, of course, adds code if we consider it in lines. But, on the other hand, what is important is that these are separate components, not just some mixed-up mess of packages.

18:52

- Don't you think that after 25 years, any software product becomes morally outdated and it's time to consider rewriting everything from scratch? Is that not the case for you?

19:05

- That's true. I won't even argue with that. But in a sense, MLIR, that is, what we are observing, is a kind of radical approach to blow everything up and rewrite from scratch.

19:20

So, on one hand, it can be said that in LLVM, and I can talk quite a bit about this, there are some unfortunate design decisions that were good at one time but turned out to be quite unpleasant later.

19:39

And then, of course, as always, when we have something suboptimal at the foundation, we will have to live with that legacy, because reworking it endlessly is painful, and it is not very clear what for.

19:59

We've lived with it for 20 years, endured it, and probably could endure it a bit longer. On the other hand, MLIR tried to address some of the design issues we have in LLVM, such as flexibility, extensibility, and some aspects related to the internal representation.

20:20

Plus, again, it should be understood that LLVM is not universal; LLVM IR is quite low-level, and as a consequence, it is quite inconvenient for expressing all sorts of high-level things that are specific to a particular language.

20:45

We are already used to doing and developing things on computers.

20:55

And if we were to be handed an arithmometer now, it would probably also be painful and costly to deal with something like that.

21:00

On the other hand, it is clear that it is possible. History shows that people have dealt with all of this.

21:06

The question is about the ratio of effort to output.

21:09

Therefore, MLIR, as an approach, is partly a way to solve the problems that have accumulated.

21:16

But what is important is that this is not an attempt to just throw everything away and rewrite everything from scratch.

21:24

We see that there are design problems that prevent us from doing certain things.

21:31

Let's take a look. We'll develop something that incorporates the good aspects we already have.

21:37

A classic example. There are quite a few various internal representations developed for compilers.

21:48

And somehow, in many cases, it turns out that, for example, this internal representation is not serializable.

21:57

That is, you can't just take the internal representation and export it to some file to read it manually and then load it back.

22:05

In my opinion, this is one of the main cornerstones: the internal state of the compiler should exist in three forms.

22:18

That is, in a serialized binary form, as a file; in a textual representation, so that a mere mortal can read and understand it in a reasonable time without needing complex deserialization tools.

22:33

And of course, there is also the in-memory representation, which allows us to work with it from a program.

22:39

And it suddenly turns out that something that seems quite simple is, in many cases, an innovation for many representations.

22:52

Yes, there is a debug dump. No, we cannot read the dump back because we cannot set it up that way.
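
To make the three-forms idea concrete, here is a minimal Python sketch of a toy instruction list that round-trips between an in-memory form, a readable text form, and a compact binary form. Everything in it (the `Instr` tuple, the text syntax, the byte layout, the function names) is invented for illustration and has nothing to do with LLVM's or MLIR's actual formats or APIs.

```python
import struct
from collections import namedtuple

# Hypothetical toy instruction: result name, mnemonic, tuple of operand strings.
Instr = namedtuple("Instr", "dest op args")

def to_text(program):
    """In-memory form -> human-readable text, one instruction per line."""
    return "\n".join(f"{i.dest} = {i.op} {', '.join(i.args)}" for i in program)

def from_text(text):
    """Text form -> in-memory form; the inverse of to_text."""
    program = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        dest, rhs = line.split(" = ", 1)
        op, _, rest = rhs.partition(" ")
        args = tuple(a.strip() for a in rest.split(",")) if rest else ()
        program.append(Instr(dest, op, args))
    return program

def _pack_str(s):
    # Length-prefixed UTF-8 string (16-bit little-endian length).
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def to_binary(program):
    """In-memory form -> compact bytes (made-up layout, for illustration only)."""
    out = bytearray(struct.pack("<I", len(program)))
    for ins in program:
        out += _pack_str(ins.dest) + _pack_str(ins.op)
        out += struct.pack("<H", len(ins.args))
        for a in ins.args:
            out += _pack_str(a)
    return bytes(out)

def from_binary(blob):
    """Bytes -> in-memory form; the inverse of to_binary."""
    pos = 0

    def read(fmt):
        nonlocal pos
        (value,) = struct.unpack_from(fmt, blob, pos)
        pos += struct.calcsize(fmt)
        return value

    def read_str():
        nonlocal pos
        n = read("<H")
        s = blob[pos:pos + n].decode("utf-8")
        pos += n
        return s

    program = []
    for _ in range(read("<I")):
        dest, op = read_str(), read_str()
        args = tuple(read_str() for _ in range(read("<H")))
        program.append(Instr(dest, op, args))
    return program

prog = [Instr("%a", "const", ("2",)),
        Instr("%b", "const", ("3",)),
        Instr("%c", "add", ("%a", "%b"))]
print(to_text(prog))
```

The property worth checking in such a design is the round trip: printing and re-parsing, or writing and re-reading, should give back exactly the same in-memory program, which is what a one-way debug dump cannot offer.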

23:02

So in a sense, as I mentioned, MLIR was such an approach, and it continues to evolve as a project, as part of LLVM, as a response to the problems that we have.

23:15

Well, on the other hand, as always, when we solve one problem, new ones arise. It seems that the law of conservation of energy has not been canceled.

23:26

- I know that large companies often create their own versions of Clang or LLVM, including IAR, and branch off without returning to the main branch.

23:40

This happens often; I hear these stories. You have probably encountered such customization. You may have even participated in such projects.

23:50

Tell me what you think about this and whether it is good, and where it ultimately leads. - Whether it is a problem or not, I probably can't say for sure.

23:59

But yes, such experience exists. And here, as always, since we are talking about industrial solutions, it is important to have the right balance between cost and what you get in return.

24:17

This is what is called cost-benefit. It is important to understand that any such forks, any downstream code we have, come with ongoing maintenance costs.

24:31

And the question is whether the company that created it is ready to provide that support.

24:39

Because there are completely different scenarios. You take something once and then live with it for the next 5, 10, 20 years, and then repeat the cycles.

24:51

For something, this might actually be a workable option. If we are talking about a target, such as a compiler for C++,

25:04

let’s say for some exotic platform, sooner or later users start asking where our features from C++38 or C++42 are.

25:21

Naturally, this requires synchronization with mainline, at least in one direction.

25:29

Some companies have understood this quite well and have even learned to live and work with it.

25:36

The thing is that as soon as you contribute the code back to mainline, you hand off its maintenance. Someone else starts maintaining it for you.

25:44

Therefore, if it can be done this way, the company does it. There are precedents where companies have tried to create their own IR on top of LLVM IR for something.

25:59

So there are actually several examples. So far, all the examples I know of are related to compilers in one form or another for graphics coprocessors.

26:16

Because usually, when our target platform is a GPU, we do not generate code specifically for a particular piece of hardware.

26:25

Instead, our output is yet another internal representation, possibly even lower-level, which is then compiled by a specific graphics driver.

26:36

Depending on which specific graphics card a person has. We have already established some layer of abstraction from the platform, and the details are handled further.

26:49

There are precedents where someone is still living with a representation on top of LLVM 3.7. There are those who are living with a representation on top of LLVM 7, as far as I know.

27:03

They chose this path; it was their decision, and once they did that, they took the code, and it’s up to such a company to develop and maintain it, why not.

27:21

The license allows it, and moreover, it is in the spirit of open source.

27:28

- Why are they doing this? Why do they take Clang, for example, and try to insert some additional features into it? What is lacking in the standard Clang?

27:38

And what is the business effect of expanding or improving the compiler? What could be wrong with the compiler?

27:47

- Some of the issues may be related to the simple need for compatibility. Imagine that a company has had compilers for 30 years that were not based on Clang, not on GCC, but had their own custom-written compiler with some of their own wonderful extensions, say, for C or C++.

28:12

And naturally, such a company has its own codebase, and there is a user codebase that still needs to be compiled using these extensions.

28:26

So then we have the absolutely standard situation where we either tell the user, "No, guys, sorry, but starting from version N+1, your code will not compile, so you will have to rewrite it."

28:41

And the last time this code was touched was 10 years ago, and therefore the last person who knows how it works has long since retired.

28:51

Or we can add support for the compiler extensions. This is a classic example, that is, backward compatibility.

28:58

Somewhere, extensions are needed for the corresponding target platform. We are creating yet another, I don’t know, accelerator for something, for AI, for some kind of network traffic processing, or something else.

29:18

Say our target language is not the full C. For example, our target is some subset of C. Well, that means we need to carefully trim away what we do not support.

29:28

Additionally, we need to enhance this by incorporating the features of our platform into the corresponding language.

29:43

Again, we get language extensions that are specific to a particular platform, a specific vendor, or something else.

29:53

Something like that. Plus, there are simply cases where we have what seems to be a fairly standard platform, but it is closed.

30:01

That is, you cannot just take it and start writing code for it. And don't think that this is something exotic. All those gaming consoles that are out there, all three and a half of them on the market,

30:15

are exactly such closed platforms. Well, and we need to have, for example, basic support for this platform,

30:29

support for some built-in, I don’t know, macros, additional data types, platform descriptions. And companies somehow believe that this is their proprietary thing, that they are not ready to give it to the LLVM Mainline, so they have to support it downstream.

30:47

- Probably, all those who write LLVM in an open-source format are in high demand in the market of such companies that need help in customizing these solutions?

30:57

- Well, yes and no. That is, in fact, one could say that, in principle, the job market around compilers

31:08

has been quite hot lately.

31:12

Partially, of course, this is related to the rise of

31:16

AI/ML compilers, that is, for things around artificial intelligence.

31:22

Because at some point, the realization came

31:26

that what was previously called a machine learning framework

31:31

is actually, 80% of it, a compiler

31:37

in one form or another. That is, with its own specifics:

31:42

in the case of AI/ML workloads, we usually deal with dataflow

31:47

rather than control flow, as in regular languages.

31:51

But nevertheless, it is a compiler.

31:53

Thus, as soon as the explosive growth happened,

31:56

everyone became AI/ML engineers.

32:00

Naturally, there was a very high demand for people

32:04

who understand something about compilers,

32:08

what is happening with them, and, of course,

32:12

a certain expertise emerged in the market.

32:17

- I want to move on to the topic of bioinformatics at this point.

32:21

What is it in a nutshell?

32:23

I find it very interesting, but I don't understand.

32:25

- Perhaps it's worth first explaining to the listeners

32:29

why we moved on to this topic.

32:32

Let's probably do it that way, so that it doesn't come across as if

32:35

we were suddenly going to start discussing ballet or modern jazz.

32:41

- I also find it amazing,

32:43

that you are a specialist, on one hand, in compilers

32:45

and on the other hand, bioinformatics.

32:47

I can't connect these two topics.

32:49

- In fact, the situation is very simple, on one hand.

32:53

Perhaps it’s worth starting with the fact that, unlike many people,

32:57

who are in this field, I do not have a specialized education specifically in computer science.

33:06

I have an education, essentially, in applied mathematics

33:10

and in university, I was essentially engaged in computational statistics.

33:15

On the other hand, actually, 25 years ago, it was a kind of hobby

33:23

or a way to switch gears, to do something different.

33:27

On the other hand, in reality, I have always worked at a relatively low level

33:33

in terms of programming.

33:35

It's not that one was primary or secondary,

33:40

they are roughly parallel.

33:43

In fact, the understanding of how compilers work,

33:47

how data structures are organized, how to compute something accurately, correctly, and quickly,

33:52

it helps a lot because when we need to accelerate a computational method mathematically,

33:59

you start to understand a priori how it needs to be structured under the hood,

34:04

how to make it work quickly, well, accurately, and so on.

34:09

And, in principle, I still sit on that second chair

34:16

in terms of the same statistics, teaching, and so on.

34:22

My Candidate of Sciences thesis was specifically on computational statistics.

34:26

Then, at some wonderful moment, a little over 10 years ago, in 2012-2013,

34:35

I met my former classmate, who said something like this:

34:44

"We are opening a bioinformatics lab in St. Petersburg; would you like to join us?"

34:52

That classmate, with whom we studied together for our first degree, was Kolya Vyahhi.

34:59

He is currently mostly involved in educational projects.

35:05

It's also called edtech.

35:08

If you've read or heard about them, Rosalind and Stepik are both his creations.

35:14

The feature of the educational laboratory was that, despite the fact

35:21

that it was supposed to focus on bioinformatics tasks,

35:26

it was built in a somewhat unconventional way.

35:30

That is, bioinformatics is essentially at the intersection of computer science and biology.

35:37

And there are roughly two approaches to building something here.

35:41

You can take, roughly speaking, biologists and teach them programming.

35:47

Or you can take computer scientists, so to speak, people who have an education in computer science,

35:55

and teach them the necessary things related to biology.

35:59

As you can guess, the results turn out to be completely different.

36:04

Because in one case, when we teach biologists,

36:10

the primary focus is on some deep aspects of data analysis.

36:16

And in the case of teaching computer scientists, it will probably be more about developing tools.

36:25

That is, when we take and write something, develop some methods, some programs for all of this.

36:33

At that time, it wasn't that there was a lot of time, but I saw some tasks in computer science that seemed interesting to me.

36:43

Plus, they also touched on statistical aspects.

36:47

So it all kind of came together, and I sort of stepped in with one foot.

36:52

And then, as always, it began.

36:58

I saw that there are some areas where I can be of help.

37:03

To explain how it should be structured in order to work well and quickly.

37:09

And for the listeners to understand, the main software product of the lab was a genome assembler called SPAdes.

37:21

If we try to explain what a genome assembler is, we can probably use a standard metaphor.

37:31

Imagine that we went to a printing house where they still print paper newspapers and took a pallet of newspapers fresh off the press, tomorrow's issue.

37:46

After that, we carefully placed a couple of explosive charges under this pallet of newspapers so that everything would be torn into tiny pieces.

37:57

And then, having a set of these pieces, we need to try to reconstruct the original issue of the newspaper.

38:04

It is clear that, firstly, we may have only fragments of phrases.

38:09

It is clear that these pieces may have holes, slightly burned edges, and so on.

38:15

Nevertheless, we understand that we have some redundancy.

38:18

That is, we didn't just tear one newspaper into pieces, but an entire daily issue at once.

38:26

On one hand, this redundancy helps us because we can hope that a single error won't ruin everything.

38:36

On the other hand, naturally, this redundancy hinders us because we lack some clarity.

38:42

In other words, we are not assembling just one puzzle, but we are trying to assemble one puzzle from a bag that contains 50 or 100 mixed puzzles.

38:51

At the same time, all the puzzles are different, but the picture on them is the same.

38:57

The peculiarity of this is that the raw data is measured in gigabytes and terabytes.

39:05

And we need to extract something from them that will be correct and accurate.

39:13

Plus, we do not know the correct answer.

39:15

This is yet another aspect that is more or less standard for analyzing any data.

39:22

And it so happened that SPAdes, as a tool, became one of the standard tools for genome assembly worldwide.

39:31

Due to its universality and the quality of its results.

39:40

And for some time I was, translating it into terms familiar to the listeners,

39:51

the tech lead of the team that was developing the assembler.

39:58

- And what does it ultimately assemble? I understood the newspaper metaphor.

40:03

- Genomes. We have something called a genome.

40:10

If we remember what someone was taught in school, we recall DNA, the spiral, and so on.

40:18

If we abstract from some biological aspects, we are dealing with long strings.

40:30

We have a string in the alphabet of 4 letters. The interesting strings for us are very long.

40:38

Bacterial genomes are several million nucleotides long, meaning several million letters.

40:44

And we need to reconstruct the complete sequence from individual fragments, from separate pieces.

40:52

To make it clear, our fragments are 100, 200, 300 letters long,

40:59

and from them, we need to reconstruct a string that is millions and billions of letters long.
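
The task described here, rebuilding a long string from short overlapping fragments, can be sketched as a greedy merge of the best-overlapping pair. This is a deliberately naive illustration and not a description of how any production assembler works: it assumes error-free fragments and a target string without repeats, and the names `overlap`, `greedy_assemble` and the sample `genome` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more: the pieces stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads


# Hypothetical 14-letter "genome" cut into 6-letter fragments, given out of order.
genome = "ACGTTGCAAGGCTA"
reads = ["CAAGGC", "ACGTTG", "AGGCTA", "TGCAAG", "GTTGCA"]
print(greedy_assemble(reads))
```

On real data this approach breaks down quickly: fragments contain errors, genomes contain repeats, and comparing every pair of fragments does not scale to gigabytes of input.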

41:05

- What does the practical task look like? Where do these genome fragments come from?

41:11

What is the output and how is it used later?

41:15

- Practically, this is actually a completely separate story.

41:19

It starts with what are called wet biologists, when, for example, a bacterial or simply...

41:27

That is, a specific culture is taken. After that, a bacterial culture is isolated from it

41:31

in large quantities. In principle, DNA can even be isolated at home;

41:36

there are videos on video platforms, on YouTube for sure, I don't know about others.

41:43

There were a bunch of videos on how to isolate DNA at home.

41:47

Usually, this is done with some polyploid berries, like strawberries.

41:53

In principle, it can be isolated at home using

41:56

Fairy, that is, dishwashing detergent, and something that allows for

42:04

lysis of the cell wall. That is, practically, it's a marinade of the kind that

42:09

lets us tenderize meat.

42:14

And after isolation, large complex boxes

42:21

called sequencers are used, where everything is loaded and which try to...

42:27

They are built on different technologies; there are about three and a half technologies

42:31

that allow us to read the sequences letter by letter.

42:35

But the essence of it, probably, is that the sequencer takes these long,

42:39

long, long sequences and lets us obtain shorter sequences.

42:47

Not always accurately, not always correctly. But nevertheless, this will be the outcome of the process.

42:54

Knowledge of sequences allows for many things.

42:57

That is, if we... So, this starts with...

43:01

Many things are determined, one way or another, by various

43:06

genetic traits. So, at this point,

43:12

it is already... That is, genome analyses in one form or another

43:19

have become quite routine.

43:23

We can analyze whether the strain of this bacterium is associated with any disease.

43:30

That is, does this bacterium have a plasmid

43:35

or something else, that is, some separate element of the genome

43:39

that determines increased antibiotic resistance?

43:44

We can talk about screening for various congenital, or rather,

43:49

genetically determined diseases. You and I, and all the listeners,

43:53

fortunately have two copies of the genome, one from dad and the other from mom.

43:57

On one hand, this redundancy allows us to be more or less tolerant

44:03

of many malfunctions. After all, the process of DNA replication is still not perfectly accurate.

44:08

But, in principle, this is also a heavy legacy in some cases,

44:14

when the stars align in such a way

44:17

that some genetic anomalies or genome malfunctions come together.

44:24

Mom doesn't show it because it's written in only one of her copies, dad doesn't show it

44:29

because it's written in only one of his copies. But, unfortunately, for the child, it so happened

44:33

that the bad things came from both sides at once, so there is no redundancy left,

44:42

and something doesn't work. In this sense, often for confirmation, for example,

44:49

they perform what is called a trio analysis, where they read and screen the genomes of the parents

44:56

and the child to understand what actually happened.

44:59

Is it truly de novo? Or is it a recombination of what the parents carry?
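
The logic of comparing a child with both parents can be sketched as a small classifier. This is a heavily simplified illustration, not a clinical method: genotypes are reduced to the number of copies carrying the variant (0, 1 or 2), sequencing errors are ignored, and the function name `classify_variant` and its labels are invented for this example.

```python
def classify_variant(child, mother, father):
    """Classify one variant in a trio by the number of variant copies each person carries."""
    for copies in (child, mother, father):
        if copies not in (0, 1, 2):
            raise ValueError("each genotype must be 0, 1 or 2 variant copies")
    if child == 0:
        return "not present in child"
    if mother == 0 and father == 0:
        # Neither parent carries it, so it looks new in the child.
        return "de novo candidate"
    if child == 2:
        if mother >= 1 and father >= 1:
            # One copy from each side: the redundancy is gone.
            return "inherited from both parents"
        return "needs review"  # two copies, but only one parent is a carrier
    return "inherited from one parent"


# Two healthy carrier parents, child with both copies affected.
print(classify_variant(child=2, mother=1, father=1))
# Variant seen only in the child.
print(classify_variant(child=1, mother=0, father=0))
```

The first call corresponds to the situation described above, where each parent has one working copy and the child received the faulty copy from both sides.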

45:06

So, in principle, this is more or less the mainstream at the moment.

45:10

And in various aspects, this has already become routine in analysis and in medicine,

45:19

and in biological research, that is, in a bunch of fields.

45:23

That is, it doesn't have to be related to studying a disease.

45:26

People who are involved in agriculture in one way or another

45:30

are also interested in studying resistance

45:35

to, say, pests of agricultural crops.

45:40

For example, all the bananas that exist in the world are actually clones.

45:46

They are one variety. So in the case of bananas, there is no varietal diversity,

45:51

specifically among the commercial ones that we can go and buy in the store.

45:58

That is, if some remarkable pest appears

46:02

to which the bananas are not resistant,

46:08

we risk ending up in a situation where bananas simply cease to exist in the world.

46:13

And that's why this is one of the interesting areas where they are really trying

46:18

to develop significantly new varieties, but, of course,

46:22

without losing the qualities that we are used to.

46:27

No one wants to eat unappetizing bananas.

46:30

In fact, this has already happened more than once with those same bananas:

46:36

there was an epidemic that wiped out a variety;

46:41

I might be mistaken, but it was around the 1930s.

46:48

Or, for example, a very interesting task is the attempt to analyze all sorts of

46:55

so-called microbiomes, where things are analyzed together.

47:02

That is, in nature, everything is very interconnected.

47:06

We mentioned that there are things related to duplication.

47:12

In nature, the opposite of duplication often occurs as well.

47:15

There is a situation of symbiosis, when certain living organisms lose a specific function,

47:21

simply because they have symbionts that perform that function for them.

47:26

I mean, if we look at all these guys individually,

47:32

they are not viable. We cannot grow them in a test tube or on Petri dishes separately.

47:40

It has to be a mandatory symbiotic colony, so that they actually support each other.

47:46

Well, a very good example: in Russia there are areas of taiga

47:54

called black taiga, which are very interesting because a phenomenon known as

48:04

plant gigantism is observed there. That is, there really is grass growing taller than a person,

48:12

without any additional factors.

48:16

Naturally, this raises an interesting question: what is the reason for this?

48:19

Because these are not some separate varieties; it is just the most ordinary

48:25

grass that grows in meadows and fields across the country, but in this specific place, for some reason,

48:32

and specifically in these taiga areas, it grows taller than a person. What is the reason?

48:38

If we understand this, we can immediately guess that it gives us very significant,

48:44

as they say, interesting applications for growing new crops,

48:52

for increasing yields and so on. So far, it is known that the reason is the soil.

48:58

But the exact cause is unclear. That is, if you take a bag of soil from there and plant something in it,

49:04

you get an ordinary radish that grows two to three times larger.

49:10

The reason is unknown. There is a hypothesis that it is a unique soil microbiome,

49:19

meaning that there is a specific family of soil bacteria present,

49:25

which perform some kind of magic, producing certain bioactive substances

49:29

that stimulate this. In general, there are many applications.

49:35

One could talk about them for hours, but all of this is impossible without...

49:44

Actually returning to bioinformatics and LLVM, it is impossible without tools.

49:50

If we don't have a hammer, we can't drive a nail. Driving nails by hand, especially with your head, is not very pleasant.

49:59

So we need tools. In the world of compilers, we have the whole of LLVM.

50:04

One of the tools we developed was a genome assembler.

50:11

This is a sophisticated tool that researchers have used and continue to use all over the world.

50:17

And, perhaps without false modesty, one can say that this is a kind of Russian brand in bioinformatics.

50:27

Because the article in which the original assembler was published has been cited over 15,000 times.

50:36

And this article is among the top 50 most cited articles in the world.

50:47

So this is indeed something, as they say,

50:50

fundamental.

50:52

Well, one could say that it is a widely known thing, but within narrow circles, which is fine.

50:59

- That's great.

51:01

I would like to get your advice for young people.

51:06

What should one focus on now in the field of compilers and bioinformatics to achieve such success in 10-20 years?

51:15

What do you think is still unexplored? What currently requires investment?

51:21

- You know, I think that's the wrong approach.

51:26

I mean a list of 10 points. We saw those some time ago:

51:35

what you need to do to get into IT.

51:40

That's the wrong approach.

51:44

Yes, you can read a ton of books about the 10 things you need to do to be rich, happy, smart, kind, beautiful,

51:53

and so on down the list. But this approach is incorrect. The right approach is that you need to have the desire

52:02

and the opportunity. You probably need to learn, as cliché as it sounds. Everything is changing.

52:11

That is, what we were doing 10 years ago, 15 years ago, 20 years ago, is certainly sometimes relevant,

52:17

but often it is not. You cannot constantly live in the past. It remains in the past. You need to have

52:23

the ability to constantly understand what new things are happening, to be able

52:31

to adapt and so on. Again, we see explosive growth in various things

52:37

related to all sorts of agents, assistants, and everything else. Perhaps

52:44

in about 10 years, what will actually be needed is the ability to pose questions

52:51

correctly in order to get the right answers. But on the other hand, this is also difficult.

52:54

That is, you need to learn to ask the right questions and, in my opinion, this is even more important,

53:03

to be able to assess the adequacy of the answers. Because automating routine tasks is great,

53:13

but when someone does something for you, you need to understand that, however you received it, it

53:19

remains yours. And you start to be responsible for it. - Perhaps you know what is currently lacking

53:28

in the LLVM ecosystem, what one would like to do, but there is no strength, no manpower, and then the youth would get involved,

53:34

and that would make it happen. Some tool, maybe some methods, maybe some...

53:40

extensions, add-ons. Take MLIR: it appeared not so long ago. It was a new idea,

53:46

surely created by those who... - The idea was new, but essentially, it was a deep

53:56

reworking of old ideas. That is, we are not saying that we invented something completely

54:01

new, something that had never existed before. The concepts remain the same. We are just

54:07

reinterpreting them and learning from our own mistakes. In LLVM itself, there is a whole cartload of minor

54:14

tasks that can be taken on in order to, as they say, figure things out. That is, to understand how everything is arranged,

54:20

and then everyone finds something that suits them. Some people realize that they much prefer to work

54:29

very, very close to the machine code itself, with backends and everything else.

54:34

Others say, no, no, no, this is terrible, terrible, terrible, I don't want it, I don't want it. Leave your bits,

54:39

bytes, and registers to someone else. I'm interested in all sorts of high-level things, I don't know,

54:46

checking, yes, I don't know, checking protocol matching against a template, and similar things.

54:57

To each their own. Moreover, if we talk about large projects, for example, LLVM itself...

55:05

LLVM has actually, for a long time, since 2008, been participating in Google Summer of Code,

55:15

that is, a program where large companies actually pay for such projects,

55:26

for summer projects, but they are done in open source. Unfortunately, Google Summer of Code is closed to

55:32

many listeners from Russia and Belarus, but in principle, similar things exist, and

55:41

open projects still remain, so in fact, no one prevents you from, for example,

55:47

taking a project that no one has done, or that no one has taken, and trying to do something there in your free

55:55

time, so that option also remains. - And in general, how does the LLVM community treat newcomers?

56:02

Aggressively? - No, very well, as always. The community treats newcomers the way

56:13

those who come to it actually treat the community. That is, the community tries to help newcomers by answering

56:19

reasonable questions, as long as the people on the other side, so to speak, are good and reasonable. Well, in principle,

56:26

like any healthy community, where there is constant rotation. That is, it is necessary

56:34

to understand that the LLVM community, despite the fact that it is an open source project, still

56:41

consists, about 60 percent of it, of people who, one way or another, work somewhere for someone and receive

56:48

a salary for it. So, in one way or another, there is a constant rotation, and it's not the case

56:54

that the veterans sit for 30 years and don't let anyone into their sandbox. That simply doesn't happen, for natural

57:05

reasons. Some say it's bad that 60% of the people push what their companies need,

57:15

but there are no universally good things. That is, yes, there are people who, of course, do what

57:22

will benefit their employer. Yes, that's true. On the other hand, again, today one thing

57:29

is driven, tomorrow another, the day after tomorrow a third. This ensures a healthy rotation; people move

57:34

from one employer to another, meaning they remain, among other things, in the community, and that is also valuable.

57:39

- Would you recommend that young people get involved in the LLVM project? Or is it too difficult and unpromising?

57:46

- Well, as for promise, I would probably say the opposite: it is very promising. Especially

57:54

since adding a simple line to a resume, that I was a contributor to LLVM, I did this and that,

58:02

this and that, often gives a significant boost right away to the quality

58:13

of the resume. Moreover, given that all development is currently being done on

58:20

these platforms, it's very easy to see what a person has done. That is,

58:26

in other words, it allows for independent verification of what is written in the resume, because

58:31

absolutely anything can be written there, I don't know, "the main lead, the maintainer of LLVM";

58:37

yes, you can write that. So we take it and check it; it's quite a standard

58:45

thing: if a person writes that they have experience with LLVM, okay, good, we take a look at what they

58:49

were working on and when they were working on it. And you can immediately see, among other things, the quality of their work, and how they,

58:56

for example, communicated with reviewers during code review, and similar things. So LLVM is quite

59:03

promising, and, one way or another, it is what is under the hood of many compilers

59:12

that everyone uses every day. And there are tasks of different kinds, different sorts:

59:20

there are very small ones, there are larger ones, there are vast ones. Probably anyone

59:30

will find something that resonates with them. - Well, that's great. Now, I want to ask the last question. Maybe

59:36

let's give some kind of advice to those who listened to us, and to those who are interested in your

59:41

areas. Maybe you need contributors, maybe you need help from those who are ready to get involved

59:46

in bioinformatics, in these projects, maybe in LLVM, in your area within LLVM. Do you need such

59:54

people? Is help needed? - Yes, and again, all these projects are open, so no one

60:03

is stopping anyone from taking a look at what problems exist, and extra hands are probably always welcome. That is, in this respect

60:10

there is no such thing as "no, no, no, don't come, this is my sandbox." No. So,

60:18

if someone is interested, then of course, welcome. Everything is available; you can take it, jump into the fray,

60:26

and do something. - Well, okay, I hope our conversation inspires someone

60:33

to engage in this topic. Many are afraid of large open-source projects; they think it's difficult there,

60:38

there are indeed experienced people and huge codebases, and for a student to break into that, for example,

60:43

third-year students tend to feel very intimidated. - It's true, but you really need to overcome that, yes, that is,

60:53

indeed, some people are afraid, like, oh-oh-oh, this is all very big, I'm very scared, I don't know anything,

61:00

I don't understand anything. An important skill is that you need to learn to work in conditions of incomplete

61:10

information. That is, you don't need to understand all of Chrome in order to contribute something to it. You don't

61:15

need to understand all of LLVM in order to contribute something to it. You don't need to understand all of GCC in order

61:19

to contribute something to it. And you shouldn't think that there is something unique about large open-source projects;

61:27

again, at the university they only teach a small piece, so to speak, of this or that

61:35

area of mathematics, yet somehow we are not afraid of all of that, right,

61:41

we don't say, oh, well, I don't know everything, so, like, I'm scared, I won't engage in this. That is,

61:48

it's the same everywhere: it's impossible to know everything, and that's okay, but it's important to know,

61:55

it's good to know at least something around you, and that's enough. - Thank you very much for your time,

62:05

I hope we helped someone, inspired someone, motivated someone. - I hope so. - That's it, Anton, goodbye. - Yes,

62:13

take care. Goodbye. - Thank you to everyone who listened.

Interactive Summary

The video features a discussion with Anton, an expert in LLVM and bioinformatics. He explains LLVM as a framework for building compilers and tools for code analysis and transformation, clarifying that it's not just an intermediate language but a comprehensive ecosystem. Anton shares his personal journey into LLVM development, starting with a need for code transformation tools and his significant contribution to adding Windows support. He contrasts LLVM with GCC, highlighting LLVM's flexibility and modern design. The conversation then shifts to bioinformatics, where Anton explains his background in applied mathematics and computational statistics, and how he transitioned into bioinformatics through a research lab focused on tool development. He uses the analogy of reconstructing a shredded newspaper to explain the challenge of genome assembly, a key area where his team developed the widely cited SPAdes assembler. Anton elaborates on the practical applications of genome analysis in medicine, agriculture, and understanding complex biological systems like microbiomes. He emphasizes that LLVM and bioinformatics, despite their complexity, are accessible through well-developed tools and a supportive open-source community. Finally, he offers advice to aspiring developers, encouraging a proactive learning approach, adaptability, and the courage to contribute to large open-source projects, assuring that newcomers are welcomed and their contributions valued.
