How to get started with transformers and NLP

Aleksa Gordić
21 min read · Dec 5, 2020

As I often do, I’d like to share my personal story before jumping into the “core part” of the blog — if you only came to read about the transformers/NLP resources please skip this section (I still love you! ❤️). So here we go!

This year I decided to structure my deep learning journey really neatly. It all started in February 2020 with Neural Style Transfer.

After studying it for months, reading many research papers, blogs, watching videos, etc. I started my own YouTube channel “The AI Epiphany” and I kicked off the NST series:

My first YouTube video

This was the first-ever video I made — not the one I am the proudest of but it was the first step and a step in the right direction!

My idea for 2020 was basically this:

I am going to take a full year to cover the 6 areas of deep learning (I’ll mention which ones a bit later) that are super interesting to me, and I’ll gain an even stronger understanding of deep learning, ML, and mathematics.

I’ll do that by covering both the theory (usually via the top-down approach) like reading blogs, watching videos, reading books, and all the relevant research papers particular to the current topic I’m working on, and by coding relevant projects from scratch (mostly reimplementing papers) in PyTorch.

This is where many make a mistake — until you’ve tried to get your hands dirty by developing some project on your own, you don’t know how many details you actually don’t know! Believe me on this one…

The devil is in the details

Aside from that, I won’t repeat the mistake I’ve made throughout my life: this time I’ll share my learning journey with the world along the way.

Until that point, I always thought I should first become a world-class expert before I start teaching others (and it also seemed like an unnecessary distraction at that point in time), but then I realized (at least) 3 things:

  1. You learn a looooot by teaching others (that’s 5 ‘o’s yes, a lot.)
  2. You are best positioned to help people one step behind you (probably)
  3. A whole world of opportunities opens up once the broader community becomes aware of your existence (most certainly)

So, how I like to think about it is I have 3 main content platforms, namely:

  1. My YouTube channel The AI Epiphany where I’ll keep on sharing everything AI related that I find important (I’ve been mostly covering deep learning research papers lately and I’ll continue doing that)
  2. GitHub where I open-source all of my projects (I love this platform ❤️)
  3. Medium where I occasionally write (I definitely spend the least time here)

And I reshare that content in a lot of places like Facebook AI groups, Reddit, etc., but my central social media locations, where I also share my thoughts, are definitely LinkedIn (I was really bullish about it in 2020) and Twitter.

I only just recently started sharing my thoughts more actively on Twitter — as of today actually (the day I wrote this blog 😅).

The AI community is super strong on Twitter. I’ve been following AI people there for 2 years but I wasn’t active myself. That’s about to change!

So long story short my Neural Style Transfer field exploration ended up with me reading 20+ research papers, uploading 7 videos in the NST series, and open-sourcing 3 GitHub projects:

Here is a nice neural style transfer pic for you!

You can create these using some of the projects that I’ve linked above

I later covered the Deep Dream algorithm (GitHub Deep Dream). And yeah a fun fact — I used this one to create my YouTube/GitHub/<fill in the blank> banners so I do use my own software you know!

Here is a psychedelic pic I generated (I find DeepDream truly fascinating):

DeepDream algo I developed, you can create it using this code.

But. I hear you asking:

“But why NST and DeepDream, those things are useless?”

And they are.

I’m joking. The short answer is: they’re fun and awesome, they’re definitely useful to artists, you’ll get really familiar with ConvNets and a DL framework of your choice, and you’ll probably learn a bit about interpretability.

Also, many ideas from NST appear elsewhere. For example, the famous StyleGAN paper used an idea from the NST literature (adaptive instance normalization).

I love art and I’m a highly visual type of learner (I think so at least) so in my probably overly biased opinion, I think these 2 projects are a great way to get your hands dirty with deep learning.

The third area I explored was Generative Adversarial Networks (GitHub GANs). Here is some latent space exploration for you:

from man -> woman (notice that the rotation and skin tone change as well)

I learned a lot about GANs and they are a fascinating field, so you can probably expect me to do something related to GANs in the future.

And finally, I just “finished” exploring the transformers and NLP field in general! Unfortunately, I haven’t been sharing my journey on Medium from the very onset, but I still have 2 topics left (see my next steps chapter at the end of the blog) so I’ll hopefully be at 50% coverage! 😅

Note: Aside from NST which took me much longer and Deep Dream which took less time, it usually takes about 2 months of intensive learning for me to feel good about my understanding of the topic. I also work full time at Microsoft — so it’s pretty intensive 😆, but I like it a lot — I’m probably sick.

OK! That was a short attempt to compensate for the lack of 3 thorough blogs. Hopefully, some of you found it useful or motivating I dunno.

Now let’s jump to NLP and transformers. I’ll share the theory I found useful, some videos I created so far (which could further help you), and what I’ve learned reimplementing the original transformer paper from scratch.

A lofty goal. Here we go!

Personal update: I just created a monthly AI newsletter and a Discord community! Subscribe/join those to keep in touch with the latest & greatest AI news!

The trigger point — Vision Transformer (ViT)

I’ll tell you how I approached learning this subfield of deep learning. You’ll probably get the most out of it if you already have a decent background in some other subfield of deep learning. For me, that was computer vision.

So I initially wanted to cover GNNs first, but on October 3rd, 2020 the Vision Transformer (ViT) paper was published. Since I’m really bullish on computer vision, and this paper showed that a pure transformer can match or beat state-of-the-art CNNs when pretrained on enough data, I decided I should get a solid grasp of transformers sooner rather than later.

So I went ahead and read the Vision Transformer (ViT) paper. At least I tried. In the first go I didn’t understand the details (I did understand the main point of the paper though) as it had a hard dependency on the original transformer paper so I took a step back and I read the original transformer paper:

Attention Is All You Need.

Now at this point, I realized I had many gaps in my knowledge, so I took a further step back and decided I needed to learn about attention/self-attention as well as word embeddings, tokenization, etc.

I did have some notion of all of those concepts — but it was far from a deep understanding (no pun intended).

Enter the dependency hell — Aleksa Gordić, October-November 2020

I also knew I wanted to understand the main trends of NLP. BERT/GPT/Transformers were the recurring topics I kept hearing about, even though I was nicely cushioned in my small bubble world of computer vision.

So, on one hand, I wanted to understand NLP and its main trends and on the other hand, I really wanted to understand the Vision Transformer. Those interleaving goals helped shape my strategy.

Over the next section, I’ll group the relevant papers, blogs, and in general different resources, which helped me in achieving those goals.

I warn you — there was a lot of back and forth between different “groups” of resources and subtopics. Learning is far from a linear process.

Photo by Ricardo Viana on Unsplash

Sometimes you’ll read something and you won’t understand anything. So you’ll go to a completely new subtopic, new connections will form, the old ones will sink in, you’ll return, re-read that something and understand it a bit better.

Sometimes you’ll quickly skim papers just to get the bigger picture and form that initial skeleton of the knowledge graph, and only later will you read them thoroughly.

Sometimes you’ll read the paper, then you’ll take a step back, read some blogs or watch videos that contain higher-level explanations, and get back to the paper again.

There is a lot of back and forth, both vertically, between the low-level and high-level explanations (across the knowledge pyramid), and horizontally, across different subtopics (e.g. attention/tokenization) of the subfield (e.g. NLP) you’re learning.

That’s the reason you shouldn’t consider the list of resources that follows as a chronological list which you should linearly follow in order to learn NLP.

Although that approach might just work for you — who knows — the trick is to never give up!

NLP/Transformers learning resources

Here we go. Let’s start with attention — that’s a key concept that you’ll be seeing everywhere going forward. Make sure you deeply understand it.

Attention/self-attention/50 shades of attention

Photo by Markus Spiske on Unsplash (Attention!)

Blogs (sorted approx. from the best ones towards less valuable ones IMO):

The first link will probably suffice, it’s an amazing resource. Feel free to go at your own pace and check out the other links as well (I’m only linking stuff I read myself and found to be of a good/great quality).

How I would probably go about reading this blog if I were you is I’d first read it without stopping, top to bottom, and then I’d return back to sections/links that I think are the most important for me. Nice! Let’s continue!

Low level (papers) that introduced attention to the field:

Once you grasp these 2 papers you’ll understand attention.

Next, I’ll introduce a couple more concepts that are used in transformers.


Tokenization

Photo by Max Delsid on Unsplash (Let’s chop those words/sentences)

Tokenization is a world of its own. In most transformer papers it’s mentioned in maybe one line, but it’s really important. That being said, you can treat it as a black box most of the time. Here is the best resource I found:

This one from the famous Stanford professor Christopher Manning is also really nice.
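To make the black box a bit less black, here is a minimal sketch of one common idea: splitting a word into subword pieces by greedy longest-match. This is only an illustration in the spirit of WordPiece-style tokenizers, not the exact algorithm of any particular library. The vocabulary, the "##" continuation marker and the function names are all assumptions made up for this example.

```python
def tokenize_word(word, vocab, unk="<unk>"):
    """Split one word into subword tokens by greedy longest-match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: give up on the whole word
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split each word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


# Hypothetical toy vocabulary, invented for this example.
TOY_VOCAB = {"token", "##ization", "##s", "trans", "##former", "is", "fun"}

print(tokenize("Tokenization is fun", TOY_VOCAB))
print(tokenize("transformers", TOY_VOCAB))
```

Real tokenizers learn their vocabulary from a corpus and handle punctuation, casing and unknown characters far more carefully. The point here is only that rare words get chopped into more frequent pieces.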

Word Embeddings

Photo by Amanda Jones on Unsplash

This one is necessary to better appreciate the deep contextualized token representations we have nowadays with pretrained transformers.

Word embeddings played an important role in NLP history.

The drawback these representations have is that they aren’t contextual. What I mean by that is that e.g. Word2Vec or GloVe would give you the same embedding vector for the word/token “queen” in both of these sentences:

  1. “Queen Elizabeth II has ruled for longer than any other Monarch in British history.”
  2. “The Queen Bee plays a vital role in the hive because she is the only female with fully developed ovaries.”

That happens even though the semantics, i.e. the meanings of the word in those two contexts, are totally different. Here are the 2 best high-level explanations I found:

And here are some papers worth reading (pick only one, you don’t need both; I skipped GloVe. You just want to get the idea of how these were trained):

At this point you could probably take a look at this seminal paper written in 2003 by the Turing award laureate Yoshua Bengio:

It’ll help you further consolidate your knowledge.

Contextualized shallow representations

I mentioned the “queen” example and how we're failing to capture the meaning of the word based on the context.
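The “queen” problem is easy to see in code. Below is a minimal sketch, assuming a static embedding is nothing more than a lookup table from token to vector. The 3-dimensional vectors and the names STATIC_EMBEDDINGS and embed are made up for illustration; real Word2Vec/GloVe vectors are learned from a large corpus and typically have a few hundred dimensions.

```python
# Hypothetical embedding table, invented for this example.
STATIC_EMBEDDINGS = {
    "queen": [0.8, 0.1, 0.3],
    "bee": [0.1, 0.9, 0.2],
    "elizabeth": [0.7, 0.2, 0.4],
}
UNKNOWN = [0.0, 0.0, 0.0]


def embed(sentence):
    """Map each lowercase token to its vector; the context is ignored."""
    tokens = sentence.lower().replace(".", "").split()
    return {token: STATIC_EMBEDDINGS.get(token, UNKNOWN) for token in tokens}


royal = embed("Queen Elizabeth II has ruled for longer than any other monarch.")
insect = embed("The Queen Bee plays a vital role in the hive.")

# The lookup never sees the neighbouring words, so both vectors are identical.
print(royal["queen"] == insect["queen"])
```

A contextual model would instead compute the vector for “queen” from the whole sentence, so the two occurrences would end up with different representations.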

Can we do better than Word2Vec and GloVe?

And so, contextualized word-embeddings were born. The first one that I’m aware of was this method called CoVe:

These CoVe embeddings were contextual, but you still had to develop task-specific architectures on top of them, and only the last layer’s information was used to solve the downstream NLP tasks.

Then came ELMo which further refined the idea:

They noticed that different layers contained different kinds of semantic/syntactic information, so they used all of those representations to solve the downstream NLP tasks, and that helped.
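Here is a minimal sketch of that “use all of the layers” idea, as I understand it from the ELMo paper: each downstream task learns one scalar weight per layer (normalized with a softmax) plus a global scale, and the final token representation is the weighted sum of the per-layer vectors. The function names and the toy 2-dimensional vectors below are assumptions for illustration only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_vectors, layer_logits, gamma=1.0):
    """Combine one token's per-layer vectors into a single vector."""
    weights = softmax(layer_logits)  # one weight per layer, summing to 1
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for weight, vector in zip(weights, layer_vectors):
        for i in range(dim):
            mixed[i] += weight * vector[i]
    return [gamma * value for value in mixed]


# Toy vectors for one token coming out of 3 layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits mean every layer contributes equally (a plain average).
print(scalar_mix(layers, layer_logits=[0.0, 0.0, 0.0]))

# A large logit for the last layer makes the mix rely mostly on that layer.
print(scalar_mix(layers, layer_logits=[0.0, 0.0, 5.0]))
```

In a real setup the logits and the scale are trainable parameters updated together with the downstream model.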

ELMo gained some serious traction, TechCrunch even made a blog about it:

Now around this point in time, things were getting really intense, with lots of interesting research happening in parallel. The transformer arrived in 2017, a bit later GPT and BERT came onto the scene, and the rest is history.

Preceding those events, there are 2 more important dots we need:

Somewhere around the time ULM-FiT appeared, the NLP world started shifting. Something strange was happening. The NLP field was entering its “ImageNet moment”. All of the previous research slowly enabled the power of full-blown transfer learning in the field of NLP.

Researchers figured out that we can do transfer learning in NLP the same way it had been done in the computer vision world since 2012, when the famous AlexNet/ImageNet moment happened. Basically, the CNNs (re)appeared and left every other approach in the dust.

The ImageNet moment

You could see the “top-5” error metric saturating at around 25%. Those were the old hand-crafted ML methods. Then CNNs made a quantum jump in 2012, and this wasn’t just the start of a new era in computer vision, it was the start of the deep learning era/hype as well.

We’ve known about CNNs since the 1980s, so what was different this time? Well, powerful hardware and lots of data (like the ImageNet dataset). And a handful of people with a huge dose of “belief” that this time CNNs would shine.

Sebastian Ruder nicely documented this moment of NLP history:

Sebastian also has (had? the last one came out in April 2020) an awesome monthly newsletter so do check out that one (or subscribe) as well:

He’s a scientist at DeepMind and really famous in the NLP community. It’s always a good idea to surround yourself with the best from the field if you want to stay up to date.

And now roll out the red carpet.

Photo by Rob Laughter on Unsplash (Let the show begin!)


Transformer

Mighty transformers. This was the paper that helped start the revolution, the “ImageNet moment” of NLP:

I created a detailed explanation of the paper that can help you understand all of the nitty-gritty details (I’ve got a lot more transformers-related resources that I’ve created scattered across the rest of this blog post):

A deep dive into “Attention Is All You Need” paper

And going further, This Blog Is All You Need:

Here are a couple of nice videos as well:

For a really nice recap of attention I’d recommend this blog:

It’ll also introduce you to the concepts of Neural Turing Machines and Pointer networks. And finally, if you want to get your hands dirty with some code, here is the project I developed and open-sourced:


BERT (Bidirectional Encoder Representations from Transformers)

Let’s continue our journey through the magic world of NLP.

I hear you cry “but BERT is just a transformer!”. No, it’s not! They even coined a separate term for exploring BERT-like models — “bertology”. Nice, right?

Anyway, architecture-wise BERT is just the encoder portion of the transformer above.

What it introduced were some details like the “class” ([CLS]) token and a new pretraining objective: masked language modeling, or MLM for short.

It is also bidirectional, in contrast to the unidirectional/causal decoder of the original transformer (and to GPT). That gives it an edge in many NLP tasks where it’s useful to have both the left and right context; the most obvious one is predicting a word in the middle of a sentence (the fill-in-the-blank task).
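To make the fill-in-the-blank objective concrete, here is a minimal sketch of how an MLM training example can be built. It is deliberately simplified: selected tokens are always replaced with [MASK], whereas the BERT paper selects about 15% of the tokens and replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time. The function name mask_tokens and its parameters are assumptions for illustration.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM training example."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)  # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)   # no loss is computed at this position
    return masked, labels


sentence = "the queen bee plays a vital role in the hive".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the tokens on both sides of every [MASK], it can use the left and the right context when predicting the missing word.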

And again this blog is the best resource I found:

Also, Google published 2 nice blogs on it:

If you like the video format this one is a nice high-level overview:

That’s BERT. Now, the second really famous family of models was of course the GPT family.

GPT (Generative Pre-trained Transformer)

GPT has a totally different approach and different goals from BERT. It’s an autoregressive language model. It’s causal and it’s unidirectional. It’s really good at generating text — especially long stories humans have trouble distinguishing from human-written stories (particularly the GPT-3 model).

Architecture-wise GPT basically took the decoder part of the transformer model (contains causal masks — i.e. tokens can’t look ahead) and pushed it to its limits.
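The causal mask is the whole difference between “can look everywhere” (BERT-style) and “can’t look ahead” (GPT-style) attention. Here is a minimal pure-Python sketch of scaled dot-product attention weights with an optional causal mask. As a simplifying assumption, the same toy vectors are used as queries and keys, with no learned projections and no multiple heads; the function names are made up for this example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors, causal):
    """Scaled dot-product attention weights, one row per query position."""
    n, dim = len(vectors), len(vectors[0])
    rows = []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                # Position i may not attend to a later position j.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                scores.append(dot / math.sqrt(dim))
        rows.append(softmax(scores))
    return rows


toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print("bidirectional:")
for row in attention_weights(toy_tokens, causal=False):
    print([round(w, 3) for w in row])

print("causal:")
for row in attention_weights(toy_tokens, causal=True):
    print([round(w, 3) for w in row])
```

In the causal case everything above the diagonal is zero, so the first token can only attend to itself, the second one to the first two, and so on.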

The latest iteration of the GPT family, the famous GPT-3 model, is basically competing for the position of the most general language model. The idea is that no fine-tuning is needed for downstream NLP tasks: pretrain once and use it for everything by showing it a few examples in the prompt.

Again this could easily be a separate blog post. Here are the papers:

  • GPT 1 (Improving Language Understanding by Generative Pre-Training)
  • GPT 2 (Language Models are Unsupervised Multitask Learners)
  • GPT 3 (Language Models are Few-Shot Learners)

I also recently covered the monstrous (75 pages long) GPT-3 research paper on my YouTube channel:

Overview of the GPT-3 paper and the hype and anti-hype around it

I only recently started doing these paper distillation videos so if you found them useful I’d appreciate your feedback and I’ll start doing even more of those. Yannic also did a nice video!

Again Jay did a marvelous explanation on his blog:

And TBH I really like OpenAI’s blogs:

The AI development and compute power relationship is rather interesting so I thought it’d be nice mentioning that blog as well.

Finally here are some blogs covering the hype-anti-hype spectrum.

Verge blog — a nice collection of what happened on social media and in general around GPT-3 after it was first published.

This one nicely explains the topology of the hype-spectrum, this one gives a nice overview of the hype also (and model limitations, etc.).

You can see some of the “GPT-3’s pitfalls” here and finally by far the most thorough resource out there is that by Gwern:

Gwern’s text is a bit more difficult to parse IMO, but it’s got all of the details you’d want to know about GPT-3's real-world behavior (compared to non-intuitive objective metrics like perplexity, etc.)

Finally, here are a couple of GPT apps you could try out:

Is there life after BERT and GPT?

Photo by Marlon Nartea on Unsplash

You bet there is! First, it’d be worth checking out these 2 nice overview blogs.

Here are some of the interesting papers I’ve read:

  • TransformerXL (Google) (smart way to widen the context of LM)
  • XLNet (Google) (proposes a way to unite BERT’s MLM and GPT’s LM objectives, uses Transformer XL as a backbone, 512 TPU v3s)
  • RoBERTa (FAIR) (strong dependency on BERT and XLNet, 1024 V100s)
  • ALBERT (Google) (smart way to get more efficient BERT, ≤512 TPU v3s)
  • T5 (Google) (the best overview of the transformers landscape, TPU pods!)
  • Uni LM (Microsoft) (introduced prefix-causal masking, 8 V100s)

I intentionally wrote the affiliation next to each paper just to paint a picture of what the current NLP landscape looks like. By my rough estimate, 90% of the highly influential research came from Google (including the original transformer).

I’ll stop here, there are a lot of things that could be said on that topic! 😅

I also wrote down the amount of compute that was needed. Again, I won’t go there either! (nervous chuckle).

You should probably read TransformerXL and XLNet in a batch. XLNet is super hard to understand IMO (and others share my opinion haha) and it’s also hard to reproduce. T5 later showed that it’s debatable whether we need those complications in the first place, but it’s still worth a read.

RoBERTa did a combinatorial space exploration around the original BERT and showed that with more data, compute and some hyperparam exploration you can gain much more out of BERT.

ALBERT experiments with parameter sharing, embedding matrix decomposition, and gradient checkpointing (to save up some memory). All of this leads to a much smaller memory footprint.
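The embedding matrix decomposition is just arithmetic, so here is a quick worked example. Instead of one V x H embedding matrix, you use a V x E matrix followed by an E x H projection, with E much smaller than H. The concrete sizes below (30,000 tokens, hidden size 768, embedding size 128) are illustrative assumptions, and the function name is made up for this example.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters with or without the factorization."""
    if embedding_size is None:
        # One big matrix: vocab_size x hidden_size.
        return vocab_size * hidden_size
    # Two smaller matrices: vocab_size x embedding_size, then
    # embedding_size x hidden_size.
    return vocab_size * embedding_size + embedding_size * hidden_size


full = embedding_params(30_000, 768)
factorized = embedding_params(30_000, 768, embedding_size=128)

print(f"full:       {full:,}")
print(f"factorized: {factorized:,}")
```

With these numbers the embedding parameters drop from about 23.0M to about 3.9M. The parameter sharing across layers is a separate trick that shrinks the rest of the model.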

Finally, if there is one paper you should read, that would be T5. Google also published a nice blog on it. I also really liked the Uni LM paper (not because I work at Microsoft, it’s truly a nice read!)

The second group of papers I’ll mention here have to do with making transformers more efficient (memory/time complexity):

  • SparseTransformer (OpenAI) (introduced new attention patterns, O(n*sqrt(n)) compared to the original transformer’s O(n²), 8–64 V100s)
  • Longformer (Allen AI) (introduced new attention patterns, 8 RTX8000)
  • Reformer (Google) (introduces LSH, chunking, gradient checkpointing, etc. Here is the official blog, a nice overview (cool animations) and a nice in-depth explanation, 8 TPU v3 cores)
  • Linformer (FAIR) (low rank approximation, O(k*n) ~ O(n), 64 V100s)

You can see that the necessary HW for training these is way smaller than for the above group of papers (that’s why I’ve put ALBERT in that first group).
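To get a feeling for the complexity figures in the list above, here is a minimal sketch that simply counts how many (query, key) scores each kind of pattern has to compute for a sequence of n tokens. These are back-of-the-envelope counts under simplifying assumptions, not implementations of the papers: the sliding window stands in for a local attention pattern, and the functions and their names are made up for this example.

```python
import math


def full_attention_pairs(n):
    """Every token attends to every token: O(n^2)."""
    return n * n


def local_window_pairs(n, window):
    """Each token attends to itself and `window` neighbours on each side."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total


def sqrt_budget_pairs(n):
    """Roughly sqrt(n) keys per query: O(n * sqrt(n))."""
    return n * math.isqrt(n)


def low_rank_pairs(n, k):
    """Keys projected down to k positions: O(k * n)."""
    return n * k


n = 4096
print(f"full:        {full_attention_pairs(n):,}")
print(f"sqrt budget: {sqrt_budget_pairs(n):,}")
print(f"window=256:  {local_window_pairs(n, 256):,}")
print(f"low rank 256:{low_rank_pairs(n, 256):,}")
```

For n = 4096 the full pattern needs about 16.8M scores while the sqrt budget needs about 262K, which is why these tricks matter once sequences get long.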

Phew let me stop here. That was a lot of theory to go through. I explained a part of the process of how I approached learning all of this in this video:

How I approach learning a new deep learning field

What I’ve learned implementing the transformer

Even after reading through a bunch of excellent theory I knew there were still details that I didn’t understand, so, as always, I wanted to implement at least 1 project in the field I’m interested in.

So after ~3 weeks of non-stop work I created this one:

I learned a looot. One thing that used to confuse me was how we start the generation process in autoregressive architectures like the original transformer. That’s a detail none of the above blogs/papers covered.

It’s one of those “tribal knowledge” details. If you’ve been in the field long enough you know that stuff, so it feels redundant to say it out loud.

And the answer is that you trigger it with the embedding vector of the start-of-sentence token (commonly denoted as <sos>). Whatever token comes out at the topmost layer you append to the input and feed back in as the second token, and you repeat that until the end-of-sentence token shows up. Voila!
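Here is a minimal sketch of that loop with greedy decoding. The “decoder” is a hypothetical stand-in, a lookup table keyed by the last token, so that the example runs on its own. A real transformer decoder would look at the whole prefix (and at the encoder output) and return scores over its full vocabulary. All names and numbers below are made up for illustration.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical toy "model": scores for the next token given the last token.
TOY_NEXT_TOKEN_SCORES = {
    SOS: {"hello": 0.9, "world": 0.1},
    "hello": {"world": 0.8, EOS: 0.2},
    "world": {"again": 0.3, EOS: 0.7},
    "again": {EOS: 1.0},
}


def toy_decoder(prefix):
    """Return next-token scores for the given prefix (toy stand-in)."""
    return TOY_NEXT_TOKEN_SCORES[prefix[-1]]


def greedy_decode(decoder, max_len=10):
    tokens = [SOS]  # generation is triggered by the <sos> token
    while len(tokens) < max_len:
        scores = decoder(tokens)
        next_token = max(scores, key=scores.get)  # greedy: take the best one
        if next_token == EOS:
            break
        tokens.append(next_token)  # feed the prediction back in
    return tokens[1:]  # drop <sos> from the output


print(greedy_decode(toy_decoder))
```

The max_len guard is there because nothing forces a model to ever emit the end-of-sentence token.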

Super simple, as most things truly are. It’s just that rarely does anyone have an incentive to make it easier for others, for whatever reason that may be.

I was also interested to see exactly how the data is fed to these transformers and what the exact training procedure looks like (KL divergence over soft targets, i.e. label smoothing, in the case of the original transformer).
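Here is a minimal sketch of what “KL divergence over soft targets” means. Instead of a one-hot target, the correct token gets probability 1 - ε and the remaining ε is spread over the other tokens, and the loss is the KL divergence between that soft target and the model’s predicted distribution. Spreading ε uniformly over the other V - 1 tokens is an assumption of this sketch; implementations differ in the details (e.g. in how they treat the padding token). The function names and toy numbers are made up for illustration.

```python
import math


def smooth_targets(target_index, vocab_size, smoothing=0.1):
    """Soft target: 1 - smoothing on the right token, the rest spread evenly."""
    off_value = smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_index] = 1.0 - smoothing
    return dist


def kl_divergence(target_dist, predicted_dist):
    """KL(target || predicted), skipping zero-probability target entries."""
    return sum(
        t * (math.log(t) - math.log(p))
        for t, p in zip(target_dist, predicted_dist)
        if t > 0.0
    )


targets = smooth_targets(target_index=2, vocab_size=4, smoothing=0.1)

overconfident = [0.01, 0.01, 0.97, 0.01]
matching = list(targets)

print(kl_divergence(targets, overconfident))  # small but positive
print(kl_divergence(targets, matching))       # zero: prediction equals target
```

Notice that an overconfident prediction is penalized a bit even though it picks the right token, which is exactly the regularizing effect of label smoothing.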

I got a much better feeling for different decoding methods like greedy, beam search, sampling, top-k, top-p (nucleus), and most importantly I understood the exact details of the attention mechanism and how to visualize it.
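To make some of those decoding methods concrete, here is a minimal sketch of greedy selection, top-k filtering and top-p (nucleus) filtering applied to a single next-token distribution. Beam search is left out to keep it short. The toy distribution and the function names are assumptions made up for this example.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


next_token_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
rng = random.Random(0)

greedy = max(next_token_probs, key=next_token_probs.get)
print("greedy:", greedy)
print("top-k (k=2):", top_k_filter(next_token_probs, 2))
print("top-p (p=0.9):", top_p_filter(next_token_probs, 0.9))
print("sampled from top-k:", sample(top_k_filter(next_token_probs, 2), rng))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the model is uncertain and fewer when it is confident.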

I had fun along the way, although it was also rough at times. I especially had problems with PyTorch’s torchtext, and with its BucketIterator class specifically. Their torchvision lib for computer vision is far more polished.

Anyways if you’d like to learn more I created a Jupyter Notebook here. It’s really long I know but once you understand it you’ll be in a good position to understand much of the exciting research that’s happening across the deep learning field.

I also did a video where I explained the process I went through developing this project:

How I developed the original transformers project

Finally, if you’d like to get a feeling for what you can do with transformers, do check out HuggingFace’s official and community notebooks and try out their pipelines API. It’s super easy to use and you’ll get some toy projects under your belt.

My next steps

Photo by Jake Hills on Unsplash

Phew, that was a long blog! I, unfortunately, can’t afford (time-wise) to make more than one of these for every deep learning topic that I cover.

If you’ve read my first-ever Medium blog, which I wrote ~2 years ago, you could have expected me to do all of this. I mentioned there that I am a strong believer in mixing theory with practical, hands-on projects.

So what are my next steps?

I’ll start exploring GNNs (Graph Neural Networks) pretty much starting tomorrow! ❤️ And after ~2 months I’ll switch my attention (pun intended) to RL (Reinforcement Learning).

You can definitely expect more YT videos during my exploration of these 2 subfields, at least 2 more (GitHub) projects, and 2 (Medium) blogs like this one.

And I’ll also try changing my DL framework a bit. One look at my GitHub repo and you know that I love PyTorch. I’ll try experimenting with JAX & GNNs.

After I finish that I’ll probably take one more month to cover a couple of topics that interest me: adversarial attacks, VAEs (variational auto-encoders), knowledge distillation and I want to build some foundations in symbolic AI.

That way I’ll (hopefully) have a holistic overview of the whole field.

Finally going longer into the future (longer into the future = 5+ months from now 😜), I’ll:

  • Shift my focus from reimplementing other people’s work (papers) to developing my own creative/funny/useful projects.
  • Start covering SOTA research papers on my YouTube channel as soon as they come out and I’ll also try to cover the most exciting stuff that’s happening/trending in the AI world be it some AI breakthroughs (like I recently covered DeepMind’s AlphaFold2), social topics or something else.
  • Still occasionally write on Medium

Also, there are some plans which I, unfortunately, can’t yet disclose — so stay tuned for those! ❤️

A personal update September, 2021: I recently landed a job at my dream company Google DeepMind! I’ll be working there as a Research Engineer starting December, 2021. That was one of those things I couldn’t disclose back then as I was still working at Microsoft.

A small tangent on the topic of Ph.D. as that’s something I often think about. I don’t have a Ph.D. in ML/deep learning as I didn’t want it. It just doesn’t align with my personal goals.

I think I have much more freedom like this but I’m also 100% aware that not everybody can work in a fully unsupervised manner.

People say Ph.D. gives you freedom. And that’s true — compared to undergrad/grad studies for sure. But I do believe it would be a “lesser freedom” to the thing I have. At least for me.

There are lots of tradeoffs of course and this could be an entirely separate blog post (Chris Olah wrote a beautiful blog on the topic).

Briefly in my opinion the top pros of doing a Ph.D. are:

  • Credentials (or “academic” badge as I like to call it) (oh he finished Stanford, oh he’s been to Cambridge, MIT, <fill in the blank>)
  • Networking with fellow researchers (although you definitely don’t need a Ph.D. to do this, I have a bunch of friends who are researchers at Microsoft, DeepMind, Stanford (just throwing fancy names around))

Top cons:

  • It takes a lot of your best years of life when you’re the most entrepreneurial you’ll ever be (most people at least)
  • No relevant experience in the industry, worse coding/engineering skills (unless you do an industrial Ph.D., which is a nice tradeoff IMO)
  • Less freedom (again relative thing)
  • You trade breadth for depth
  • No money (relatively speaking compared to the industry)

So IMO, if you’re certain that your goal in life is to become a researcher in a particular field then by all means do a Ph.D.; otherwise, you probably should not.

I just feel that many people (some of whom I know) are doing a Ph.D. in deep learning not because they’re interested in pursuing a research career but quite the opposite actually.

They do it to get the highest paying jobs in the industry or to become more attractive to VCs (omg he’s been to Stanford). And that’s probably a good strategy, who am I to judge. But it’s not what a Ph.D. should be about IMO, it’s just hacking/exploiting the current system.

If there is something you would like me to write about — write it down in the comment section or send me a DM, I’d be glad to write more about maths, ML, deep learning, software, landing a job in a big tech company, preparing for ML summer camps, electronics (I actually officially studied this beast), etc., anything that could help you.

Also feel free to drop me a message or:

  1. Connect and reach me on 💡 LinkedIn and Twitter
  2. Subscribe to my 🔔 YouTube channel for AI-related content️
  3. Follow me on 📚 Medium and 💻GitHub
  4. Subscribe to my 📢 monthly AI newsletter and join the 👨‍👩‍👧‍👦 Discord community!

And if you find the content I create useful consider becoming a patron on Patreon!

Much love ❤️