How we did it: upgraded tags system for daily.dev
Over the past few months daily.dev has been scaling and growing super fast. This growth has brought several challenges with it. Most of the challenges are related to our content delivery system.
Today we are introducing a new, revamped tags system for daily.dev! This release should make it easier to:
- Personalize your feed
- Stay updated on particular topics
- Discover high-quality content
What went wrong 🧯
Ever since daily.dev started sourcing content, we relied on the tags provided by the RSS feeds of our content sources. Those tags are usually set by the authors or editors of each article.
As we scaled, problems with this method started to appear in several forms:
- Authors and editors want to optimize for SEO and reach. Generally, that's fine. However, we've been witnessing many cases where the tags had nothing to do with the article's content in our context.
- Tags were not standardized in terms of synonyms. Authors tagged their posts in whatever way was convenient for them. For instance, some authors would tag an article with "vue" while others would use "vuejs".
- Niche tags were super hard to discover since they didn't have enough content associated with them, for example, content about a specific technology within AWS. We realized that most readers would be interested in generalized tags.
All of the above resulted in a tags list of more than 80K tags 😱 Therefore it was (nearly) impossible to filter your feed effectively without missing valuable content.
How did we solve it? 🕹
We understood that we should implement a serious change here to support the current scale. This is what we did:
Step 1: Decided to build an auto-tagging system
We decided to create a proprietary auto-tagging system. This time, we wanted the tagging to be based on the post's actual content, regardless of the tags provided in each source's RSS feed.
Step 2: Experimented with several NLP models
We researched and experimented with several open-source NLP models that would be able to extract the most dominant keywords based on the headlines and content of each post.
Step 3: Optimized the chosen NLP model
We optimized the chosen NLP model for our use case to:
- Limit the number of tags per article.
- Allow tags to be a single word or multiple words.
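To make the two constraints concrete, here is a minimal sketch of content-based tagging. It is not the model we used: it swaps the NLP model for plain phrase counting, and the stop-word list, the `title_weight` parameter, and the function names are all assumptions made for illustration. It only shows how a per-article tag limit and multi-word tags can fit together.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would need a far larger one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with", "on", "is"}


def candidate_phrases(text, max_words=2):
    """Yield single- and multi-word candidates that contain no stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOP_WORDS for tok in gram):
                yield " ".join(gram)


def extract_tags(title, body, max_tags=3, title_weight=3):
    """Score candidates by frequency, weighting the headline more (an assumption)."""
    scores = Counter()
    for phrase in candidate_phrases(title):
        scores[phrase] += title_weight
    for phrase in candidate_phrases(body):
        scores[phrase] += 1
    # Enforce the per-article tag limit.
    return [phrase for phrase, _ in scores.most_common(max_tags)]
```

Calling `extract_tags(title, body, max_tags=3)` returns at most three tags, each of which may be one or two words long.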
Step 4: Retroactively re-tagged our entire articles DB
We then ran the model on our entire article database from the past 3 years. That resulted in 35K keywords that we now had to filter. You can imagine our facial expressions when we saw it and realized how much work was ahead of us 😅
Step 5: Developed a back-office moderation web app
We developed a back-office web app to help us moderate the tags. We then sorted the tags by occurrences (the number of times the model has detected a tag). The web app contains the tag name, its occurrences count, and several sample posts that got this tag by the model.
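The data behind such a moderation view can be sketched roughly as follows. This is an illustration rather than our actual back-office code: the `(title, tags)` input shape, the `build_moderation_queue` name, and the `samples_per_tag` parameter are assumptions.

```python
from collections import Counter, defaultdict


def build_moderation_queue(posts, samples_per_tag=3):
    """Build rows of tag name, occurrences count, and sample posts.

    `posts` is an iterable of (title, tags) pairs (a hypothetical shape).
    Rows are sorted by occurrences, most frequent first.
    """
    occurrences = Counter()
    samples = defaultdict(list)
    for title, tags in posts:
        # dict.fromkeys de-duplicates while keeping order, so a tag
        # counts at most once per post.
        for tag in dict.fromkeys(tags):
            occurrences[tag] += 1
            if len(samples[tag]) < samples_per_tag:
                samples[tag].append(title)
    return [
        {"tag": tag, "occurrences": count, "samples": samples[tag]}
        for tag, count in occurrences.most_common()
    ]
```

Sorting by occurrences puts the tags that affect the most posts at the top of the review list.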
Here's what the web app looks like:
Step 6: Reviewed the new tags list manually
We then started a manual job and reviewed each tag. This process took about two weeks! During those two weeks, we marked each tag with one of three options:
- Allow: Meaning that users would be able to use this tag to filter their feed.
- Deny: Meaning that this tag is irrelevant. However, an article might have both a rejected tag and an approved tag. In such a case it will be shown under the approved tag. In other words, "rejection" doesn't mean we will "block" this content from appearing in daily.dev (that's another problem we plan to solve later).
- Synonym: We first decided on conventions. Based on that, we could easily decide if the tag we see is an "approved" tag or if it's a synonym of an approved tag. If a tag was marked as a synonym it means that our system would re-tag it to the approved tag it was associated with. For example, an article that got the tag "reactjs" would be re-tagged as "react".
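The three options above can be sketched as a small resolution step applied to each post's tags. This is a simplified illustration, not our production code: the `ALLOWED`, `DENIED`, and `SYNONYMS` tables and the `resolve_tags` function are hypothetical, and apart from "reactjs" → "react" the entries are made up for the example.

```python
# Hypothetical moderation tables, as they might look after the manual review.
ALLOWED = {"react", "vue", "javascript"}
DENIED = {"tutorial"}
SYNONYMS = {"reactjs": "react", "vuejs": "vue"}


def resolve_tags(raw_tags):
    """Map the model's raw tags to the approved tags shown to users."""
    resolved = []
    for tag in raw_tags:
        # Synonym: re-tag to the approved tag it was associated with.
        tag = SYNONYMS.get(tag, tag)
        # Deny: drop the tag itself, but keep processing the post's other tags.
        if tag in DENIED:
            continue
        # Allow: keep approved tags, without duplicates. Tags that were never
        # reviewed are dropped as well (an assumption of this sketch).
        if tag in ALLOWED and tag not in resolved:
            resolved.append(tag)
    return resolved
```

A post tagged with both a denied tag and a synonym of an approved tag still ends up under the approved tag, which matches the behavior described for "Deny".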
Lesson: Manually reviewing over 35K tags taught us so much about the content we have in daily.dev. It gave us insights that we couldn't have learned any other way. Sometimes, doing a manual job is the best way for product owners to actually know what they have in their hands.
This process resulted in a little over 200 high-quality tags.
Step 7: Shipped to production
We then took the new tags list and pushed it to production. So what are you waiting for? Go ahead and explore it for yourself:
The new revamped tags list 👀
Sorted from A->Z. Most popular tags are marked in color.
How to filter by tags?
Here you go:
- We had serious content issues, primarily resulting in a useless list of tags.
- We created a new auto-tagging system to deliver an easier experience to discover high-quality content.
- Technology can take you far, but sometimes you should get your hands dirty and do some manual work.