Inside the Pipeline of Daily-Digest: Transforming News into Bite-Sized Summaries

In today's fast-paced world, staying updated with the latest news can be overwhelming. With an abundance of information available across various platforms, it's challenging to sift through the noise and focus on what truly matters. At Daily-Digest, we streamline the news consumption process by delivering concise and informative digests twice a day. Let's take a deep dive into the pipeline that powers Daily-Digest's operations, starting with the basic building blocks.

The Basics

Daily-Digest is built from Node.js applications and a PostgreSQL database. Every component described in this blog post runs on Node.js and interacts with this database in some manner. We use Node.js because it fits the architecture of Daily-Digest rather well: multiple components/microservices, each talking to other web services (such as the database) and working with JSON objects throughout. It's fast to build and run, and the library of node modules at our disposal is a nice plus.

Running on Node.js also lets us use pm2 for management and monitoring, ensuring availability of the pipeline. We covered our usage of pm2 in this blog post.
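To give a rough idea, a pm2 ecosystem file for a pipeline like ours could look like the sketch below; the component names, script paths, and schedule are placeholders, not our actual configuration.

```js
// ecosystem.config.js — hypothetical pm2 setup for two pipeline components.
// Names, paths, and the cron schedule are placeholders.
module.exports = {
  apps: [
    {
      name: 'extractor',
      script: './extractor/index.js',
      autorestart: false,            // batch job: runs to completion, then exits
      cron_restart: '0 5,17 * * *',  // started twice a day (placeholder times)
    },
    {
      name: 'social',
      script: './social/index.js',   // long-running: hosts the auth endpoints
      autorestart: true,
    },
  ],
};
```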

As for the PostgreSQL database, we use a Supabase project, which offers a fully managed PostgreSQL database out of the box. We went with Supabase precisely for this ease of access. Additionally, Supabase's usage monitoring works rather well for us. We generate news articles twice a day, over a fixed set of languages and topics, which gives us very precise control over how much load we put on the database. Since we do not have any spiky traffic to handle, we can make deliberate decisions about how much news we actually want to write out, without breaking the bank.
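Because Supabase is plain PostgreSQL, reading from it in Node.js takes only a few lines with the supabase-js client. Here is a minimal sketch; the table and column names are invented for illustration:

```js
// Minimal sketch using @supabase/supabase-js; table and columns are hypothetical.
const { createClient } = require('@supabase/supabase-js');

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);

async function latestDigest(topic, language) {
  const { data, error } = await supabase
    .from('digests')                          // hypothetical table name
    .select('*')
    .eq('topic', topic)
    .eq('language', language)
    .order('created_at', { ascending: false })
    .limit(1);
  if (error) throw error;
  return data[0];
}
```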

As at any other news outlet, various roles are at play at Daily-Digest. These roles are taken on by the components powering the pipeline:

1. Extractor: The Researcher

The journey begins with the Extractor component, tasked with sourcing news articles from a plethora of sources. Leveraging advanced web scraping techniques, the Extractor scours the internet for relevant content across multiple topics. To do so, we use puppeteer to load news content from the sources we monitor.

Additionally, we created a selector framework to build JavaScript selectors based on the news websites we often use as sources. This allows very precise retrieval of the actual news content from pages that are often overloaded with other, unnecessary information.
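In simplified form, the combination of puppeteer and per-site selectors could look something like the sketch below; the selectors shown are invented examples, not our actual framework.

```js
// Sketch: load a page with puppeteer and pull out the article body
// using a per-site selector map. Selectors are illustrative only.
const puppeteer = require('puppeteer');

const SELECTORS = {
  'example-news.com': 'article .article-body',   // hypothetical
  'another-source.org': 'main .story-content',   // hypothetical
};

async function extractArticle(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Fall back to a generic <article> selector for unknown hosts.
  const host = new URL(url).hostname.replace(/^www\./, '');
  const selector = SELECTORS[host] ?? 'article';
  const text = await page.$eval(selector, el => el.innerText);

  await browser.close();
  return text.trim();
}
```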

Once retrieved, the Extractor cleans up the data, ensuring that only the crux of the news is retained. We remove any excess information, such as teaser headlines for other stories, advertisements, or unrelated meta information, and keep only the thing you are actually interested in: the news.

By extracting articles from multiple distinct sources per news story, the Extractor lays the groundwork for comprehensive coverage. We do so for multiple stories across a variety of topics, such as sports, technology, and of course the top news of the day.

We store one record in our PostgreSQL database per topic and language, twice a day. Each record holds a JSON array, which in turn contains one array per news story with the content of the multiple articles covering that story. We store our sources along with it in an object of the same structure.
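Conceptually, one such record could be shaped like this (field names simplified for illustration):

```js
// Illustrative shape of one Extractor record (per topic and language).
const record = {
  topic: 'technology',
  language: 'en',
  articles: [
    // one inner array per news story, holding the text of the
    // source articles covering that same story
    ['article text from source A...', 'article text from source B...'],
    ['another story, source A...', 'another story, source C...'],
  ],
  sources: [
    ['https://example-news.com/...', 'https://another-source.org/...'],
    ['https://example-news.com/...', 'https://third-source.net/...'],
  ],
};
```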

2. Aggregator: The Writer

With a treasure trove of news articles at its disposal, the Aggregator steps in to distill the essence of each story. Harnessing the power of Large Language Models (LLMs), the Aggregator crafts succinct summaries that capture the key highlights and nuances of the original articles.

Additionally, it crafts compelling headlines and generates hashtags tailored for social media platforms. If you go to our main page right now, these are the headlines you see first.

We make use of various LLMs at Daily-Digest; for example, we utilize Ollama. You can check out our blog post on Ollama setup here.

We made it a key point to not just summarize individual articles, but multiple articles all covering the same story. This way, we report on the “intersection” of the news: the things all sources agree on, which is the essence of what should be reported. We will take a deeper dive into the prompt engineering we performed in another post.
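As a rough sketch of what such a summarization step can look like against Ollama's HTTP API (the model name and prompt below are placeholders, not our production prompt):

```js
// Minimal sketch: summarize several articles covering the same story
// via Ollama's /api/generate endpoint. Requires Node 18+ for built-in fetch.
async function summarizeStory(articles) {
  const prompt =
    'Summarize only the facts that ALL of the following articles agree on:\n\n' +
    articles.map((a, i) => `Article ${i + 1}:\n${a}`).join('\n\n');

  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3',   // placeholder model name
      prompt,
      stream: false,     // return the full response in one JSON object
    }),
  });
  const { response } = await res.json();
  return response;
}
```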

Again, we store this result per topic and language in our database, as an array of news objects, each with a headline, content, and hashtags.
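In simplified form, one stored entry might look like:

```js
// Illustrative shape of one Aggregator record (per topic and language).
const digest = [
  {
    headline: 'Example headline of the day',
    content: 'The distilled summary of the story...',
    hashtags: ['#news', '#technology'],
  },
  // ...one object per news story
];
```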

3. Synthesizer: The News Anchor

In the Synthesizer component, the focus shifts from text to audio as the synthesized content takes center stage. Utilizing Text-to-Speech (TTS) technology, the Synthesizer transforms the summarized articles into audio files reminiscent of a professional news broadcast.

By incorporating Speech Synthesis Markup Language (SSML), it infuses the spoken language with natural intonations and cadences, enhancing the listener's experience. This way, the news is not read in the flat TTS voice familiar from early operating systems, but in clear, human-like speech that you can actually listen to.
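For a flavor of what SSML adds, here is a small, generic snippet; the markup shown is standard SSML, not our exact template:

```xml
<speak>
  Good morning, here are today's top stories.
  <break time="500ms"/>
  <emphasis level="moderate">Our first story:</emphasis>
  <prosody rate="95%">
    The summarized article text is read at a slightly calmer pace.
  </prosody>
</speak>
```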

Furthermore, the Synthesizer creates a dynamic news site featuring the article of the day alongside playable audio clips, offering users a multi-sensory news experience. In fact, this is the site you are currently reading this blog post on!

We keep our generated content indefinitely, with one exception: for the sake of storage, we remove audio files that are older than seven days. The newer ones are stored right next to the webpage files served by our server, ready to be listened to.
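A cleanup job along those lines takes only a few lines of Node.js; the audio directory path below is a placeholder:

```js
// Sketch: delete audio files older than seven days. Path is a placeholder.
const fs = require('fs');
const path = require('path');

const AUDIO_DIR = './public/audio';          // hypothetical location
const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000;  // seven days in milliseconds

for (const file of fs.readdirSync(AUDIO_DIR)) {
  const filePath = path.join(AUDIO_DIR, file);
  const { mtimeMs } = fs.statSync(filePath);
  if (Date.now() - mtimeMs > MAX_AGE_MS) {
    fs.unlinkSync(filePath);
  }
}
```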

4. Social: The Publisher

The final part of the pipeline is the Social component, responsible for amplifying Daily-Digest's reach across various social networking platforms. Tailoring the presentation to each platform's unique characteristics, the Social module disseminates the top news of the day to LinkedIn, X, and Instagram.

While LinkedIn receives full-fledged posts with readable headlines, X and Instagram get concise captions augmented by relevant hashtags, along with one-minute videos showcasing the day's top news, complete with the audio narration sourced from the Synthesizer component.

We noticed that social media platforms differ greatly when it comes to publishing via an API. LinkedIn, for instance, lets you post content directly after authenticating via OAuth, while Instagram first requires you to upload your content as a container and then wait for it to finish processing. We will go into more detail on social media publishing at another time.

In any case, all of these platforms require authentication with your respective account. Hence, we have built authentication endpoints into the Social component, which forward you to the platform in question and, after login, back to the Social component, where access tokens, refresh tokens, or other retrieved credentials are stored for later use. We retrieve these when posting every day, and store the expiration time alongside them so we are notified when we need to reauthenticate.
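In outline, such an endpoint pair can be sketched with Express as below; all URLs, parameter names, and the credential store are simplified placeholders, and each platform's actual OAuth parameters differ.

```js
// Sketch of an OAuth flow in Express: redirect to the platform, then
// store the returned tokens together with their expiration time.
// All URLs and parameter names are illustrative placeholders.
const express = require('express');
const app = express();

// Placeholder: persist credentials to the database (implementation omitted).
async function saveCredentials(platform, credentials) { /* ... */ }

app.get('/auth/:platform', (req, res) => {
  // Forward the user to the platform's consent screen (URL is a placeholder).
  res.redirect('https://platform.example/oauth/authorize?client_id=...&redirect_uri=...');
});

app.get('/auth/:platform/callback', async (req, res) => {
  // Exchange the short-lived code for tokens (endpoint is a placeholder).
  const tokenRes = await fetch('https://platform.example/oauth/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ code: req.query.code /* plus client credentials */ }),
  });
  const { access_token, refresh_token, expires_in } = await tokenRes.json();

  // Persist tokens with their expiration, so the daily posting job
  // can detect when reauthentication is needed.
  await saveCredentials(req.params.platform, {
    access_token,
    refresh_token,
    expires_at: Date.now() + expires_in * 1000,
  });
  res.send('Authenticated.');
});

app.listen(3000);
```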

Conclusion

Putting all of this together, you end up with bite-sized news that is ready to read and listen to, twice a day, with the most important information at your disposal.

In essence, Daily-Digest's pipeline represents a fusion of technology and journalism, contributing to the way we consume news in the digital age. By streamlining the news curation process and enhancing accessibility across various mediums, we want to help you stay informed, in a way that is easy to digest.