Syntax COVID-19 Analysis

The syntax analysis included 75889 abstracts from 100754 published articles. The data were last updated on 2021-03-01 from LitCovid. The most common publication types were Journal Article (59.39%), Journal Article-Review (8.83%), Letter (6.8%), Journal Article-Research Support, Non-U.S. Gov’t (3.86%) and Editorial (3.28%).

Graphic 1. Daily (red) and cumulative (green) publications about COVID-19


The United States (16.2%), China (8.07%), the United Kingdom (7.28%), Italy (5.94%) and India (5%) were the main sources of scientific literature. About 42.49% of the articles analyzed came from these five countries.

Map 1. COVID-19 literature sources. This map was generated with the tmap (version 3.0) and sp (version 1.4-1) R packages


PlatCOVID performed four descriptive syntax analyses on these abstracts:
(1) Word atomization of all abstracts
(2) Categorization based on word atomization
(3) Word atomization of each category
(4) Sentence atomization and human literature curation



Word Atomization Overview

Using the atomization process, 224033 words/terms were found. After excluding 28220 common words, 195813 words remained. The table below shows the top 10 terms. All words are available in the supplementary information.


Box 1. Top 10 words cited in abstracts in the COVID-19 literature.

Word Frequency
pandemic 54183
disease 49770
health 44194
during 36519
study 33816
infection 32435
clinical 30975
care 29374
severe 28116
respiratory 26438
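
As a rough illustration of the word atomization step, the sketch below tokenizes abstracts, drops common words and counts the remaining terms. It is a minimal Python sketch, not the PlatCOVID pipeline: the function names, the regular expression and the short COMMON_WORDS set are assumptions made for this example, and the actual list of 28220 excluded words is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list used only for this example.
# The list of common words excluded in the original analysis is not shown here.
COMMON_WORDS = {"the", "of", "and", "a", "in", "to", "is", "with", "was", "were", "for"}


def atomize(text):
    """Lowercase the text and split it into word tokens (letters, digits, hyphens)."""
    return re.findall(r"[a-z][a-z0-9-]*", text.lower())


def top_terms(abstracts, n=10, common_words=COMMON_WORDS):
    """Count every token that is not a common word and return the n most frequent."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tok for tok in atomize(abstract) if tok not in common_words)
    return counts.most_common(n)


# Toy abstracts, invented for illustration only.
abstracts = [
    "The pandemic of respiratory disease was severe.",
    "Clinical care during the pandemic.",
]
print(top_terms(abstracts, n=3))
```

A Counter keeps the frequency table in one pass over the corpus, so the same function can be reused on the abstracts of a single category.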


Our analysis suggests that the scientific focus, until now, has been on summarizing the main clinical symptoms of COVID-19 (terms: respiratory, clinical, severe, acute, pneumonia, syndrome, symptoms, fever, chest and lung). It can also be inferred that many articles set out to describe the spread of the virus (terms: novel, severe, virus, outbreak, epidemic and spread). Other scientific efforts addressed the transmission, prevention, treatment, health care management and diagnosis of SARS-CoV-2 and COVID-19.



Categorization Process: The 5 Classes of Scientific Interest

Based on the global word tokenization/atomization of the abstracts, we grouped the COVID-19 studies into five categories: (1) clinical, signs and symptoms, (2) epidemiology, (3) transmission, (4) treatment and (5) diagnosis (Fluxogram 1). The categorization process used the MeSH and DeCS term lists.

Fluxogram 1. Workflow of categorization. Click on the square to follow the information.
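
The categorization described above can be pictured as matching each abstract against one term list per category. The following is a minimal Python sketch under stated assumptions: CATEGORY_TERMS holds a few hypothetical example terms, not the MeSH and DeCS lists actually used, and the matching rule (at least one shared token) is an assumption, since the exact rule is not given here.

```python
import re

# Hypothetical term lists for illustration; the real MeSH/DeCS lists are not reproduced.
CATEGORY_TERMS = {
    "clinical, signs and symptoms": {"symptom", "symptoms", "fever", "clinical"},
    "epidemiology": {"epidemiology", "epidemiological", "incidence"},
    "transmission": {"transmission", "spread"},
    "treatment": {"treatment", "therapy"},
    "diagnosis": {"diagnosis", "diagnostic"},
}


def categorize(abstract, category_terms=CATEGORY_TERMS):
    """Return every category whose term list shares at least one token with the abstract."""
    tokens = set(re.findall(r"[a-z]+", abstract.lower()))
    return {name for name, terms in category_terms.items() if tokens & terms}


# An abstract may fall into several categories, or into none.
print(categorize("Fever was the most common symptom; diagnosis relied on PCR."))
```

Returning a set of categories rather than a single label allows an article to belong to several categories at once.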

65 articles fit all five categories. They can be accessed by the following PMIDs: 32112886, 32278065, 32317810, 32347772, 32362969, 32397688, 32447742, 32499983, 32603887, 32605194, 32605661, 32623083, 32636542, 32811406, 32840614, 32881628, 32957928, 32989413, 33014150, 33014984, 33175702, 33186230, 33199136, 33374759, 33442244, 32145185, 32183901, 32185921, 32220177, 32228809, 32271601, 32300673, 32357503, 32442265, 32442720, 32475877, 32498762, 32506768, 32532933, 32534188, 32565599, 32584236, 32591667, 32641059, 32647672, 32679582, 32702935, 32729367, 32730095, 32754600, 32764417, 32773409, 32774008, 32790891, 32934940, 33005276, 33062082, 33080715, 33240881, 33363098, 33490198, 33493922, 33537362, 32297723, 33318893.

Venn 1. Categorizations of abstracts.
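
Articles that fit every category correspond to the central region of a Venn diagram, which is the intersection of the per-category sets of article identifiers. The sketch below shows this in Python; the function name and the small integer identifiers are placeholders invented for the example and are not real PMIDs or the authors' code.

```python
def in_all_categories(ids_by_category):
    """Return the identifiers that appear in every category (set intersection)."""
    sets = [set(ids) for ids in ids_by_category.values()]
    return set.intersection(*sets) if sets else set()


# Placeholder identifiers for illustration only.
ids_by_category = {
    "diagnosis": [1, 2, 3, 4],
    "treatment": [2, 3, 4, 5],
    "epidemiology": [3, 4, 6],
    "transmission": [3, 4, 7],
    "clinical, signs and symptoms": [1, 3, 4],
}
print(sorted(in_all_categories(ids_by_category)))
```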



Word Atomization of Categories

Then, we performed the word atomization on the abstracts of each category. Access the full word atomization report for each category.

Box 2. Top 10 words/terms from the atomization of each category.

Diagnosis (n) Treatment (n) Epidemiology (n) Transmission (n) Signs (n)
disease (8280) treatment (20574) disease (4055) transmission (10980) disease (35523)
diagnosis (7701) disease (17844) clinical (3032) disease (6306) pandemic (33958)
clinical (6654) pandemic (14019) epidemiological (2852) pandemic (5586) clinical (30890)
pandemic (6233) clinical (13684) health (2763) infection (5386) health (28256)
infection (5770) severe (11375) infection (2688) health (5226) study (24810)
study (4948) care (11135) pandemic (2579) during (4266) during (23741)
respiratory (4612) infection (11104) severe (2069) respiratory (4089) infection (23191)
health (4468) respiratory (9839) respiratory (2043) virus (3847) severe (20941)
severe (4417) health (9510) study (1992) risk (3820) care (20286)
during (4373) during (9421) risk (1700) study (3345) respiratory (19519)

Sentence Tokenization

Finally, we performed the sentence tokenization process on the abstracts of each category.
First, we collected the last 4 sentences of each abstract, assumed to be the conclusion of the work, using pubmed.mineR. Around 303556 conclusion sentences were obtained.
Second, we extracted the context sentences for each category term used previously, using the tokenizer. 8542, 16570, 1366, 6844 and 24056 sentences were retrieved for diagnosis, treatment, epidemiology, transmission and clinical, signs and symptoms, respectively. Articles with no context sentence were excluded.
Third, we began the human curation process (Fluxogram 2):

Fluxogram 2. Human curation process from PlatCOVID based on 5 categories. Click on the square to follow the information.
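
The first two steps of the sentence tokenization can be sketched as follows. The original work used the pubmed.mineR R package; this Python version is only an approximation for illustration. The naive splitter (breaking after ".", "!" or "?"), the function names and the example abstract are all assumptions, and a real abstract with abbreviations or decimal numbers would need a more careful sentence tokenizer.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def last_sentences(abstract, n=4):
    """Keep the last n sentences, treated here as the conclusion of the abstract."""
    return split_sentences(abstract)[-n:]


def context_sentences(sentences, terms):
    """Keep only the sentences that contain at least one of the category terms."""
    wanted = {term.lower() for term in terms}
    return [s for s in sentences if wanted & set(re.findall(r"[a-z]+", s.lower()))]


# Invented abstract for illustration only.
abstract = (
    "We studied patients. Fever was common. Cases rose quickly. "
    "Masks reduced spread. Early treatment improved outcomes."
)
conclusion = last_sentences(abstract)
print(context_sentences(conclusion, {"treatment", "therapy"}))
```

An abstract whose conclusion sentences match no category term yields an empty list, which corresponds to the exclusion of articles with no context sentence.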