当前位置：首页 > 编程日记 > 正文

nlp文本数据增强_如何使用Texthero为您的NLP项目准备基于文本的数据集

编程日记 2024-06-24 15:00:00

nlp文本数据增强

Natural Language Processing (NLP) is one of the most important fields of study and research in today’s world. It has many applications in the business sector such as chatbots, sentiment analysis, and document classification.Preprocessing and representing text is one of the trickiest and most annoying parts of working on an NLP project. Text-based datasets can be incredibly thorny and difficult to preprocess. But fortunately, the latest Python package called Texthero can help you solve these challenges.

自然语言处理(NLP)是当今世界上最重要的研究领域之一。它在商业领域有许多应用程序，例如聊天机器人，情感分析和文档分类。预处理和表示文本是NLP项目工作中最棘手，最烦人的部分之一。基于文本的数据集可能非常棘手且难以预处理。但是幸运的是，最新的Python软件包Texthero可以帮助您解决这些挑战。

什么是Texthero？ (What is Texthero?)

Texthero is a simple Python toolkit that helps you work with a text-based dataset. It provides quick and easy functionalities that let you preprocess, represent, map into vectors and visualize text data in just a couple of lines of code.

Texthero是一个简单的Python工具箱，可帮助您处理基于文本的数据集。它提供了快速简便的功能，使您只需几行代码即可进行预处理，表示，映射为向量并可视化文本数据。

Texthero is designed to be used on top of pandas, so it makes it easier to preprocess and analyze text-based Pandas Series or Dataframes.

Texthero被设计用于熊猫的顶部，因此可以更轻松地预处理和分析基于文本的Pandas系列或数据框。

If you are working on an NLP project, Texthero can help you get things done faster than before and gives you more time to focus on important tasks.

如果您正在从事NLP项目，Texthero可以帮助您比以前更快地完成工作，并给您更多时间专注于重要任务。

NOTE: The Texthero library is still in the beta version. You might face some bugs and pipelines might change. A faster and better version will be released and it will bring some major changes.

注意： Texthero库仍处于beta版本。您可能会遇到一些错误，并且管道可能会更改。将会发布更快更好的版本，它将带来一些重大变化。

Texthero概述 (Texthero Overview)

Texthero has four useful modules that handle different functionalities that you can apply in your text-based dataset.

Texthero具有四个有用的模块，这些模块处理可应用于基于文本的数据集中的不同功能。

Preprocessing
前处理
This module allows for the efficient pre-processing of text-based Pandas Series or DataFrames. It has different methods to clean your text dataset such as lowercase(), remove_html_tags() and remove_urls().
该模块允许对基于文本的Pandas Series或DataFrames进行高效的预处理。它有多种清理文本数据集的方法，例如小写()，remove_html_tags()和remove_urls()。
NLP
自然语言处理
This module has a few NLP tasks such as named_entities, noun_chunks, and so on.
该模块具有一些NLP任务，例如named_entities，名词_chunks等。
Representation
表示
This module has different algorithms to map words into vectors such as TF-IDF, GloVe, Principal Component Analysis(PCA), and term_frequency.
该模块具有不同的算法，可将单词映射到向量中，例如TF-IDF，GloVe，主成分分析(PCA)和term_frequency。
Visualization
可视化
The last module has three different methods to visualize the insights and statistics of a text-based Pandas DataFrame. It can plot a scatter plot and word cloud.
最后一个模块提供了三种不同的方法来可视化基于文本的Pandas DataFrame的见解和统计信息。它可以绘制散点图和词云。

安装Texthero (Install Texthero)

Texthero is free, open-source, and well documented. To install it open a terminal and execute the following command:

Texthero是免费的，开源的并且有据可查。要安装它，请打开一个终端并执行以下命令：

pip install texthero

The package uses a lot of other libraries on the back-end such as Gensim, SpaCy, scikit-learn, and NLTK. You don't need to install them all separately, pip will take care of that.

该软件包在后端使用了许多其他库，例如Gensim，SpaCy，scikit-learn和NLTK。您不需要将它们全部单独安装，pip会解决这一问题。

如何使用Texthero (How to use Texthero)

In this article I will use a news dataset to show you how you can use different methods provided by texthero's modules in your own NLP project.

在本文中，我将使用新闻数据集向您展示如何在自己的NLP项目中使用texthero模块提供的不同方法。

We will start by importing important Python packages that we are going to use.

我们将从导入将要使用的重要Python包开始。

#import important packagesimport texthero as hero
import pandas as pd

Then we'll load a dataset from the data directory. The dataset for this article focuses on news in the Swahili Language.

然后，我们将从数据目录中加载数据集。本文的数据集关注斯瓦希里语的新闻。

#load dataset data = pd.read_csv("data/swahili_news_dataset.csv")

Let's look at the top 5 rows of the dataset:

让我们看一下数据集的前5行：

# show top 5 rows data.head()

As you can see, in our dataset we have three columns (id, content, and category). For this article we will focus on the content feature.

如您所见，在我们的数据集中，我们有三列(id，content和category)。对于本文，我们将重点介绍内容功能。

# select news content only and show top 5 rowsnews_content = data[["content"]]
news_content.head()

We have created a new dataframe focused on content only, and then we'll show the top 5 rows.

我们创建了一个仅关注内容的新数据框，然后显示了前5行。

使用Texthero进行预处理。 (Preprocessing with Texthero.)

We can use the clean(). method to pre-process a text-based Pandas Series.

我们可以使用clean()。 预处理基于文本的Pandas系列的方法。

# clean the news content by using clean method from hero packagenews_content['clean_content'] = hero.clean(news_content['content'])

The clean() method runs seven functions when you pass a pandas series. These seven functions are:

传递熊猫系列时， clean()方法将运行七个函数。这七个功能是：

lowercase(s): Lowercases all text.
小写：所有文字均小写。
remove_diacritics(): Removes all accents from strings.
remove_diacritics()：从字符串中删除所有重音符号。
remove_stopwords(): Removes all stop words.
remove_stopwords()：删除所有停用词。
remove_digits(): Removes all blocks of digits.
remove_digits()：删除所有数字块。
remove_punctuation(): Removes all string.punctuation (!"#$%&'()*+,-./:;<=>?@[]^_`{|}~).
remove_punctuation()：删除所有字符串。标点符号(！“＃$％＆'()* +，-。/ :; <=>？@ [] ^ _`{|}〜)。
fillna(s): Replaces unassigned values with empty spaces.
fillna(s)：将未分配的值替换为空格。
remove_whitespace(): Removes all white space between words
remove_whitespace()：删除单词之间的所有空白

Now we can see the cleaned news content.

现在我们可以看到清理的新闻内容。

#show unclean and clean news contentnews_content.head()

定制清洗 (Custom Cleaning)

If the default pipeline from the clean() method does not fit your needs, you can create a custom pipeline with the list of functions that you want to apply in your dataset.

如果clean()方法的默认管道不符合您的需求，则可以使用要在数据集中应用的函数列表创建自定义管道。

As an example, I created a custom pipeline with only 5 functions to clean my dataset.

作为示例，我创建了一个仅包含5个函数的自定义管道来清理数据集。

#create custom pipeline
from texthero import preprocessingcustom_pipeline = [preprocessing.fillna,preprocessing.lowercase,preprocessing.remove_whitespace,preprocessing.remove_punctuation,preprocessing.remove_urls,]

Now I can use the custome_pipeline to clean my dataset.

现在，我可以使用custome_pipeline清理数据集。

#altearnative for custom pipelinenews_content['clean_custom_content'] = news_content['content'].pipe(hero.clean, custom_pipeline)

You can see the clean dataset we have created by using custom pipeline .

您可以看到我们使用自定义管道创建的干净数据集。

# show output of custome pipelinenews_content.clean_custom_content.head()

有用的预处理方法 (Useful preprocessing methods)

Here are some other useful functions from preprocessing modules that you can try to clean you text-based dataset.

这是预处理模块中的一些其他有用功能，您可以尝试清理基于文本的数据集。

删除数字 (Remove digits)

You can use the remove_digits() function to remove digits in your text-based datasets.

您可以使用remove_digits()函数来删除基于文本的数据集中的数字。

text = pd.Series("Hi my phone number is +255 711 111 111 call me at 09:00 am")
clean_text = hero.preprocessing.remove_digits(text)print(clean_text)

output: Hi my phone number is + call me at : am dtype: object

输出：您好我的电话号码是+打电话给我：am dtype：object

删除停用词 (Remove stopwords)

You can use the remove_stopwords() function to remove stopwords in your text-based datasets.

您可以使用remove_stopwords()函数删除基于文本的数据集中的停用词。

text = pd.Series("you need to know NLP to develop the chatbot that you desire")
clean_text = hero.remove_stopwords(text)print(clean_text)

output: need know NLP develop chatbot desire dtype: object

输出：需要了解NLP开发聊天机器人的愿望dtype：对象

删除网址 (Remove URLs)

You can use the remove_urls() function to remove links in your text-based datasets.

您可以使用remove_urls()函数来删除基于文本的数据集中的链接。

text = pd.Series("Go to https://www.freecodecamp.org/news/ to read more articles you like")
clean_text = hero.remove_urls(text)print(clean_text)

output: Go to to read more articles you like dtype: object

输出：转到喜欢的文章dtype：object

标记化 (Tokenize)

Tokenize each row of the given Pandas Series by using the tokenize() method and return a Pandas Series where each row contains a list of tokens.

使用tokenize()方法对给定的熊猫系列的每一行进行标记，并返回一个熊猫系列，其中每一行都包含令牌列表。

text = pd.Series(["You can think of Texthero as a tool to help you understand and work with text-based dataset. "])
clean_text = hero.tokenize(text)print(clean_text)

output: [You, can, think, of, Texthero, as, a, tool, to, help, you, understand, and, work, with, text, based, dataset] dtype: object

输出：[您可以想到Texthero，作为一种工具，以帮助您理解并使用基于文本的数据集] dtype：对象

删除HTML标签 (Remove HTML tags)

You can remove html tags from the given Pandas Series by using the remove_html_tags() method.

您可以使用remove_html_tags()方法从给定的Pandas系列中删除html标签。

text = pd.Series("<html><body><h2>hello world</h2></body></html>")
clean_text = hero.remove_html_tags(text)print(clean_text)

output: hello world dtype: object

输出：hello world dtype：对象

有用的可视化方法 (Useful visualization methods)

Texthero contains different method to visualize insights and statistics of a text-based Pandas DataFrame.

Texthero包含不同的方法来可视化基于文本的Pandas DataFrame的见解和统计信息。

词云 (Wordclouds)

The wordcloud() method from the visualization module plots an image using WordCloud from the word_cloud package.

可视化模块中的wordcloud()方法使用word_cloud包中的WordCloud绘制图像。

#Plot wordcloud image using WordCloud method
hero.wordcloud(news_content.clean_content, max_words=100,)

We passed the dataframe series and number of maximum words (for this example, it is 100 words) in the wordcloud() method.

我们在wordcloud()方法中传递了数据帧序列和最大字数(在本示例中为100个字)。

有用的表示方法 (Useful representation methods)

Texthero contains different methods from the representation module that help you map words into vectors using different algorithms such as TF-IDF, word2vec or GloVe. In this section I will show you how you can use these methods.

Texthero包含来自表示模块的不同方法，可帮助您使用TF-IDF，word2vec或GloVe等不同算法将单词映射为向量。在本节中，我将向您展示如何使用这些方法。

特遣部队 (TF-IDF)

You can represent a text-based Pandas Series using TF-IDF. I created a new pandas series with two pieces of news content and represented them in TF_IDF features by using the tfidf() method.

您可以使用TF-IDF表示基于文本的Pandas系列。我用两个新闻内容创建了一个新的熊猫系列，并使用tfidf ()方法将它们表示为TF_IDF功能。

# Create a new text-based Pandas Series.news = pd.Series(["mkuu wa mkoa wa tabora aggrey mwanri amesitisha likizo za viongozi wote mkoani humo kutekeleza maazimio ya jukwaa la fursa za biashara la mkoa huo", "serikali imetoa miezi sita kwa taasisi zote za umma ambazo hazitumii mfumo wa gepg katika ukusanyaji wa fedha kufanya hivyo na baada ya hapo itafanya ukaguzi na kuwawajibisha"])#convert into tfidf features 
hero.tfidf(news)

output: [0.187132760851739, 0.0, 0.187132760851739, 0.... [0.0, 0.18557550845969953, 0.0, 0.185575508459... dtype: object

输出：[0.187132760851739，0.0，0.187132760851739，0...。[0.0，0.18557550845969953，0.0，0.185575508459 ... dtype：object

NOTE: TF-IDF stands for term frequency-inverse document frequency.

注意： TF-IDF代表术语频率与文档频率成反比。

词频 (Term Frequency)

You can represent a text-based Pandas Series using the term_frequency() method. Term frequency (TF) is used to show how frequently an expression (term or word) occurs in a document or text content.

您可以使用term_frequency()方法表示基于文本的Pandas系列。术语频率(TF)用于显示表达式(术语或单词)在文档或文本内容中出现的频率。

news = pd.Series(["mkuu wa mkoa wa tabora aggrey mwanri amesitisha likizo za viongozi wote mkoani humo kutekeleza maazimio ya jukwaa la fursa za biashara la mkoa huo", "serikali imetoa miezi sita kwa taasisi zote za umma ambazo hazitumii mfumo wa gepg katika ukusanyaji wa fedha kufanya hivyo na baada ya hapo itafanya ukaguzi na kuwawajibisha"])# Represent a text-based Pandas Series using term_frequency.
hero.term_frequency(news)

output: [1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, ... [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, ... dtype: object

输出：[1，0，1，0，1，0，1，0，0，0，0，1，1，0，0，... [0，1，0，1，0，1，0 ，1，1，1，1，0，0，1，1，... dtype：对象

K均值 (K-means)

Texthero can perform K-means clustering algorithm by using the kmeans() method. If you have an unlabeled text-based dataset, you can use this method to group content according to their similarities.

Texthero可以使用kmeans()方法执行K-means聚类算法。如果您有未标记的基于文本的数据集，则可以使用此方法根据内容的相似性对内容进行分组。

In this example, I will create a new pandas dataframe called news with the following columns content,tfidf and kmeans_labels.

在此示例中，我将创建一个名为news的新的pandas数据框，其中包含以下列content，tfidf和kmeans_labels。

column_names = ["content","tfidf", "kmeans_labels"]news = pd.DataFrame(columns = column_names)

We will use only the first 30 pieces of cleaned content from our news_content dataframe and cluster them into groups by using the kmeans() method.

我们将仅使用news_content数据帧中的前30个已清理内容，并使用kmeans()方法将它们聚类为组。

# collect 30 clean content.
news["content"] = news_content.clean_content[:30]# convert them into tf-idf features.
news['tfidf'] = (news['content'].pipe(hero.tfidf)
)# perform clustering algorithm by using kmeans() 
news['kmeans_labels'] = (news['tfidf'].pipe(hero.kmeans, n_clusters=5).astype(str)
)

In the above source code, in the pipeline of the k-means method we passed the number of clusters which is 5. This means we will group these contents into 5 groups.

在上面的源代码中，在k-means方法的流水线中，我们传递的簇数为5。这意味着我们将这些内容分为5组。

Now the selected news content has been labeled into five groups.

现在，所选新闻内容已被标记为五个组。

# show content and their labels
news[["content","kmeans_labels"]].head()

PCA (PCA)

You can also use the pca() method to perform principal component analysis on the given Pandas Series. Principal component analysis (PCA) is a technique for reducing the dimensionality of your datasets. This increases interpretability but at the same time minimizes information loss.

您也可以使用pca()方法对给定的Pandas系列进行主成分分析。 主成分分析 ( PCA )是一种用于降低数据集维数的技术。这增加了可解释性，但同时最大程度地减少了信息丢失。

In this example we use the tfidf features from the news dataframe and represent them into two components by using the pca() method. Finally we will show a scatterplot by using the scatterplot() method.

在此示例中，我们使用新闻数据帧中的tfidf功能，并使用pca()方法将它们表示为两个组件。最后，我们将使用scatterplot()方法显示一个散点图。

#perform pca
news['pca'] = news['tfidf'].pipe(hero.pca)#show scatterplot
hero.scatterplot(news, 'pca', color='kmeans_labels', title="news")

结语 (Wrap up)

In this article, you've learned the basics of how to use the Texthero toolkit Python package in your NLP project. You can learn more about the methods available in the documentation.

在本文中，您学习了如何在NLP项目中使用Texthero工具包Python包的基础知识。您可以在文档中了解有关可用方法的更多信息。

You can download the dataset and notebook used in this article here: https://github.com/Davisy/Texthero-Python-Toolkit .

您可以在此处下载本文中使用的数据集和笔记本： https : //github.com/Davisy/Texthero-Python-Toolkit 。

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid

如果您学习了新知识或喜欢阅读本文，请与他人分享，以便其他人可以看到。在那之前，在下一篇文章中见！也可以通过Twitter @Davis_McDavid与我联系

翻译自: https://www.freecodecamp.org/news/how-to-work-and-understand-text-based-dataset-with-texthero/