Do you know spaCy? If not, then something must be fishy!
In NLP (natural language processing), there are many tools available on the web to work with, but spaCy is one of a kind. It has some very useful tools for extracting meaningful information from raw text.
spaCy features neural models for tagging, parsing and entity recognition, which have been built from scratch.
How to install spaCy:
pip install spacy, or pip install -U spacy, or pip install --user spacy (depending on your system settings)
After installation you need to download a language model. Several language models are available in spaCy: English, German, Spanish, Portuguese, Dutch, French, Italian and multi-language. We are usually interested in the English model, and the command to download it is:
python -m spacy download en
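How to import spaCy and load the language model:
import spacy
nlp = spacy.load('en')
Note: spaCy v3 and later no longer accept the 'en' shortcut. There, use the full model name in both steps, for example python -m spacy download en_core_web_sm and spacy.load('en_core_web_sm').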
Like other NLP and text-processing packages, spaCy can be used for the following tasks:
Many other methods for cleaning and normalising text
Of the tasks mentioned above, I have found entity recognition and dependency parsing to be the two most important tools spaCy offers, and it performs both very accurately.
This post mostly covers these two topics.
Named-entity recognition (NER), or simply entity recognition, is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, places, quantities, monetary values, percentages and more. It can also be domain specific.
Say the input string is:
s = """Hello, I am Atanu. I am from India. I work at Ayata as a Data Scientist. My salary is around $260 per day."""
doc = nlp(s)
Iterating over doc.sents shows that spaCy detects the sentences in the string correctly.
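The sentence split can be checked with a few lines. This is a minimal sketch, assuming spaCy v3. So that it runs without a downloaded model, it swaps the trained pipeline for a blank English pipeline plus the rule-based sentencizer component. That is a stand-in, not how the trained model finds sentence boundaries, but the doc.sents loop is the same either way.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline only tokenizes, so the rule-based
# "sentencizer" is added to mark sentence boundaries at punctuation such as ".".
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

s = ("Hello, I am Atanu. I am from India. "
     "I work at Ayata as a Data Scientist. "
     "My salary is around $260 per day.")
doc = nlp(s)

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

With the trained model loaded earlier, doc = nlp(s) followed by the same loop should work without adding the sentencizer.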
Let's see which attributes are available on a token object from our doc.
For each token (word) in the doc there are many attributes, such as idx (the character offset of the token in the text), lemma_ (the lemmatized form of the word), is_punct (whether the token is punctuation), pos_ (part of speech) and many more.
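A small sketch of how these attributes can be inspected, assuming spaCy v3. It uses a blank English pipeline so that it runs without a model download. The tokenizer-level attributes (idx, is_punct) are therefore populated, while lemma_ and pos_ are only filled in by a trained pipeline such as the one loaded earlier.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline has only a tokenizer, so idx and
# is_punct are set, while lemma_ and pos_ need a trained pipeline.
nlp = spacy.blank("en")
doc = nlp("Hello, I am Atanu.")

# One row per token: text, character offset, punctuation flag, lemma, POS.
rows = [(token.text, token.idx, token.is_punct, token.lemma_, token.pos_)
        for token in doc]
for row in rows:
    print(row)

# dir() lists every public attribute and method a Token exposes.
public_names = [name for name in dir(doc[0]) if not name.startswith("_")]
print(public_names)
```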
For NER, the basic code looks like the following:
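A minimal sketch of reading entities from a doc, assuming spaCy v3. To keep it runnable without a downloaded model, a rule-based entity_ruler with three illustrative patterns stands in for the trained ner component. The patterns and labels here are assumptions chosen for this example, not model output.

```python
import spacy

# Assumption: spaCy v3. The entity_ruler is a rule-based stand-in for the
# trained "ner" component; the patterns below are illustrative only.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Atanu"},
    {"label": "GPE", "pattern": "India"},
    {"label": "ORG", "pattern": "Ayata"},
])

doc = nlp("Hello, I am Atanu. I am from India. "
          "I work at Ayata as a Data Scientist.")

# Each entity is a Span carrying its text, character offsets and label.
entities = [(ent.text, ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents]
for entity in entities:
    print(entity)
```

With the trained pipeline, doc = nlp(s) fills doc.ents from the model's predictions, and the loop over doc.ents stays the same.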
spaCy detects most of the entities in this string properly. To understand in more detail what each named entity label means, you can refer to the documentation.
Note: spaCy ships with a built-in visualizer named displaCy, which can render the detected named entities for a string or a list of sentences as attractive visuals in Jupyter.
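A hedged sketch of the displaCy call, assuming spaCy v3. The entities are set by hand on a short example sentence so the snippet does not depend on a downloaded model. With a trained pipeline they would come from the model instead.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I am Atanu and I work at Ayata.")

# Hand-set entities (token 2 and token 7) stand in for model predictions.
doc.ents = [Span(doc, 2, 3, label="PERSON"), Span(doc, 7, 8, label="ORG")]

# jupyter=False returns the HTML markup as a string instead of displaying it.
html = displacy.render(doc, style="ent", jupyter=False)
print(html)
```

Inside a notebook, passing jupyter=True displays the highlighted text inline instead of returning the markup.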
This looks nice, right?
A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and the words that modify those heads. The most widely used syntactic structure is the parse tree, which can be generated using parsing algorithms. Parse trees are useful in applications like grammar checking and, more importantly, play a critical role in the semantic analysis stage.
The spaCy dependency parser provides token properties to navigate the generated dependency parse tree. The parse tree has all the properties of a tree: each child token has exactly one head token, although a head token can have multiple children.
Say the input sentence is 'I work at Ayata as DataScientist'. The dependency of each word on its head word looks somewhat like the following:
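The navigation properties can be tried out with a sketch like this one, assuming spaCy v3, where a Doc can be built with explicit heads and deps. The parse is annotated by hand for illustration, with "work" as the root. It is not the parser's actual output, which may use different labels.

```python
import spacy
from spacy.tokens import Doc

# Assumption: spaCy v3, where Doc accepts heads (absolute token indices)
# and deps. These annotations are written by hand, not predicted.
nlp = spacy.blank("en")
words = ["I", "work", "at", "Ayata", "as", "DataScientist"]
heads = [1, 1, 1, 2, 1, 4]  # the root ("work") points at itself
deps = ["nsubj", "ROOT", "prep", "pobj", "prep", "pobj"]
pos = ["PRON", "VERB", "ADP", "PROPN", "ADP", "NOUN"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

# One row per token: the word, its relation, its head and its children.
rows = [(t.text, t.dep_, t.head.text, [c.text for c in t.children])
        for t in doc]
for row in rows:
    print(row)

# The root is the token whose dependency label is ROOT.
root = [t for t in doc if t.dep_ == "ROOT"][0]
print(root.text, root.pos_)
```

With a trained pipeline, doc = nlp('I work at Ayata as DataScientist') replaces the hand-built Doc, and token.dep_, token.head and token.children are read in the same way.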
The root token is 'work', which is itself a verb.
Below is the rendered version of the dependency tree.
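A rendering like this can be produced with displaCy's dependency style. The sketch below assumes spaCy v3 and again uses a hand-annotated Doc (an illustrative parse, not model output) so that it runs without a downloaded model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Assumption: spaCy v3. The parse below is annotated by hand for illustration.
nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["I", "work", "at", "Ayata", "as", "DataScientist"],
    heads=[1, 1, 1, 2, 1, 4],
    deps=["nsubj", "ROOT", "prep", "pobj", "prep", "pobj"],
    pos=["PRON", "VERB", "ADP", "PROPN", "ADP", "NOUN"],
)

# jupyter=False returns the SVG markup as a string instead of displaying it.
svg = displacy.render(doc, style="dep", jupyter=False)
print(len(svg))
```

In a notebook, jupyter=True draws the arcs inline. The returned string can also be written to an .svg file and opened in a browser.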
If we get the chance in the near future, we will discuss other details of spaCy and explore some of its other features.
For further reading, please check out the official spaCy website.