Categorizing Wikipedia Articles - Upwork

Status: CLOSED
Budget: 501-1000 EUR
Created: 2019-02-27
Ends: 2019-03-06
Bids: None
Description: Greetings. I have accumulated a collection of ~40,000 random, uncategorized Wikipedia article URLs.

I'd like to sort these URLs and assign each one to its respective category.


I have established some general parent-categories which I feel the articles should fall under.


Architecture, Arts, Film and Music

Communication, Education and Literature

Companies and Organizations

Economics and Finance

Energy and Environment

Food and Drink

Geography and Places

Health and Medicine

Law and Politics

Mathematics

Media (Books, Movies and TV)

People

Philosophy, Religion and Spirituality

Psychology

Recreation and Sports

Science and Technology

Social Science (Anthropology, History and Sociology)


These are just the parent categories; each article should also be sorted into a sub-category (Example: in Mathematics - Probability, Geometry, etc.; in Geography - Cities, National Parks, Islands, etc.; in Religion - Buddhism, Judaism, etc.; in Technology - Networking, AI, etc.; in People - Business, Sports, Politics, etc.)


The URLs are in a plain text format (.txt) and the output can be the same.
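As an illustration of the input side, here is a minimal Python sketch that reads URLs from plain text and recovers the article title from each one. The function names `read_urls` and `title_from_url` are hypothetical, and the duplicate line in the sample is contrived to show de-duplication; the posting does not say whether the real list contains duplicates.

```python
from urllib.parse import unquote, urlsplit

WIKI_PREFIX = "/wiki/"


def title_from_url(url: str) -> str:
    """Return the article title encoded in a .../wiki/<Title> URL."""
    path = urlsplit(url.strip()).path
    if not path.startswith(WIKI_PREFIX):
        raise ValueError(f"not a Wikipedia article URL: {url!r}")
    # Percent-escapes are decoded and underscores become spaces.
    return unquote(path[len(WIKI_PREFIX):]).replace("_", " ")


def read_urls(lines):
    """Yield unique, non-blank URLs in first-seen order."""
    seen = set()
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            yield url


# Contrived sample: blank lines and one repeated URL.
sample = """https://en.wikipedia.org/wiki/Alliteration

https://en.wikipedia.org/wiki/Authenticity_(philosophy)
https://en.wikipedia.org/wiki/Alliteration
"""
urls = list(read_urls(sample.splitlines()))
titles = [title_from_url(u) for u in urls]
print(titles)
```

For the real file, `read_urls(open(path, encoding="utf-8"))` would work the same way, since a file object iterates line by line.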


------------------------


Example


Uncategorized:


https://en.wikipedia.org/wiki/Alliteration

https://en.wikipedia.org/wiki/Authenticity_(philosophy)

https://en.wikipedia.org/wiki/Bull_spread

https://en.wikipedia.org/wiki/Convertibility

https://en.wikipedia.org/wiki/Currency_transaction_tax

https://en.wikipedia.org/wiki/Damien_Hirst

https://en.wikipedia.org/wiki/Dell

https://en.wikipedia.org/wiki/Didi_Chuxing

https://en.wikipedia.org/wiki/Endowment_policy

https://en.wikipedia.org/wiki/Envelope_journalism

https://en.wikipedia.org/wiki/Georg_Wilhelm_Friedrich_Hegel

https://en.wikipedia.org/wiki/Impermanence

https://en.wikipedia.org/wiki/John_Boehner

https://en.wikipedia.org/wiki/Malaria

https://en.wikipedia.org/wiki/Mark_Beaumont

https://en.wikipedia.org/wiki/Red_Rocks_Park

https://en.wikipedia.org/wiki/Robert_Kraft

https://en.wikipedia.org/wiki/Salar_de_Uyuni

https://en.wikipedia.org/wiki/Santosha

https://en.wikipedia.org/wiki/Sleep_hygiene

https://en.wikipedia.org/wiki/Stoicism

https://en.wikipedia.org/wiki/TD_Ameritrade

https://en.wikipedia.org/wiki/Vice_Media

https://en.wikipedia.org/wiki/Yosemite_Valley


------------------------


Categorized:


[Communication - Journalism]

https://en.wikipedia.org/wiki/Envelope_journalism


[Communication - Literature]

https://en.wikipedia.org/wiki/Alliteration



[Companies - Financial]

https://en.wikipedia.org/wiki/TD_Ameritrade


[Companies - Media]

https://en.wikipedia.org/wiki/Vice_Media


[Companies - Technology]

https://en.wikipedia.org/wiki/Dell


[Companies - Transport]

https://en.wikipedia.org/wiki/Didi_Chuxing



[Finance - Foreign Exchange]

https://en.wikipedia.org/wiki/Convertibility


[Finance - Insurance]

https://en.wikipedia.org/wiki/Endowment_policy


[Finance - Options]

https://en.wikipedia.org/wiki/Bull_spread


[Finance - Taxation]

https://en.wikipedia.org/wiki/Currency_transaction_tax



[Health - Diseases]

https://en.wikipedia.org/wiki/Malaria


[Health - Sleep]

https://en.wikipedia.org/wiki/Sleep_hygiene



[Geography - Parks]

https://en.wikipedia.org/wiki/Red_Rocks_Park


[Geography - Salt Flats]

https://en.wikipedia.org/wiki/Salar_de_Uyuni


[Geography - Valleys]

https://en.wikipedia.org/wiki/Yosemite_Valley



[People - Artist]

https://en.wikipedia.org/wiki/Damien_Hirst


[People - Businessmen]

https://en.wikipedia.org/wiki/Robert_Kraft


[People - Philosopher]

https://en.wikipedia.org/wiki/Georg_Wilhelm_Friedrich_Hegel


[People - Politics]

https://en.wikipedia.org/wiki/John_Boehner


[People - Sports]

https://en.wikipedia.org/wiki/Mark_Beaumont



[Philosophy]

https://en.wikipedia.org/wiki/Stoicism


[Philosophy - Concepts]

https://en.wikipedia.org/wiki/Authenticity_(philosophy)



[Religion - Buddhism]

https://en.wikipedia.org/wiki/Impermanence


[Religion - Hinduism]

https://en.wikipedia.org/wiki/Santosha


------------------------


The above example has to be applied to ~40,000 URLs. Avoiding over-categorization is a must. Strive to keep the sub-categories broad and most relevant.
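
One possible way to guard against over-categorization, sketched here in Python, is to fold any sub-category with too few members back into its parent. The `min_size` threshold, the function name `collapse_rare_subcategories`, and the placeholder URLs are all assumptions for illustration; the posting does not define how broad a sub-category must be.

```python
from collections import Counter


def collapse_rare_subcategories(assignments, min_size=5):
    """Move URLs from sub-categories with fewer than min_size members
    up to the bare parent category (sub becomes None)."""
    counts = Counter((parent, sub) for _, parent, sub in assignments if sub)
    result = []
    for url, parent, sub in assignments:
        if sub and counts[(parent, sub)] < min_size:
            sub = None
        result.append((url, parent, sub))
    return result


# Placeholder data, not taken from the real list.
data = [
    ("url-a", "Geography", "Parks"),
    ("url-b", "Geography", "Parks"),
    ("url-c", "Geography", "Salt Flats"),
]
collapsed = collapse_rare_subcategories(data, min_size=2)
print(collapsed)
```

With `min_size=2`, the single "Salt Flats" entry moves up to plain "Geography" while the two "Parks" entries keep their sub-category.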



Here is another example of a categorized set:


http://git.macropus.org/wikipedia-categories

www.github.com/hubgit/wikipedia-categories



While researching ways to execute this task myself, I came across a few links which may be useful:


https://tools.wmflabs.org/mormegil/catsuggest

www.wikidata.org



I don't know what the best approach to this task is, so kindly propose and demonstrate your method.
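
As one possible starting point (not a method named in this posting), the MediaWiki Action API on en.wikipedia.org can return the categories Wikipedia editors have already assigned to each article. The Python sketch below only builds the request URLs and makes no network calls; the helper names `batched` and `category_query_url` are hypothetical, and the batch size of 50 reflects the API's per-request title limit for regular clients.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # the API caps `titles` at 50 per request for regular clients


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def category_query_url(titles):
    """Build an Action API URL asking for the visible categories of each title."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": "|".join(titles),
        "cllimit": "max",
        "clshow": "!hidden",  # skip hidden maintenance categories
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


titles = ["Malaria", "Stoicism", "Bull spread"]
request_urls = [category_query_url(batch) for batch in batched(titles, BATCH_SIZE)]
print(request_urls[0])
```

At this batch size, ~40,000 titles would need roughly 800 requests, and a response may carry a `continue` field that has to be followed to get the remaining categories. Wikipedia's own categories are far more fine-grained than the parent list above, so a mapping step onto those parents would still be needed.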


Kindly contact me if any further clarification is needed.


Thank you for your interest. Good day!

Budget: $400

Posted On: February 27, 2019 05:00 UTC
Category: Data Science & Analytics > Other - Data Science & Analytics

Skills: Data Entry, Data Mining, Data Scraping, Natural Language Processing, Wikipedia

Job Type(s):
  • PHP
  • CSS
Database:
Operating System: Linux
Number of Bids: 0