Categorizing Wikipedia Articles - Upwork
Status | CLOSED |
Budget | 501-1000 EUR |
Created: | 2019-02-27 |
Ends: | 2019-03-06 |
Bids: | None |
Description: | Greetings. I have accumulated a collection of around 40,000 random, uncategorized Wikipedia article URLs. I'd like to sort these URLs and assign each one to its respective category. I have established some general parent categories that I feel the articles should fall under:

Architecture, Arts, Film and Music
Communication, Education and Literature
Companies and Organizations
Economics and Finance
Energy and Environment
Food and Drink
Geography and Places
Health and Medicine
Law and Politics
Mathematics
Media (Books, Movies and TV)
People
Philosophy, Religion and Spirituality
Psychology
Recreation and Sports
Science and Technology
Social Science (Anthropology, History and Sociology)

These are just the parent categories; each article should then be sorted into a sub-category as well (for example: in Mathematics - Probability, Geometry, etc.; in Geography - Cities, National Parks, Islands, etc.; in Religion - Buddhism, Judaism, etc.; in Technology - Networking, AI, etc.; in People - Business, Sports, Politics, etc.). The URLs are in plain text format (.txt), and the output can be in the same format.

------------------------
Example

Uncategorized:
https://en.wikipedia.org/wiki/Alliteration
https://en.wikipedia.org/wiki/Authenticity_(philosophy)
https://en.wikipedia.org/wiki/Bull_spread
https://en.wikipedia.org/wiki/Convertibility
https://en.wikipedia.org/wiki/Currency_transaction_tax
https://en.wikipedia.org/wiki/Damien_Hirst
https://en.wikipedia.org/wiki/Dell
https://en.wikipedia.org/wiki/Didi_Chuxing
https://en.wikipedia.org/wiki/Endowment_policy
https://en.wikipedia.org/wiki/Envelope_journalism
https://en.wikipedia.org/wiki/Georg_Wilhelm_Friedrich_Hegel
https://en.wikipedia.org/wiki/Impermanence
https://en.wikipedia.org/wiki/John_Boehner
https://en.wikipedia.org/wiki/Malaria
https://en.wikipedia.org/wiki/Mark_Beaumont
https://en.wikipedia.org/wiki/Red_Rocks_Park
https://en.wikipedia.org/wiki/Robert_Kraft
https://en.wikipedia.org/wiki/Salar_de_Uyuni
https://en.wikipedia.org/wiki/Santosha
https://en.wikipedia.org/wiki/Sleep_hygiene
https://en.wikipedia.org/wiki/Stoicism
https://en.wikipedia.org/wiki/TD_Ameritrade
https://en.wikipedia.org/wiki/Vice_Media
https://en.wikipedia.org/wiki/Yosemite_Valley
------------------------
Categorized:
[Communication - Journalism]
https://en.wikipedia.org/wiki/Envelope_journalism
[Communication - Literature]
https://en.wikipedia.org/wiki/Alliteration
[Companies - Financial]
https://en.wikipedia.org/wiki/TD_Ameritrade
[Companies - Media]
https://en.wikipedia.org/wiki/Vice_Media
[Companies - Technology]
https://en.wikipedia.org/wiki/Dell
[Companies - Transport]
https://en.wikipedia.org/wiki/Didi_Chuxing
[Finance - Foreign Exchange]
https://en.wikipedia.org/wiki/Convertibility
[Finance - Insurance]
https://en.wikipedia.org/wiki/Endowment_policy
[Finance - Options]
https://en.wikipedia.org/wiki/Bull_spread
[Finance - Taxation]
https://en.wikipedia.org/wiki/Currency_transaction_tax
[Health - Diseases]
https://en.wikipedia.org/wiki/Malaria
[Health - Sleep]
https://en.wikipedia.org/wiki/Sleep_hygiene
[Geography - Parks]
https://en.wikipedia.org/wiki/Red_Rocks_Park
[Geography - Salt Flats]
https://en.wikipedia.org/wiki/Salar_de_Uyuni
[Geography - Valleys]
https://en.wikipedia.org/wiki/Yosemite_Valley
[People - Artist]
https://en.wikipedia.org/wiki/Damien_Hirst
[People - Businessmen]
https://en.wikipedia.org/wiki/Robert_Kraft
[People - Philosopher]
https://en.wikipedia.org/wiki/Georg_Wilhelm_Friedrich_Hegel
[People - Politics]
https://en.wikipedia.org/wiki/John_Boehner
[People - Sports]
https://en.wikipedia.org/wiki/Mark_Beaumont
[Philosophy]
https://en.wikipedia.org/wiki/Stoicism
[Philosophy - Concepts]
https://en.wikipedia.org/wiki/Authenticity_(philosophy)
[Religion - Buddhism]
https://en.wikipedia.org/wiki/Impermanence
[Religion - Hinduism]
https://en.wikipedia.org/wiki/Santosha
------------------------
The above example has to be applied to ~40,000 URLs. Avoiding over-categorization is a must: keep the sub-categories broad and as relevant as possible.

Here is another example of a categorized set:
http://git.macropus.org/wikipedia-categories
www.github.com/hubgit/wikipedia-categories

While researching ways to carry out this task myself, I came across a few links that may be useful:
https://tools.wmflabs.org/mormegil/catsuggest
www.wikidata.org

I don't know the best approach to this task, so kindly propose and demonstrate your method. Please contact me if any further clarification is needed. Thank you for your interest. Good day!

Budget: $400
Posted On: February 27, 2019 05:00 UTC
Category: Data Science & Analytics > Other - Data Science & Analytics
Skills: Data Entry, Data Mining, Data Scraping, Natural Language Processing, Wikipedia |
Job Type(s): |
|
Database: | |
Operating System: | Linux |
Number of Bids: | 0 |
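
The description asks for plain-text output in which each `[Parent - Sub-category]` label is followed by its URLs. A minimal Python sketch of that output step is given below. It assumes the category assignment itself has already been made by some other method, which the posting leaves open. The function names `article_title` and `render_categorized` are hypothetical and do not come from the posting, and the alphabetical ordering of labels is an assumption (the posting's own example is only roughly alphabetical).

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple
from urllib.parse import unquote, urlparse

# One assignment: (article URL, parent category, optional sub-category).
Assignment = Tuple[str, str, Optional[str]]


def article_title(url: str) -> str:
    """Return the readable article title encoded in a Wikipedia URL."""
    path = urlparse(url).path              # e.g. "/wiki/Salar_de_Uyuni"
    slug = path.rsplit("/wiki/", 1)[-1]    # keep the part after "/wiki/"
    return unquote(slug).replace("_", " ")


def render_categorized(assignments: Iterable[Assignment]) -> str:
    """Group URLs under "[Parent - Sub]" labels, one label or URL per line."""
    groups = defaultdict(list)
    for url, parent, sub in assignments:
        # An empty sub-category sorts before named ones, so "[Philosophy]"
        # precedes "[Philosophy - Concepts]" as in the posting's example.
        groups[(parent, sub or "")].append(url)

    lines = []
    for parent, sub in sorted(groups):
        label = f"[{parent} - {sub}]" if sub else f"[{parent}]"
        lines.append(label)
        lines.extend(sorted(groups[(parent, sub)]))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("https://en.wikipedia.org/wiki/Stoicism", "Philosophy", None),
        ("https://en.wikipedia.org/wiki/Malaria", "Health", "Diseases"),
        ("https://en.wikipedia.org/wiki/Authenticity_(philosophy)",
         "Philosophy", "Concepts"),
    ]
    print(render_categorized(sample))
```

Keeping the title extraction separate from the rendering means the title can be passed to whatever classification method is eventually chosen, while the output writer stays unchanged.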